Re: how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Harsh J
Jane,

Yes, and that's documented:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)

"The framework will reuse the key and value objects that are passed
into the reduce, therefore the application should clone the objects
they want to keep a copy of."
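
For example, a minimal sketch of that cloning (assuming the new mapreduce
API and that Value is a Writable; Key and Value are placeholder type names):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.WritableUtils;

  public void reduce(Key key, Iterable<Value> values, Context context)
      throws IOException, InterruptedException {
    List<Value> kept = new ArrayList<Value>();
    for (Value v : values) {
      // clone, because the framework reuses the same Value instance
      kept.add(WritableUtils.clone(v, context.getConfiguration()));
    }
    // kept.get(0), kept.get(1), ... are now distinct objects
  }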

On Fri, Apr 6, 2012 at 6:26 AM, Jane Wayne  wrote:
> i found out what my problem was. apparently, when you iterate over the
> Iterable<Value> values, the same Value instance is reused over and
> over. for example, in my reducer,
>
> public void reduce(Key key, Iterable<Value> values, Context context)
>     throws IOException, InterruptedException {
>   Iterator<Value> it = values.iterator();
>   Value a = it.next();
>   Value b = it.next();
> }
>
> the variables a and b, both of type Value, end up being the same object
> instance! i suppose the iterator behaves this way to optimize
> iteration and avoid the new operator.
>
>
>
> On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne  wrote:
>> i am currently testing my map reduce job on Windows + Cygwin + Hadoop
>> v0.20.205. for some strange reason, the list of values (i.e. the
>> Iterable<Value> values) going into the reducer looks all wrong. i have
>> tracked the map reduce process with logging statements (i.e. logged
>> the input to the map, logged the output from the map, logged the
>> partitioner, logged the input to the reducer). at all stages,
>> everything looks correct except at the reducer.
>>
>> is there any way (using Windows + Cygwin) to view the local map
>> outputs before they are shuffled/sorted to the reducer? i need to know
>> why the values are incorrect.



-- 
Harsh J


Re: how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Jane Wayne
i found out what my problem was. apparently, when you iterate over the
Iterable<Value> values, the same Value instance is reused over and
over. for example, in my reducer,

public void reduce(Key key, Iterable<Value> values, Context context)
    throws IOException, InterruptedException {
  Iterator<Value> it = values.iterator();
  Value a = it.next();
  Value b = it.next();
}

the variables a and b, both of type Value, end up being the same object
instance! i suppose the iterator behaves this way to optimize
iteration and avoid the new operator.



On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne  wrote:
> i am currently testing my map reduce job on Windows + Cygwin + Hadoop
> v0.20.205. for some strange reason, the list of values (i.e. the
> Iterable<Value> values) going into the reducer looks all wrong. i have
> tracked the map reduce process with logging statements (i.e. logged
> the input to the map, logged the output from the map, logged the
> partitioner, logged the input to the reducer). at all stages,
> everything looks correct except at the reducer.
>
> is there any way (using Windows + Cygwin) to view the local map
> outputs before they are shuffled/sorted to the reducer? i need to know
> why the values are incorrect.


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Charles Earl
Also bear in mind that there is a kind of detour involved, in the sense that a 
pipes map must send key/value data back to the Java process and then on to the 
reduce (more or less). 
I think that the Hadoop C++ Extension (HCE, there is a patch) is supposed to be 
faster. 
Would be interested to know if the community has any experience with HCE 
performance.
C

On Apr 5, 2012, at 3:49 PM, Robert Evans  wrote:

> Both streaming and pipes do very similar things.  They will fork/exec a 
> separate process that is running whatever you want it to run.  The JVM that 
> is running hadoop then communicates with this process to send the data over 
> and get the processing results back.  The difference between streaming and 
> pipes is that streaming uses stdin/stdout for this communication so 
> preexisting processing like grep, sed and awk can be used here.  Pipes uses a 
> custom protocol with a C++ library to communicate.  The C++ library is tagged 
> with SWIG compatible data so that it can be wrapped to have APIs in other 
> languages like python or perl.
> 
> I am not sure what the performance difference is between the two, but in my 
> own work I have seen a significant performance penalty from using either of 
> them, because there is a somewhat large overhead of sending all of the data 
> out to a separate process just to read it back in again.
> 
> --Bobby Evans
> 
> 
> On 4/5/12 1:54 PM, "Mark question"  wrote:
> 
> Hi guys,
>  quick question:
>   Are there any performance gains from hadoop streaming or pipes over
> Java? From what I've read, it's only to ease testing by using your favorite
> language. So I guess it is eventually translated to bytecode then executed.
> Is that true?
> 
> Thank you,
> Mark
> 


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
It is a regular process, unless you explicitly say you want it to be java, 
which would be a bit odd to do, but possible.

--Bobby

On 4/5/12 3:14 PM, "Mark question"  wrote:

Thanks for the response Robert ..  so the overhead will be in read/write
and communication. But is the new process spawned a JVM or a regular
process?

Thanks,
Mark

On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans  wrote:

> Both streaming and pipes do very similar things.  They will fork/exec a
> separate process that is running whatever you want it to run.  The JVM that
> is running hadoop then communicates with this process to send the data over
> and get the processing results back.  The difference between streaming and
> pipes is that streaming uses stdin/stdout for this communication so
> preexisting processing like grep, sed and awk can be used here.  Pipes uses
> a custom protocol with a C++ library to communicate.  The C++ library is
> tagged with SWIG compatible data so that it can be wrapped to have APIs in
> other languages like python or perl.
>
> I am not sure what the performance difference is between the two, but in
> my own work I have seen a significant performance penalty from using either
> of them, because there is a somewhat large overhead of sending all of the
> data out to a separate process just to read it back in again.
>
> --Bobby Evans
>
>
> On 4/5/12 1:54 PM, "Mark question"  wrote:
>
> Hi guys,
>  quick question:
>   Are there any performance gains from hadoop streaming or pipes over
> Java? From what I've read, it's only to ease testing by using your favorite
> language. So I guess it is eventually translated to bytecode then executed.
> Is that true?
>
> Thank you,
> Mark
>
>



how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Jane Wayne
i am currently testing my map reduce job on Windows + Cygwin + Hadoop
v0.20.205. for some strange reason, the list of values (i.e. the
Iterable<Value> values) going into the reducer looks all wrong. i have
tracked the map reduce process with logging statements (i.e. logged
the input to the map, logged the output from the map, logged the
partitioner, logged the input to the reducer). at all stages,
everything looks correct except at the reducer.

is there any way (using Windows + Cygwin) to view the local map
outputs before they are shuffled/sorted to the reducer? i need to know
why the values are incorrect.


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Thanks for the response Robert ..  so the overhead will be in read/write
and communication. But is the new process spawned a JVM or a regular
process?

Thanks,
Mark

On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans  wrote:

> Both streaming and pipes do very similar things.  They will fork/exec a
> separate process that is running whatever you want it to run.  The JVM that
> is running hadoop then communicates with this process to send the data over
> and get the processing results back.  The difference between streaming and
> pipes is that streaming uses stdin/stdout for this communication so
> preexisting processing like grep, sed and awk can be used here.  Pipes uses
> a custom protocol with a C++ library to communicate.  The C++ library is
> tagged with SWIG compatible data so that it can be wrapped to have APIs in
> other languages like python or perl.
>
> I am not sure what the performance difference is between the two, but in
> my own work I have seen a significant performance penalty from using either
> of them, because there is a somewhat large overhead of sending all of the
> data out to a separate process just to read it back in again.
>
> --Bobby Evans
>
>
> On 4/5/12 1:54 PM, "Mark question"  wrote:
>
> Hi guys,
>  quick question:
>   Are there any performance gains from hadoop streaming or pipes over
> Java? From what I've read, it's only to ease testing by using your favorite
> language. So I guess it is eventually translated to bytecode then executed.
> Is that true?
>
> Thank you,
> Mark
>
>


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
Both streaming and pipes do very similar things.  They will fork/exec a 
separate process that is running whatever you want it to run.  The JVM that is 
running hadoop then communicates with this process to send the data over and 
get the processing results back.  The difference between streaming and pipes is 
that streaming uses stdin/stdout for this communication so preexisting 
processing like grep, sed and awk can be used here.  Pipes uses a custom 
protocol with a C++ library to communicate.  The C++ library is tagged with 
SWIG compatible data so that it can be wrapped to have APIs in other languages 
like python or perl.
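
For instance, a minimal streaming invocation would look something like this
(the exact jar path varies by version/distribution; input/output paths here
are just placeholders):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input /user/me/input -output /user/me/output \
      -mapper /bin/cat \
      -reducer "/usr/bin/wc -l"

Everything the mapper reads arrives on stdin as lines, and whatever it
writes to stdout becomes its output records.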

I am not sure what the performance difference is between the two, but in my own 
work I have seen a significant performance penalty from using either of them, 
because there is a somewhat large overhead of sending all of the data out to a 
separate process just to read it back in again.

--Bobby Evans


On 4/5/12 1:54 PM, "Mark question"  wrote:

Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark



Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark


Hadoop pipes and streaming ..

2012-04-05 Thread Mark question
Hi guys,

   Two quick questions:
   1. Are there any performance gains from hadoop streaming or pipes? As
far as I read, it is to ease testing using your favorite language, which I
think implies that everything is eventually translated to bytecode and
executed.


Analysing Hadoop log files offline.

2012-04-05 Thread Raj Vishwanathan
Please take a look at 

https://github.com/rajvish/hadoop-summary 


These scripts take Hadoop job logs and can provide either summary 
information or detailed per-task information.
I have tested it on both 0.20.* and 1.0.

The output is a CSV file that can be analysed using other programs such as 
Excel. 

I would love to get feedback.

-regards

Raj

Re: Doubt from the book "Definitive Guide"

2012-04-05 Thread Jean-Daniel Cryans
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia  wrote:
> The only advantage I was thinking of was that in some cases reducers might be
> able to take advantage of data locality and avoid multiple HTTP calls, no?
> The data is written anyway, so the last merged file could go on HDFS instead of
> local disk.
> I am new to hadoop so I am just asking questions to understand the rationale
> behind using local disk for the final output.

So basically it's a tradeoff here: you get more replicas to copy from,
but you have 2 more copies to write. Considering that that data is very
short-lived and that it doesn't need to be replicated (since if the
machine fails the maps are replayed anyway), it seems that writing 2
replicas that are potentially unused would be hurtful.

Regarding locality, it might make sense on a small cluster, but the
more nodes you add, the smaller the chance of having local replicas for
each block of data you're looking for.

J-D


Re: Jobtracker history logs missing

2012-04-05 Thread Prashant Kommireddi
Thanks Nitin.

I believe the config key you mentioned controls the task attempt logs that
go under ${hadoop.log.dir}/userlogs.

The ones that I mentioned are the job history logs that go under -
${hadoop.log.dir}/history
and are specified by the key "hadoop.job.history.location". Are these
cleaned up based on "mapred.userlog.retain.hours" too?

Also, this is what I am seeing in history dir

Available Conf files - Mar 3rd - April 5th
Available Job files   - Mar 3rd - April 3rd

There is no job file present after the 3rd of April, but conf files
continue to be written.

Thanks,
Prashant



On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal <
nitin.khandel...@germinait.com> wrote:

> Hi Prashant,
>
> The userlogs for a job are deleted after the time specified by the
> "mapred.userlog.retain.hours" property defined in mapred-site.xml (default
> is 24 hours).
>
> Thanks,
> Nitin
>
> On 5 April 2012 14:26, Prashant Kommireddi  wrote:
>
> > I am noticing something strange with JobTracker history logs on my
> cluster.
> > I see configuration files (*_conf.xml) under /logs/history/ but none of
> the
> > actual job logs. Anyone has ideas on what might be happening?
> >
> > Thanks,
> >
>
>
>
> --
>
>
> Nitin Khandelwal
>


Re: getting NullPointerException while running Word count example

2012-04-05 Thread kasi subrahmanyam
Hi Sujit,

I think it is a problem with the hostname configuration.
Could you please check whether you added the hostnames of the master and
the slaves to the /etc/hosts file of all the nodes.
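
For example, /etc/hosts on every node might contain entries along these
lines (hostnames and addresses below are just placeholders):

  192.168.1.10   master
  192.168.1.11   slave1
  192.168.1.12   slave2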


On Mon, Apr 2, 2012 at 8:00 PM, Sujit Dhamale wrote:

> Can someone please look into the issue below??
> Thanks in Advance
>
> On Wed, Mar 7, 2012 at 9:09 AM, Sujit Dhamale  >wrote:
>
> > Hadoop version : hadoop-0.20.203.0rc1.tar
> > Operating System : Ubuntu 11.10
> >
> >
> >
> > On Wed, Mar 7, 2012 at 12:19 AM, Harsh J  wrote:
> >
> >> Hi Sujit,
> >>
> >> Please also tell us which version/distribution of Hadoop is this?
> >>
> >> On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale <
> sujitdhamal...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I am new to Hadoop. i installed Hadoop as per
> >> >
> >>
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
> >> <
> >>
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluste
> >> >
> >> >
> >> >
> >> > while running the Word count example i am getting a NullPointerException.
> >> >
> >> > can someone please look into this issue?
> >> >
> >> > Thanks in Advance !!!
> >> >
> >> >
> >> >
> >> > hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
> >> > Found 3 items
> >> > -rw-r--r--   1 hduser supergroup 674566 2012-03-06 23:04
> >> > /user/hduser/data/pg20417.txt
> >> > -rw-r--r--   1 hduser supergroup1573150 2012-03-06 23:04
> >> > /user/hduser/data/pg4300.txt
> >> > -rw-r--r--   1 hduser supergroup1423801 2012-03-06 23:04
> >> > /user/hduser/data/pg5000.txt
> >> >
> >> > hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar
> >> > wordcount /user/hduser/data /user/hduser/gutenberg-outputd
> >> >
> >> > 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to
> >> process
> >> > : 3
> >> > 12/03/06 23:14:33 INFO mapred.JobClient: Running job:
> >> job_201203062221_0002
> >> > 12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
> >> > 12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
> >> > 12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
> >> > 12/03/06 23:14:58 INFO mapred.JobClient: Task Id :
> >> > attempt_201203062221_0002_r_00_0, Status : FAILED
> >> > Error: java.lang.NullPointerException
> >> >at
> >> > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
> >> >at
> >> >
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
> >> >at
> >> >
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
> >> >
> >> > 12/03/06 23:15:07 INFO mapred.JobClient: Task Id :
> >> > attempt_201203062221_0002_r_00_1, Status : FAILED
> >> > Error: java.lang.NullPointerException
> >> >at
> >> > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
> >> >at
> >> >
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
> >> >at
> >> >
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
> >> >
> >> > 12/03/06 23:15:16 INFO mapred.JobClient: Task Id :
> >> > attempt_201203062221_0002_r_00_2, Status : FAILED
> >> > Error: java.lang.NullPointerException
> >> >at
> >> > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
> >> >at
> >> >
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
> >> >at
> >> >
> >>
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
> >> >
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Job complete:
> >> job_201203062221_0002
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
> >> > 12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all
> >> > reduces waiting after reserving slots (ms)=0
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all
> >> maps
> >> > waiting after reserving slots (ms)=0
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: Failed reduce tasks=1
> >> > 12/03/06 23:15:31 INFO mapred.JobClient:
> SLOTS_MILLIS_REDUCES=16799
> >> > 12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520
> >> > 12/03/06 23:15:31 INFO mapred.JobClient: HDFS_BYTES_READ=3671863
> >> > 12/03/06 23:15:31 INFO mapred.JobClient:
> FILE_BYTES_WRITTEN=2278287
> >> > 12/03/06 23:15:31 INFO mapred.JobC

Re: Doubt from the book "Definitive Guide"

2012-04-05 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi wrote:

> Hi Mohit,
>
> What would be the advantage? Reducers in most cases read data from all
> the mappers. In the case where mappers were to write to HDFS, a
> reducer would still require to read data from other datanodes across
> the cluster.
>
>
The only advantage I was thinking of was that in some cases reducers might be
able to take advantage of data locality and avoid multiple HTTP calls, no?
The data is written anyway, so the last merged file could go on HDFS instead of
local disk.
I am new to hadoop so I am just asking questions to understand the rationale
behind using local disk for the final output.

> Prashant
>
> On Apr 4, 2012, at 9:55 PM, Mohit Anchlia  wrote:
>
> > On Wed, Apr 4, 2012 at 8:42 PM, Harsh J  wrote:
> >
> >> Hi Mohit,
> >>
> >> On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia 
> >> wrote:
> >>> I am going through the chapter "How mapreduce works" and have some
> >>> confusion:
> >>>
> >>> 1) Below description of Mapper says that reducers get the output file
> >> using
> >>> HTTP call. But the description under "The Reduce Side" doesn't
> >> specifically
> >>> say if it's copied using HTTP. So first confusion, Is the output copied
> >>> from mapper -> reducer or from reducer -> mapper? And second, Is the
> call
> >>> http:// or hdfs://
> >>
> >> The flow is simple as this:
> >> 1. For M+R job, map completes its task after writing all partitions
> >> down into the tasktracker's local filesystem (under mapred.local.dir
> >> directories).
> >> 2. Reducers fetch completion locations from events at JobTracker, and
> >> query the TaskTracker there to provide it the specific partition it
> >> needs, which is done over the TaskTracker's HTTP service (50060).
> >>
> >> So to clear things up - map doesn't send it to reduce, nor does reduce
> >> ask the actual map task. It is the task tracker itself that makes the
> >> bridge here.
> >>
> >> Note, however, that in Hadoop 2.0 the transfer via ShuffleHandler would
> >> be over Netty connections. This would be much faster and more
> >> reliable.
> >>
> >>> 2) My understanding was that mapper output gets written to hdfs, since
> >> I've
> >>> seen part-m-0 files in hdfs. If mapper output is written to HDFS
> then
> >>> shouldn't reducers simply read it from hdfs instead of making http
> calls
> >> to
> >>> tasktrackers location?
> >>
> >> A map-only job usually writes out to HDFS directly (no sorting done,
> >> cause no reducer is involved). If the job is a map+reduce one, the
> >> default output is collected to local filesystem for partitioning and
> >> sorting at map end, and eventually grouping at reduce end. Basically:
> >> Data you want to send to reducer from mapper goes to local FS for
> >> multiple actions to be performed on them, other data may directly go
> >> to HDFS.
> >>
> >> Reducers currently are scheduled pretty randomly but yes their
> >> scheduling can be improved for certain scenarios. However, if you are
> >> pointing that map partitions ought to be written to HDFS itself (with
> >> replication or without), I don't see performance improving. Note that
> >> the partitions aren't merely written but need to be sorted as well (at
> >> either end). To do that would need ability to spill frequently (cause
> >> we don't have infinite memory to do it all in RAM) and doing such a
> >> thing on HDFS would only mean slowdown.
> >>
> >> Thanks for clearing my doubts. In this case I was merely suggesting that
> > if the mapper output (merged output in the end or the shuffle output) is
> > stored in HDFS then reducers can just retrieve it from HDFS instead of
> > asking tasktracker for it. Once reducer threads read it they can continue
> > to work locally.
> >
> >
> >
> >> I hope this helps clear some things up for you.
> >>
> >> --
> >> Harsh J
> >>
>


Re: map task execution time

2012-04-05 Thread bikash sharma
Yes, how can we use "hadoop job" to get MR job stats, especially
constituent task finish times?


On Thu, Apr 5, 2012 at 9:02 AM, Jay Vyas  wrote:

> (excuse the typo in the last email : I meant "I've been playing with Cinch"
> , not "I've been with Cinch")
>
> On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas  wrote:
>
> > How can "hadoop job" be used to read m/r statistics ?
> >
> > On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma  >wrote:
> >
> >> Thanks Kai, I will try those.
> >>
> >> On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt  wrote:
> >>
> >> > Hi,
> >> >
> >> > Am 05.04.2012 um 00:20 schrieb bikash sharma:
> >> >
> >> > > Is it possible to get the execution time of the constituent
> map/reduce
> >> > > tasks of a MapReduce job (say sort) at the end of a job run?
> >> > > Preferably, can we obtain this programmatically?
> >> >
> >> >
> >> > you can access the JobTracker's web UI and see the start and stop
> >> > timestamps for every individual task.
> >> >
> >> > Since the JobTracker Java API is exposed, you can write your own
> >> > application to fetch that data through your own code.
> >> >
> >> > Also, "hadoop job" on the command line can be used to read job
> >> statistics.
> >> >
> >> > Kai
> >> >
> >> >
> >> > --
> >> > Kai Voigt
> >> > k...@123.org
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
> >
>
>
>
> --
> Jay Vyas
> MMSB/UCHC
>


Re: map task execution time

2012-04-05 Thread Jay Vyas
(excuse the typo in the last email : I meant "I've been playing with Cinch"
, not "I've been with Cinch")

On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas  wrote:

> How can "hadoop job" be used to read m/r statistics ?
>
> On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma wrote:
>
>> Thanks Kai, I will try those.
>>
>> On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt  wrote:
>>
>> > Hi,
>> >
>> > Am 05.04.2012 um 00:20 schrieb bikash sharma:
>> >
>> > > Is it possible to get the execution time of the constituent map/reduce
>> > > tasks of a MapReduce job (say sort) at the end of a job run?
>> > > Preferably, can we obtain this programmatically?
>> >
>> >
>> > you can access the JobTracker's web UI and see the start and stop
>> > timestamps for every individual task.
>> >
>> > Since the JobTracker Java API is exposed, you can write your own
>> > application to fetch that data through your own code.
>> >
>> > Also, "hadoop job" on the command line can be used to read job
>> statistics.
>> >
>> > Kai
>> >
>> >
>> > --
>> > Kai Voigt
>> > k...@123.org
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> Jay Vyas
> MMSB/UCHC
>



-- 
Jay Vyas
MMSB/UCHC


Re: map task execution time

2012-04-05 Thread Jay Vyas
How can "hadoop job" be used to read m/r statistics ?

On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma wrote:

> Thanks Kai, I will try those.
>
> On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt  wrote:
>
> > Hi,
> >
> > Am 05.04.2012 um 00:20 schrieb bikash sharma:
> >
> > > Is it possible to get the execution time of the constituent map/reduce
> > > tasks of a MapReduce job (say sort) at the end of a job run?
> > > Preferably, can we obtain this programmatically?
> >
> >
> > you can access the JobTracker's web UI and see the start and stop
> > timestamps for every individual task.
> >
> > Since the JobTracker Java API is exposed, you can write your own
> > application to fetch that data through your own code.
> >
> > Also, "hadoop job" on the command line can be used to read job
> statistics.
> >
> > Kai
> >
> >
> > --
> > Kai Voigt
> > k...@123.org
> >
> >
> >
> >
> >
>



-- 
Jay Vyas
MMSB/UCHC


Re: map task execution time

2012-04-05 Thread bikash sharma
Thanks Kai, I will try those.

On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt  wrote:

> Hi,
>
> Am 05.04.2012 um 00:20 schrieb bikash sharma:
>
> > Is it possible to get the execution time of the constituent map/reduce
> > tasks of a MapReduce job (say sort) at the end of a job run?
> > Preferably, can we obtain this programmatically?
>
>
> you can access the JobTracker's web UI and see the start and stop
> timestamps for every individual task.
>
> Since the JobTracker Java API is exposed, you can write your own
> application to fetch that data through your own code.
>
> Also, "hadoop job" on the command line can be used to read job statistics.
>
> Kai
>
>
> --
> Kai Voigt
> k...@123.org
>
>
>
>
>


Re: Jobtracker history logs missing

2012-04-05 Thread Nitin Khandelwal
Hi Prashant,

The userlogs for a job are deleted after the time specified by the
"mapred.userlog.retain.hours" property defined in mapred-site.xml (default
is 24 hours).
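
For example, to keep them around longer you could set something like this
in mapred-site.xml (the value below is just an illustration):

  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>72</value>
  </property>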

Thanks,
Nitin

On 5 April 2012 14:26, Prashant Kommireddi  wrote:

> I am noticing something strange with JobTracker history logs on my cluster.
> I see configuration files (*_conf.xml) under /logs/history/ but none of the
> actual job logs. Anyone has ideas on what might be happening?
>
> Thanks,
>



-- 


Nitin Khandelwal


Re: cloudera vs apache hadoop migration stories ?

2012-04-05 Thread Todd Lipcon
Hi Jay,

Probably makes sense to move this to the cdh-user list if you have any
Cloudera-specific questions. But I just wanted to clarify: CDH doesn't
make any API changes that aren't already upstream. So, in some places,
CDH may be ahead of whatever Apache release you are comparing against,
but it is always made up of patches from the Apache trunk. In the
specific case of MultipleInputs, we did backport the new API
implementation from Apache Hadoop 0.21+.

If you find something in CDH that you would like backported to
upstream Apache Hadoop 1.0.x, please feel free to file a JIRA and
assign it to me - I'm happy to look into it for you.

Thanks
Todd

On Wed, Apr 4, 2012 at 10:15 AM, Jay Vyas  wrote:
> Seems like Cloudera and standard Apache Hadoop are really not
> cross-compatible.  Things like MultipleInputs, and other stuff we are
> finding, don't work the same.  Any good (recent) war stories on the
> migration between the two?
>
> It's interesting to me that Cloudera and Amazon are that difficult to swap
> in/out in the cloud.



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: map task execution time

2012-04-05 Thread Kai Voigt
Hi,

Am 05.04.2012 um 00:20 schrieb bikash sharma:

> Is it possible to get the execution time of the constituent map/reduce
> tasks of a MapReduce job (say sort) at the end of a job run?
> Preferably, can we obtain this programmatically?


you can access the JobTracker's web UI and see the start and stop timestamps 
for every individual task.

Since the JobTracker Java API is exposed, you can write your own application to 
fetch that data through your own code.
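
For example, a rough sketch using the old mapred API (the job id below is
just a placeholder; the program needs the cluster configuration on its
classpath to reach the JobTracker):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.JobID;
  import org.apache.hadoop.mapred.TaskReport;

  public class TaskTimes {
    public static void main(String[] args) throws Exception {
      JobClient client = new JobClient(new JobConf());
      JobID id = JobID.forName("job_201204050000_0001");  // placeholder job id
      for (TaskReport r : client.getMapTaskReports(id)) {
        // start/finish timestamps per map task, in milliseconds
        System.out.println(r.getTaskID() + " took "
            + (r.getFinishTime() - r.getStartTime()) + " ms");
      }
    }
  }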

Also, "hadoop job" on the command line can be used to read job statistics.

Kai


-- 
Kai Voigt
k...@123.org