Re: map task execution time

2012-04-05 Thread Kai Voigt
Hi,

On 05.04.2012 at 00:20, bikash sharma wrote:

 Is it possible to get the execution time of the constituent map/reduce
 tasks of a MapReduce job (say sort) at the end of a job run?
 Preferably, can we obtain this programmatically?


You can access the JobTracker's web UI and see the start and stop timestamps
for every individual task.

Since the JobTracker's Java API is exposed, you can also write your own
application to fetch that data in code.

Also, hadoop job on the command line can be used to read job statistics.

Kai


-- 
Kai Voigt
k...@123.org
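
For illustration, a minimal sketch of pulling per-task start and finish times
through the JobClient API on the 0.20/1.0 line; the class name is made up and
the job ID is only an example:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class TaskTimes {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf()); // picks up the cluster config on the classpath
    JobID jobId = JobID.forName(args[0]);            // e.g. "job_201203062221_0002"
    for (TaskReport report : client.getMapTaskReports(jobId)) {
      long millis = report.getFinishTime() - report.getStartTime();
      System.out.println(report.getTaskID() + " took " + millis + " ms");
    }
    // client.getReduceTaskReports(jobId) gives the same view of the reduce side.
  }
}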






Re: cloudera vs apache hadoop migration stories ?

2012-04-05 Thread Todd Lipcon
Hi Jay,

Probably makes sense to move this to the cdh-user list if you have any
Cloudera-specific questions. But I just wanted to clarify: CDH doesn't
make any API changes that aren't already upstream. So, in some places,
CDH may be ahead of whatever Apache release you are comparing against,
but it is always made up of patches from the Apache trunk. In the
specific case of MultipleInputs, we did backport the new API
implementation from Apache Hadoop 0.21+.

If you find something in CDH that you would like backported to
upstream Apache Hadoop 1.0.x, please feel free to file a JIRA and
assign it to me - I'm happy to look into it for you.

Thanks
Todd

On Wed, Apr 4, 2012 at 10:15 AM, Jay Vyas jayunit...@gmail.com wrote:
 Seems like Cloudera and standard Apache Hadoop are really not
 cross-compatible.  Things like MultipleInputs are among the things we are
 finding don't work the same.  Any good (recent) war stories on the migration
 between the two?

 It's interesting to me that Cloudera and Amazon are that difficult to swap
 in/out in the cloud.



-- 
Todd Lipcon
Software Engineer, Cloudera
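
For reference, a rough sketch of the new-API MultipleInputs mentioned above
(org.apache.hadoop.mapreduce.lib.input, available in CDH and Apache 0.21+);
the paths are placeholders and the identity Mapper stands in for real mapper
classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputSketch {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "multiple-inputs-sketch");
    // Two inputs with different formats feeding one job; Mapper.class is the
    // identity mapper and only a stand-in here.
    MultipleInputs.addInputPath(job, new Path("/data/text"), TextInputFormat.class, Mapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/seq"), SequenceFileInputFormat.class, Mapper.class);
    // ... set output key/value classes, the reducer and the output path, then:
    // job.waitForCompletion(true);
  }
}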


Re: Jobtracker history logs missing

2012-04-05 Thread Nitin Khandelwal
Hi Prashant,

The userlogs for a job are deleted after the time specified by the
mapred.userlog.retain.hours property defined in mapred-site.xml (the default
is 24 hours).

Thanks,
Nitin

On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote:

 I am noticing something strange with JobTracker history logs on my cluster.
 I see configuration files (*_conf.xml) under /logs/history/ but none of the
  actual job logs. Does anyone have ideas on what might be happening?

 Thanks,




-- 


Nitin Khandelwal


Re: map task execution time

2012-04-05 Thread bikash sharma
Thanks Kai, I will try those.

On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote:

 Hi,

 On 05.04.2012 at 00:20, bikash sharma wrote:

  Is it possible to get the execution time of the constituent map/reduce
  tasks of a MapReduce job (say sort) at the end of a job run?
  Preferably, can we obtain this programmatically?


 you can access the JobTracker's web UI and see the start and stop
 timestamps for every individual task.

 Since the JobTracker Java API is exposed, you can write your own
 application to fetch that data through your own code.

 Also, hadoop job on the command line can be used to read job statistics.

 Kai


 --
 Kai Voigt
 k...@123.org







Re: map task execution time

2012-04-05 Thread Jay Vyas
How can hadoop job be used to read m/r statistics?

On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma sharmabiks...@gmail.com wrote:

 Thanks Kai, I will try those.

 On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote:

  Hi,
 
  On 05.04.2012 at 00:20, bikash sharma wrote:
 
   Is it possible to get the execution time of the constituent map/reduce
   tasks of a MapReduce job (say sort) at the end of a job run?
   Preferably, can we obtain this programmatically?
 
 
  you can access the JobTracker's web UI and see the start and stop
  timestamps for every individual task.
 
  Since the JobTracker Java API is exposed, you can write your own
  application to fetch that data through your own code.
 
  Also, hadoop job on the command line can be used to read job
 statistics.
 
  Kai
 
 
  --
  Kai Voigt
  k...@123.org
 
 
 
 
 




-- 
Jay Vyas
MMSB/UCHC


Re: map task execution time

2012-04-05 Thread Jay Vyas
(excuse the typo in the last email: I meant I've been playing with Cinch,
not I've been with Cinch)

On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas jayunit...@gmail.com wrote:

 How can hadoop job be used to read m/r statistics ?

 On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma sharmabiks...@gmail.com wrote:

 Thanks Kai, I will try those.

 On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote:

  Hi,
 
   On 05.04.2012 at 00:20, bikash sharma wrote:
 
   Is it possible to get the execution time of the constituent map/reduce
   tasks of a MapReduce job (say sort) at the end of a job run?
    Preferably, can we obtain this programmatically?
 
 
  you can access the JobTracker's web UI and see the start and stop
  timestamps for every individual task.
 
  Since the JobTracker Java API is exposed, you can write your own
  application to fetch that data through your own code.
 
  Also, hadoop job on the command line can be used to read job
 statistics.
 
  Kai
 
 
  --
  Kai Voigt
  k...@123.org
 
 
 
 
 




 --
 Jay Vyas
 MMSB/UCHC




-- 
Jay Vyas
MMSB/UCHC


Re: map task execution time

2012-04-05 Thread bikash sharma
Yes, how can we use hadoop job to get MR job stats, especially
constituent task finish times?


On Thu, Apr 5, 2012 at 9:02 AM, Jay Vyas jayunit...@gmail.com wrote:

 (excuse the typo in the last email : I meant I've been playing with Cinch
 , not I've been with Cinch)

 On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas jayunit...@gmail.com wrote:

  How can hadoop job be used to read m/r statistics ?
 
  On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma sharmabiks...@gmail.com
 wrote:
 
  Thanks Kai, I will try those.
 
  On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt k...@123.org wrote:
 
   Hi,
  
    On 05.04.2012 at 00:20, bikash sharma wrote:
  
Is it possible to get the execution time of the constituent
 map/reduce
tasks of a MapReduce job (say sort) at the end of a job run?
 Preferably, can we obtain this programmatically?
  
  
   you can access the JobTracker's web UI and see the start and stop
   timestamps for every individual task.
  
   Since the JobTracker Java API is exposed, you can write your own
   application to fetch that data through your own code.
  
   Also, hadoop job on the command line can be used to read job
  statistics.
  
   Kai
  
  
   --
   Kai Voigt
   k...@123.org
  
  
  
  
  
 
 
 
 
  --
  Jay Vyas
  MMSB/UCHC
 



 --
 Jay Vyas
 MMSB/UCHC
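
For what it's worth, the history option of the hadoop job command is one way
to get those per-task times on 0.20/1.0-era clusters; the output directory and
job ID below are placeholders:

hadoop job -history <job-output-dir>       # job summary with per-task start/finish times
hadoop job -history all <job-output-dir>   # the same, with every task attempt listed
hadoop job -status <job-id>                # progress, counters and completion state for one job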



Re: Doubt from the book Definitive Guide

2012-04-05 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.com wrote:

 Hi Mohit,

 What would be the advantage? Reducers in most cases read data from all
 the mappers. In the case where mappers were to write to HDFS, a
 reducer would still require to read data from other datanodes across
 the cluster.


The only advantage I was thinking of was that in some cases reducers might be
able to take advantage of data locality and avoid multiple HTTP calls, no?
The data is written anyway, so the last merged file could go to HDFS instead of
local disk.
I am new to Hadoop, so I am just asking to understand the rationale
behind using local disk for the final output.

 Prashant

 On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

  On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:
 
  Hi Mohit,
 
  On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
  I am going through the chapter How mapreduce works and have some
  confusion:
 
   1) The description of the Mapper below says that reducers get the output
   file using an HTTP call. But the description under The Reduce Side doesn't
   specifically say if it's copied using HTTP. So, first confusion: is the
   output copied from mapper -> reducer or from reducer -> mapper? And second,
   is the call http:// or hdfs://?
 
  The flow is simple as this:
  1. For M+R job, map completes its task after writing all partitions
  down into the tasktracker's local filesystem (under mapred.local.dir
  directories).
  2. Reducers fetch completion locations from events at JobTracker, and
  query the TaskTracker there to provide it the specific partition it
  needs, which is done over the TaskTracker's HTTP service (50060).
 
  So to clear things up - map doesn't send it to reduce, nor does reduce
  ask the actual map task. It is the task tracker itself that makes the
  bridge here.
 
   Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
   be over Netty connections. This would be much faster and more reliable.
 
  2) My understanding was that mapper output gets written to hdfs, since
  I've
  seen part-m-0 files in hdfs. If mapper output is written to HDFS
 then
  shouldn't reducers simply read it from hdfs instead of making http
 calls
  to
  tasktrackers location?
 
  A map-only job usually writes out to HDFS directly (no sorting done,
  cause no reducer is involved). If the job is a map+reduce one, the
  default output is collected to local filesystem for partitioning and
  sorting at map end, and eventually grouping at reduce end. Basically:
   Data you want to send from the mapper to the reducer goes to the local FS
   so that multiple actions can be performed on it; other data may go directly
   to HDFS.
 
  Reducers currently are scheduled pretty randomly but yes their
  scheduling can be improved for certain scenarios. However, if you are
   suggesting that map partitions ought to be written to HDFS itself (with
  replication or without), I don't see performance improving. Note that
  the partitions aren't merely written but need to be sorted as well (at
   either end). To do that would need the ability to spill frequently (since
   we don't have infinite memory to do it all in RAM), and doing such a
   thing on HDFS would only mean a slowdown.
 
  Thanks for clearing my doubts. In this case I was merely suggesting that
  if the mapper output (merged output in the end or the shuffle output) is
  stored in HDFS then reducers can just retrieve it from HDFS instead of
   asking the tasktracker for it. Once reducer threads read it they can continue
  to work locally.
 
 
 
  I hope this helps clear some things up for you.
 
  --
  Harsh J
 



Re: getting NullPointerException while running Word cont example

2012-04-05 Thread kasi subrahmanyam
Hi Sujit,

I think it is a problem with the host names configuration.
Could you please check whether you added the host names of the master and
the slaves in the /etc/hosts file of all the nodes?


On Mon, Apr 2, 2012 at 8:00 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote:

 Can someone please look into the issue below?
 Thanks in Advance

 On Wed, Mar 7, 2012 at 9:09 AM, Sujit Dhamale sujitdhamal...@gmail.com
 wrote:

   Hadoop version: hadoop-0.20.203.0rc1.tar
   Operating System: Ubuntu 11.10
 
 
 
  On Wed, Mar 7, 2012 at 12:19 AM, Harsh J ha...@cloudera.com wrote:
 
  Hi Sujit,
 
  Please also tell us which version/distribution of Hadoop is this?
 
  On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale 
 sujitdhamal...@gmail.com
  wrote:
   Hi,
  
    I am new to Hadoop. I installed Hadoop as per
  http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

    While running the word count example I am getting a NullPointerException.

    Can someone please look into this issue?

    Thanks in advance!
  
  
    hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
   Found 3 items
    -rw-r--r--   1 hduser supergroup     674566 2012-03-06 23:04 /user/hduser/data/pg20417.txt
    -rw-r--r--   1 hduser supergroup    1573150 2012-03-06 23:04 /user/hduser/data/pg4300.txt
    -rw-r--r--   1 hduser supergroup    1423801 2012-03-06 23:04 /user/hduser/data/pg5000.txt
  
   hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar
   wordcount /user/hduser/data /user/hduser/gutenberg-outputd
  
    12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process : 3
    12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002
    12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
    12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/06 23:14:58 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_0, Status : FAILED
    Error: java.lang.NullPointerException
       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)

    12/03/06 23:15:07 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_1, Status : FAILED
    Error: java.lang.NullPointerException
       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)

    12/03/06 23:15:16 INFO mapred.JobClient: Task Id : attempt_201203062221_0002_r_00_2, Status : FAILED
    Error: java.lang.NullPointerException
       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)

    12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002
    12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
    12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
    12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4
    12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084
    12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
    12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
    12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3
    12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3
    12/03/06 23:15:31 INFO mapred.JobClient: Failed reduce tasks=1
    12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16799
    12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
    12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520
    12/03/06 23:15:31 INFO mapred.JobClient: HDFS_BYTES_READ=3671863
    12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2278287
    12/03/06 23:15:31 INFO mapred.JobClient:   File Input Format Counters
    12/03/06 23:15:31 INFO mapred.JobClient: Bytes Read=3671517
    12/03/06 23:15:31 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/06 23:15:31 INFO mapred.JobClient: Map output materialized bytes=1474341
    12/03/06 23:15:31 INFO mapred.JobClient: Combine output records=102322
   12/03/06 

Re: Jobtracker history logs missing

2012-04-05 Thread Prashant Kommireddi
Thanks Nitin.

I believe the config key you mentioned controls the task attempt logs that
go under ${hadoop.log.dir}/userlogs.

The ones that I mentioned are the job history logs that go under
${hadoop.log.dir}/history and are specified by the key
hadoop.job.history.location. Are these cleaned up based on
mapred.userlog.retain.hours too?

Also, this is what I am seeing in history dir

Available Conf files - Mar 3rd to April 5th
Available Job files  - Mar 3rd to April 3rd

There is no job file present after the 3rd of April, but conf files
continue to be written.

Thanks,
Prashant



On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal 
nitin.khandel...@germinait.com wrote:

 Hi Prashant,

  The userlogs for a job are deleted after the time specified by the
  mapred.userlog.retain.hours property defined in mapred-site.xml (the default
  is 24 hours).

 Thanks,
 Nitin

 On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote:

  I am noticing something strange with JobTracker history logs on my
 cluster.
  I see configuration files (*_conf.xml) under /logs/history/ but none of
 the
   actual job logs. Does anyone have ideas on what might be happening?
 
  Thanks,
 



 --


 Nitin Khandelwal



Re: Doubt from the book Definitive Guide

2012-04-05 Thread Jean-Daniel Cryans
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Only advantage I was thinking of was that in some cases reducers might be
 able to take advantage of data locality and avoid multiple HTTP calls, no?
 Data is anyways written, so last merged file could go on HDFS instead of
 local disk.
 I am new to hadoop so just asking question to understand the rational
 behind using local disk for final output.

So basically it's a tradeoff here: you get more replicas to copy from,
but you have 2 more copies to write. Considering that that data is very
short-lived and that it doesn't need to be replicated (since if the
machine fails the maps are replayed anyway), it seems that writing 2
replicas that are potentially unused would be hurtful.

Regarding locality, it might make sense on a small cluster, but the
more nodes you add the smaller the chance of having local replicas for
each block of data you're looking for.

J-D


Hadoop pipes and streaming ..

2012-04-05 Thread Mark question
Hi guys,

   Two quick questions:
   1. Are there any performance gains from hadoop streaming or pipes? As
far as I have read, they are meant to ease testing using your favorite
language, which I think implies that everything is eventually translated to
bytecode and executed.


Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
Both streaming and pipes do very similar things.  They will fork/exec a 
separate process that is running whatever you want it to run.  The JVM that is 
running hadoop then communicates with this process to send the data over and 
get the processing results back.  The difference between streaming and pipes is 
that streaming uses stdin/stdout for this communication so preexisting 
processing like grep, sed and awk can be used here.  Pipes uses a custom 
protocol with a C++ library to communicate.  The C++ library is tagged with 
SWIG compatible data so that it can be wrapped to have APIs in other languages 
like python or perl.

I am not sure what the performance difference is between the two, but in my own 
work I have seen a significant performance penalty from using either of them, 
because there is a somewhat large overhead of sending all of the data out to a 
separate process just to read it back in again.

--Bobby Evans


On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote:

Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark
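
As a concrete illustration of the stdin/stdout path described above, a minimal
streaming run that reuses existing Unix tools; the jar location and HDFS paths
are placeholders for whatever a 0.20-era install uses:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/hduser/data \
    -output /user/hduser/streaming-out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc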



Re: Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Thanks for the response, Robert. So the overhead will be in read/write
and communication. But is the new process spawned a JVM or a regular
process?

Thanks,
Mark

On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans ev...@yahoo-inc.com wrote:

 Both streaming and pipes do very similar things.  They will fork/exec a
 separate process that is running whatever you want it to run.  The JVM that
 is running hadoop then communicates with this process to send the data over
 and get the processing results back.  The difference between streaming and
 pipes is that streaming uses stdin/stdout for this communication so
 preexisting processing like grep, sed and awk can be used here.  Pipes uses
 a custom protocol with a C++ library to communicate.  The C++ library is
 tagged with SWIG compatible data so that it can be wrapped to have APIs in
 other languages like python or perl.

 I am not sure what the performance difference is between the two, but in
 my own work I have seen a significant performance penalty from using either
 of them, because there is a somewhat large overhead of sending all of the
 data out to a separate process just to read it back in again.

 --Bobby Evans


 On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote:

 Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
 Java? From what I've read, it's only to ease testing by using your favorite
 language. So I guess it is eventually translated to bytecode then executed.
 Is that true?

 Thank you,
 Mark




how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Jane Wayne
i am currently testing my map reduce job on Windows + Cygwin + Hadoop
v0.20.205. for some strange reason, the list of values (i.e.
Iterable<T> values) going into the reducer looks all wrong. i have
tracked the map reduce process with logging statements (i.e. logged
the input to the map, logged the output from the map, logged the
partitioner, logged the input to the reducer). at all stages,
everything looks correct except at the reducer.

is there any way (using Windows + Cygwin) to view the local map
outputs before they are shuffled/sorted to the reducer? i need to know
why the values are incorrect.


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Charles Earl
Also bear in mind that there is a kind of detour involved, in the sense that a 
pipes map must send key,value data back to the Java process and then to reduce 
(more or less). 
I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be 
faster. 
Would be interested to know if the community has any experience with HCE 
performance.
C

On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote:

 Both streaming and pipes do very similar things.  They will fork/exec a 
 separate process that is running whatever you want it to run.  The JVM that 
 is running hadoop then communicates with this process to send the data over 
 and get the processing results back.  The difference between streaming and 
 pipes is that streaming uses stdin/stdout for this communication so 
 preexisting processing like grep, sed and awk can be used here.  Pipes uses a 
 custom protocol with a C++ library to communicate.  The C++ library is tagged 
 with SWIG compatible data so that it can be wrapped to have APIs in other 
 languages like python or perl.
 
 I am not sure what the performance difference is between the two, but in my 
 own work I have seen a significant performance penalty from using either of 
 them, because there is a somewhat large overhead of sending all of the data 
 out to a separate process just to read it back in again.
 
 --Bobby Evans
 
 
 On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote:
 
 Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
 Java? From what I've read, it's only to ease testing by using your favorite
 language. So I guess it is eventually translated to bytecode then executed.
 Is that true?
 
 Thank you,
 Mark
 


Re: how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Jane Wayne
i found out what my problem was. apparently, when you iterate over
Iterable<Type> values, that instance of Type is being used over and
over. for example, in my reducer,

public void reduce(Key key, Iterable<Value> values, Context context)
    throws IOException, InterruptedException {
  Iterator<Value> it = values.iterator();
  Value a = it.next();
  Value b = it.next();
}

the variables a and b, both of type Value, will be the same object
instance! i suppose this behavior of the iterator is to optimize
iterating so as to avoid the new operator.



On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne jane.wayne2...@gmail.com wrote:
 i am currently testing my map reduce job on Windows + Cygwin + Hadoop
 v0.20.205. for some strange reason, the list of values (i.e.
 Iterable<T> values) going into the reducer looks all wrong. i have
 tracked the map reduce process with logging statements (i.e. logged
 the input to the map, logged the output from the map, logged the
 partitioner, logged the input to the reducer). at all stages,
 everything looks correct except at the reducer.

 is there anyway (using Windows  + Cygwin) to view the local map
 outputs before they are shuffled/sorted to the reducer? i need to know
 why the values are incorrect.


Re: how do i view the local file system output of a mapper on cygwin + windows?

2012-04-05 Thread Harsh J
Jane,

Yes and thats documented:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reducer.html#reduce(K2,%20java.util.Iterator,%20org.apache.hadoop.mapred.OutputCollector,%20org.apache.hadoop.mapred.Reporter)

The framework will reuse the key and value objects that are passed
into the reduce, therefore the application should clone the objects
they want to keep a copy of.

On Fri, Apr 6, 2012 at 6:26 AM, Jane Wayne jane.wayne2...@gmail.com wrote:
 i found out what my problem was. apparently, when you iterate over
 Iterable<Type> values, that instance of Type is being used over and
 over. for example, in my reducer,

 public void reduce(Key key, Iterable<Value> values, Context context)
     throws IOException, InterruptedException {
   Iterator<Value> it = values.iterator();
   Value a = it.next();
   Value b = it.next();
 }

 the variables, a and b of type Value, will be the same object
 instance! i suppose this behavior of the iterator is to optimize
 iterating so as to avoid the new operator.



 On Thu, Apr 5, 2012 at 4:55 PM, Jane Wayne jane.wayne2...@gmail.com wrote:
 i am currently testing my map reduce job on Windows + Cygwin + Hadoop
 v0.20.205. for some strange reason, the list of values (i.e.
 Iterable<T> values) going into the reducer looks all wrong. i have
 tracked the map reduce process with logging statements (i.e. logged
 the input to the map, logged the output from the map, logged the
 partitioner, logged the input to the reducer). at all stages,
 everything looks correct except at the reducer.

 is there anyway (using Windows  + Cygwin) to view the local map
 outputs before they are shuffled/sorted to the reducer? i need to know
 why the values are incorrect.



-- 
Harsh J
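
A small sketch of the cloning the javadoc above calls for, written against the
new (org.apache.hadoop.mapreduce) API; Text is only an assumed value type for
illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CopyingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<Text> copies = new ArrayList<Text>();
    for (Text value : values) {
      copies.add(new Text(value)); // copy: the framework reuses the incoming object
    }
    // copies now holds distinct objects that survive the iteration
    for (Text copy : copies) {
      context.write(key, copy);
    }
  }
}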