Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
I am going to try a few things today. I have a JAXBContext object that
marshals the xml. It is a static instance, but my guess at this point is
that since it lives in a separate jar from the one the job runs in, and I used
DistributedCache.addClassPath, this context is being created on every call
for some reason. I don't know why that would be. I am going to create this
instance as a static in the mapper class itself and see if that helps. I'll also
add debug logging. Will post the results after trying it out.
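
Roughly what I have in mind (an untested sketch; the idea is just to build the
context once per task JVM in the mapper class and have the conversion code reuse
it rather than creating it per call):

public static class Map extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    // Built once when the class loads, not on every map() call.
    private static final JAXBContext JAXB_CONTEXT;
    static {
        try {
            JAXB_CONTEXT = JAXBContext.newInstance(FormMLType.class);
        } catch (JAXBException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // map() would then get unmarshallers from JAXB_CONTEXT (via the FormMLUtils
    // conversion path) instead of triggering a new JAXBContext per record.
}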

On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.com wrote:

 It would be great if we can take a look at what you are doing in the UDF vs
 the Mapper.

 100x slower does not make sense for the same job/logic; it's either the Mapper
 code, or maybe the cluster was busy at the time you scheduled the MapReduce
 job?

 Thanks,
 Prashant

 On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

  I am comparing the runtime of similar logic. The entire logic is exactly the
  same, but surprisingly the map reduce job that I submit is 100x slower. For
  pig I use a udf and for hadoop I use a mapper only, with the same logic as the
  pig version. Even the splits on the admin page are the same. Not sure why it's
  so slow. I am submitting the job like:
 
  java -classpath
  .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
  com.services.dp.analytics.hadoop.mapred.FormMLProcessor
  /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
  /examples/output1/
 
  How should I go about finding the root cause of why it's so slow? Any
  suggestions would be really appreciated.
 
 
 
  One of the things I noticed is that on the admin page of the map task list I
  see the status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728, but
  for pig the status is blank.
 



Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
I think I've found the problem. There was one line of code that caused this
issue :) and that was output.collect(key, value);
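
In other words, the loop was collecting the entire input value once per
formatted record instead of the record itself. What I'm testing now looks
roughly like this (a sketch, mirroring what the pig udf does with each row):

for (String row : fm.getFormattedRecords(false)) {
    // was: output.collect(key, value);  // emitted the whole document for every row
    output.collect(key, new Text(row));
}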

I had to add more logging to the code to get to it. For some reason kill
-QUIT didn't send the stacktrace to userLogs/job/attempt/syslog; I
searched all the logs and couldn't find one. Does anyone know where
stacktraces are generally sent?

On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I can't seem to find what's causing this slowness. Nothing in the logs.
 It's just painfully slow. However, the pig job with the same logic performs
 great. Here is the mapper code and the pig code:


 public static class Map extends MapReduceBase
         implements Mapper<Text, Text, Text, Text> {

     public void map(Text key, Text value,
             OutputCollector<Text, Text> output,
             Reporter reporter) throws IOException {

         String line = value.toString();

         //log.info("output key: " + key + " value " + value + " value " + line);

         FormMLType f;

         try {
             f = FormMLUtils.convertToRows(line);
             FormMLStack fm = new FormMLStack(f, key.toString());
             fm.parseFormML();
             for (String row : fm.getFormattedRecords(false)) {
                 output.collect(key, value);
             }
         } catch (JAXBException e) {
             log.error("Error processing record " + key, e);
         }
     }
 }

 And here is the pig udf:


 public DataBag exec(Tuple input) throws IOException {

     try {
         DataBag output = mBagFactory.newDefaultBag();

         Object o = input.get(1);
         if (!(o instanceof String)) {
             throw new IOException(
                     "Expected document input to be chararray, but got "
                     + o.getClass().getName());
         }

         Object o1 = input.get(0);
         if (!(o1 instanceof String)) {
             throw new IOException(
                     "Expected input to be chararray, but got "
                     + o.getClass().getName());
         }

         String document = (String) o;
         String filename = (String) o1;

         FormMLType f = FormMLUtils.convertToRows(document);
         FormMLStack fm = new FormMLStack(f, filename);
         fm.parseFormML();

         for (String row : fm.getFormattedRecords(false)) {
             output.add(mTupleFactory.newTuple(row));
         }

         return output;

     } catch (ExecException ee) {
         log.error("Failed to Process ", ee);
         throw ee;
     } catch (JAXBException e) {
         // TODO Auto-generated catch block
         log.error("Invalid xml", e);
         throw new IllegalArgumentException("invalid xml " +
                 e.getCause().getMessage());
     }
 }

 On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I am going to try a few things today. I have a JAXBContext object that
 marshals the xml. It is a static instance, but my guess at this point is
 that since it lives in a separate jar from the one the job runs in, and I used
 DistributedCache.addClassPath, this context is being created on every call
 for some reason. I don't know why that would be. I am going to create this
 instance as a static in the mapper class itself and see if that helps. I'll also
 add debug logging. Will post the results after trying it out.


 On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.com wrote:

 It would be great if we can take a look at what you are doing in the UDF vs
 the Mapper.

 100x slower does not make sense for the same job/logic; it's either the Mapper
 code, or maybe the cluster was busy at the time you scheduled the MapReduce
 job?

 Thanks,
 Prashant

 On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

  I am comparing the runtime of similar logic. The entire logic is exactly the
  same, but surprisingly the map reduce job that I submit is 100x slower. For
  pig I use a udf and for hadoop I use a mapper only, with the same logic as the
  pig version. Even the splits on the admin page are the same. Not sure why it's
  so slow. I am submitting the job like:
 
  java -classpath
  .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
  com.services.dp.analytics.hadoop.mapred.FormMLProcessor
  /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
  /examples/output1/
 
  How should I go about finding the root cause of why it's so slow? Any
  suggestions would be really appreciated.
 
 
 
  One of the things I noticed is that on the admin page of the map task list I
  see the status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728, but
  for pig the status is blank.
 






100x slower mapreduce compared to pig

2012-02-28 Thread Mohit Anchlia
I am comparing the runtime of similar logic. The entire logic is exactly the
same, but surprisingly the map reduce job that I submit is 100x slower. For pig
I use a udf and for hadoop I use a mapper only, with the same logic as the pig
version. Even the splits on the admin page are the same. Not sure why it's so
slow. I am submitting the job like:

java -classpath
.:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
com.services.dp.analytics.hadoop.mapred.FormMLProcessor
/examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
/examples/output1/

How should I go about finding the root cause of why it's so slow? Any
suggestions would be really appreciated.



One of the things I noticed is that on the admin page of the map task list I
see the status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728, but
for pig the status is blank.


Re: 100x slower mapreduce compared to pig

2012-02-28 Thread Prashant Kommireddi
It would be great if we can take a look at what you are doing in the UDF vs
the Mapper.

100x slower does not make sense for the same job/logic; it's either the Mapper
code, or maybe the cluster was busy at the time you scheduled the MapReduce job?

Thanks,
Prashant

On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I am comparing the runtime of similar logic. The entire logic is exactly the
 same, but surprisingly the map reduce job that I submit is 100x slower. For pig
 I use a udf and for hadoop I use a mapper only, with the same logic as the pig
 version. Even the splits on the admin page are the same. Not sure why it's so
 slow. I am submitting the job like:

 java -classpath
 .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
 com.services.dp.analytics.hadoop.mapred.FormMLProcessor
 /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
 /examples/output1/

 How should I go about finding the root cause of why it's so slow? Any
 suggestions would be really appreciated.



 One of the things I noticed is that on the admin page of the map task list I
 see the status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728, but
 for pig the status is blank.