I think I've found the problem. A single line of code caused this issue :) It was output.collect(key, value);
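The message does not spell out why that one line was so costly. One plausible reading, based only on the two code listings quoted further down: the Pig UDF adds each `row` to its bag, while the mapper collects the whole `value` (the full document) once per row. The sketch below is a rough illustration of that difference, with no Hadoop classes involved. `CollectDemo`, `emittedChars`, and the made-up document of 1,000 rows are all hypothetical, and the row-emitting variant is an assumption about the intended behavior, not the fix the author actually applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mapper's inner loop; not the author's code.
class CollectDemo {

    // Counts how many characters the loop would hand to the collector.
    // emitWholeValue = true  mimics output.collect(key, value) inside the loop.
    // emitWholeValue = false emits only the current row (assumed intent).
    static long emittedChars(String value, List<String> rows, boolean emitWholeValue) {
        long total = 0;
        for (String row : rows) {
            String emitted = emitWholeValue ? value : row;
            total += emitted.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up input: a "document" of 1,000 rows, 100 characters each.
        List<String> rows = new ArrayList<>();
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            String row = String.format("%0100d", i);
            rows.add(row);
            doc.append(row);
        }
        String value = doc.toString();

        System.out.println("whole value per row: " + emittedChars(value, rows, true));
        System.out.println("one row per row:     " + emittedChars(value, rows, false));
    }
}
```

Under these assumptions the whole-value variant grows with rows times document size, so the gap widens as documents get longer.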
I had to add more logging to the code to track it down. For some reason kill -QUIT didn't send the stack trace to userLogs/<job>/<attempt>/syslog; I searched all the logs and couldn't find one. Does anyone know where stack traces are generally sent?

On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> I can't seem to find what's causing this slowness. Nothing in the logs.
> It's just painfully slow. However, the pig job, which has the same logic,
> is awesome in performance. Here is the mapper code and the pig code:
>
> public static class Map extends MapReduceBase
>         implements Mapper<Text, Text, Text, Text> {
>
>     public void map(Text key, Text value,
>             OutputCollector<Text, Text> output,
>             Reporter reporter) throws IOException {
>         String line = value.toString();
>         //log.info("output key:" + key + "value " + value + "value " + line);
>         FormMLType f;
>         try {
>             f = FormMLUtils.convertToRows(line);
>             FormMLStack fm = new FormMLStack(f, key.toString());
>             fm.parseFormML();
>             for (String row : fm.getFormattedRecords(false)) {
>                 output.collect(key, value);
>             }
>         } catch (JAXBException e) {
>             log.error("Error processing record " + key, e);
>         }
>     }
> }
>
> And here is the pig udf:
>
> public DataBag exec(Tuple input) throws IOException {
>     try {
>         DataBag output = mBagFactory.newDefaultBag();
>         Object o = input.get(1);
>         if (!(o instanceof String)) {
>             throw new IOException(
>                     "Expected document input to be chararray, but got "
>                     + o.getClass().getName());
>         }
>         Object o1 = input.get(0);
>         if (!(o1 instanceof String)) {
>             throw new IOException(
>                     "Expected input to be chararray, but got "
>                     + o1.getClass().getName());
>         }
>         String document = (String) o;
>         String filename = (String) o1;
>         FormMLType f = FormMLUtils.convertToRows(document);
>         FormMLStack fm = new FormMLStack(f, filename);
>         fm.parseFormML();
>         for (String row : fm.getFormattedRecords(false)) {
>             output.add(mTupleFactory.newTuple(row));
>         }
>         return output;
>     } catch (ExecException ee) {
>         log.error("Failed to Process ", ee);
>         throw ee;
>     } catch (JAXBException e) {
>         // TODO Auto-generated catch block
>         log.error("Invalid xml", e);
>         throw new IllegalArgumentException("invalid xml "
>                 + e.getCause().getMessage());
>     }
> }
>
> On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>
>> I am going to try a few things today. I have a JAXBContext object that
>> marshals the xml. It is a static instance, but my guess at this point is
>> that, since it lives in a separate jar from the one where the job runs and
>> I used DistributedCache.addClassPath, this context is being created on
>> every call for some reason. I don't know why that would be. I am going to
>> create this instance as static in the mapper class itself and see if that
>> helps. I will also add debug logging. Will post the results after I try it
>> out.
>>
>> On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
>>
>>> It would be great if we can take a look at what you are doing in the UDF
>>> vs the Mapper.
>>>
>>> 100x slow does not make sense for the same job/logic; it's either the
>>> Mapper code, or maybe the cluster was busy at the time you scheduled the
>>> MapReduce job?
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>>>
>>> > I am comparing the runtime of similar logic. The entire logic is exactly
>>> > the same, but surprisingly the map reduce job that I submit is 100x
>>> > slower. For pig I use a udf, and for hadoop I use a mapper only, with
>>> > the same logic as pig. Even the splits on the admin page are the same.
>>> > Not sure why it's so slow. I am submitting the job like:
>>> >
>>> > java -classpath
>>> > .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
>>> > com.services.dp.analytics.hadoop.mapred.FormMLProcessor
>>> > /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
>>> > /examples/output1/
>>> >
>>> > How should I go about finding the root cause of why it's so slow? Any
>>> > suggestions would be really appreciated.
>>> >
>>> > One of the things I noticed is that on the admin page, in the map task
>>> > list, I see the status as
>>> > "hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728" but for pig the
>>> > status is blank.
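
On the question at the top of this message about where kill -QUIT output goes: a HotSpot JVM prints the SIGQUIT thread dump to its own standard output, not through the logging framework, so it would not be expected in a log4j-backed syslog file. Where a task attempt's standard output ends up on this particular cluster is an assumption left open here. As an alternative that does not depend on that, the sketch below builds a similar dump in-process so it can be passed to whatever logger the job already uses. `ThreadDumpDemo` and `dumpAllThreads` are hypothetical names; only `Thread.getAllStackTraces()` is a real JDK call.

```java
import java.util.Map;

// Hypothetical helper: collects every live thread's stack into one string.
class ThreadDumpDemo {

    // Similar in spirit to the dump the JVM prints on SIGQUIT, but returned
    // as text so the caller decides where it is written.
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In a mapper this string could go to log.info(...) instead.
        System.out.println(dumpAllThreads());
    }
}
```

Calling this from a slow code path, for example every few thousand records, gives a rough picture of where threads are spending time without needing shell access to the task node.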