Re: 100x slower mapreduce compared to pig

Mohit Anchlia Wed, 29 Feb 2012 13:49:30 -0800

I think I've found the problem. One line of code caused this issue :)
It was output.collect(key, value);
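
A minimal sketch of why that one line can be so expensive, assuming the intent was to emit each formatted row (as the Pig UDF quoted below does) rather than the whole input value once per row. The Collector interface and CountingCollector class are hypothetical stand-ins for Hadoop's OutputCollector, so the sketch runs without Hadoop on the classpath:

```java
import java.util.Arrays;
import java.util.List;

public class CollectSketch {
    /** Hypothetical stand-in for Hadoop's OutputCollector<Text, Text>; illustration only. */
    interface Collector {
        void collect(String key, String value);
    }

    /** Counts how many records and characters it is handed. */
    static class CountingCollector implements Collector {
        long records = 0;
        long chars = 0;

        @Override
        public void collect(String key, String value) {
            records++;
            chars += value.length();
        }
    }

    /** Pattern from the quoted mapper: the whole input value is emitted once per row. */
    static void emitWholeValuePerRow(String key, String value, List<String> rows, Collector out) {
        for (String row : rows) {
            out.collect(key, value);
        }
    }

    /** ASSUMED intent (mirrors the Pig UDF, which adds each row to the bag). */
    static void emitRow(String key, List<String> rows, Collector out) {
        for (String row : rows) {
            out.collect(key, row);
        }
    }

    public static void main(String[] args) {
        // A made-up 1000-character "document" that parses into four short rows.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            doc.append('x');
        }
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4");

        CountingCollector whole = new CountingCollector();
        emitWholeValuePerRow("file", doc.toString(), rows, whole);

        CountingCollector perRow = new CountingCollector();
        emitRow("file", rows, perRow);

        System.out.println("whole value per row: " + whole.records + " records, " + whole.chars + " chars");
        System.out.println("one row per record:  " + perRow.records + " records, " + perRow.chars + " chars");
    }
}
```

With the first pattern the output volume grows as (number of rows) x (document size), which could plausibly account for a large slowdown on big documents; whether it explains the full 100x here is not something this sketch can confirm.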


I had to add more logging to the code to get to it. For some reason kill
-QUIT didn't send the stack trace to userLogs/<job>/<attempt>/syslog. I
searched all the logs and couldn't find one. Does anyone know where
stack traces are generally sent?
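
One likely explanation, stated as an assumption rather than something verified on this cluster: the HotSpot JVM prints the SIGQUIT thread dump to the process's standard output instead of going through log4j, so for a task JVM it would be expected in the attempt's stdout file rather than in syslog. If the signal route is awkward, a dump can also be produced from inside the code and handed to whatever logger is already in use. A minimal sketch using only the JDK; the class and method names are made up for illustration:

```java
import java.util.Map;

public class ThreadDumpSketch {
    /** Formats the stack traces of all live threads into a single string. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In a mapper this string could be passed to log.info(...) instead of stdout.
        System.out.println(dumpAllThreads());
    }
}
```

Calling dumpAllThreads() from a slow code path and logging the result puts the traces in the same log file as the rest of the task's logging.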

On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> I can't seem to find what's causing this slowness. Nothing in the logs.
> It's just painfully slow. However, the Pig job with the same logic
> performs very well. Here is the mapper code and the Pig code:
>
>
> public static class Map extends MapReduceBase
>         implements Mapper<Text, Text, Text, Text> {
>
>     public void map(Text key, Text value,
>                     OutputCollector<Text, Text> output,
>                     Reporter reporter) throws IOException {
>         String line = value.toString();
>         //log.info("output key:" + key + "value " + value + "value " + line);
>         FormMLType f;
>         try {
>             f = FormMLUtils.convertToRows(line);
>             FormMLStack fm = new FormMLStack(f, key.toString());
>             fm.parseFormML();
>             for (String row : fm.getFormattedRecords(false)) {
>                 output.collect(key, value);
>             }
>         } catch (JAXBException e) {
>             log.error("Error processing record " + key, e);
>         }
>     }
> }
>
> And here is the Pig UDF:
>
>
> public DataBag exec(Tuple input) throws IOException {
>     try {
>         DataBag output = mBagFactory.newDefaultBag();
>         Object o = input.get(1);
>         if (!(o instanceof String)) {
>             throw new IOException(
>                 "Expected document input to be chararray, but got "
>                 + o.getClass().getName());
>         }
>         Object o1 = input.get(0);
>         if (!(o1 instanceof String)) {
>             throw new IOException(
>                 "Expected input to be chararray, but got "
>                 + o1.getClass().getName());
>         }
>         String document = (String) o;
>         String filename = (String) o1;
>         FormMLType f = FormMLUtils.convertToRows(document);
>         FormMLStack fm = new FormMLStack(f, filename);
>         fm.parseFormML();
>         for (String row : fm.getFormattedRecords(false)) {
>             output.add(mTupleFactory.newTuple(row));
>         }
>         return output;
>     } catch (ExecException ee) {
>         log.error("Failed to Process ", ee);
>         throw ee;
>     } catch (JAXBException e) {
>         // TODO Auto-generated catch block
>         log.error("Invalid xml", e);
>         throw new IllegalArgumentException("invalid xml "
>             + e.getCause().getMessage());
>     }
> }
>
> On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>
>> I am going to try a few things today. I have a JAXBContext object that
>> marshals the XML. It is a static instance, but my guess at this point is
>> that because it lives in a separate jar from the one the job runs in, and
>> I used DistributeCache.addClassPath, the context is being created on every
>> call for some reason. I don't know why that would be. I am going to create
>> this instance as a static in the mapper class itself and see if that helps.
>> I will also add debug logging and post the results after I try it out.
>>
>>
>> On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
>>
>>> It would be great if we could take a look at what you are doing in the
>>> UDF vs. the Mapper.
>>>
>>> 100x slower does not make sense for the same job/logic. It's either the
>>> Mapper code, or maybe the cluster was busy at the time you scheduled the
>>> MapReduce job?
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>>>
>>> > I am comparing the runtime of similar logic. The logic is exactly the
>>> > same, but surprisingly the MapReduce job that I submit is 100x slower.
>>> > For Pig I use a UDF, and for Hadoop I use a mapper only, with the same
>>> > logic as in Pig. Even the splits on the admin page are the same. Not
>>> > sure why it's so slow. I am submitting the job like:
>>> >
>>> > java -classpath
>>> > .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
>>> > com.services.dp.analytics.hadoop.mapred.FormMLProcessor
>>> > /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
>>> > /examples/output1/
>>> >
>>> > How should I go about looking for the root cause of why it's so slow?
>>> > Any suggestions would be really appreciated.
>>> >
>>> >
>>> >
>>> > One of the things I noticed is that on the admin page, in the map task
>>> > list, I see the status as
>>> > "hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728", but for Pig
>>> > the status is blank.
>>> >
>>>
>>
>>
>
