Btw, do speak to the Gora folks about fixing, or at least documenting, this flaw. I can imagine others hitting the same issue :)
On Mon, Jul 30, 2012 at 9:22 PM, Harsh J <ha...@cloudera.com> wrote:

I've mostly done it with logging, but this JIRA may interest you if you still wish to attach a remote debugger to tasks:
https://issues.apache.org/jira/browse/MAPREDUCE-2637

On Mon, Jul 30, 2012 at 7:28 PM, Sriram Ramachandrasekaran <sri.ram...@gmail.com> wrote:

Harsh,
I was waiting to try it on my cluster before coming back to report whether it worked. I tried it, and it works: the site-wide configuration did the trick. IOUtils.conf.addResource("job.xml") does the same thing as GoraMapReduceUtils.setIOSerializations(), so it did not help.

Thanks for the help. I would still like to know what a better way to debug distributed MapReduce jobs would be. I can debug stand-alone jobs quite easily, but I would like to know how folks debug distributed MapReduce jobs.

Thanks again!
-Sriram

On Sat, Jul 28, 2012 at 6:20 AM, Sriram Ramachandrasekaran <sri.ram...@gmail.com> wrote:

Aah! I always thought about setting io.serializations at the job level; I never thought of this. Will try this site-wide thing. Thanks again.

On 28 Jul 2012 06:16, "Harsh J" <ha...@cloudera.com> wrote:

Ah, that may be because the core-site.xml has the property io.serializations fully defined for Gora as well? You can do that as an alternative fix: supply a core-site.xml across tasktrackers that also carries the serialization classes Gora requires. I failed to think of that as a solution.

On Sat, Jul 28, 2012 at 6:04 AM, Sriram Ramachandrasekaran <sri.ram...@gmail.com> wrote:

Okay. But this issue didn't present itself when run in standalone mode. :)

On 28 Jul 2012 06:02, "Harsh J" <ha...@cloudera.com> wrote:

I find it easier to run jobs via MRUnit (http://mrunit.apache.org, TDD) first, or via LocalJobRunner, for debug purposes.
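[Editor's note: the site-wide fix that resolved this thread could look roughly as follows in a core-site.xml shipped to every tasktracker. This is a sketch, not quoted from the thread: the serializer class names are assumed to mirror what GoraMapReduceUtils.setIOSerializations() sets on the job, and should be checked against your Gora version.]

```xml
<!-- core-site.xml on every tasktracker: register Gora's serializers
     site-wide, so that even code which constructs a fresh Configuration()
     (and therefore never sees job.xml) can still find them. -->
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.gora.mapreduce.PersistentSerialization,org.apache.gora.mapreduce.StringSerialization</value>
</property>
```

Because core-site.xml is loaded by every Configuration object by default, this sidesteps the job.xml-only propagation problem described later in the thread.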
On Sat, Jul 28, 2012 at 5:53 AM, Sriram Ramachandrasekaran <sri.ram...@gmail.com> wrote:

Hello Harsh,
Thanks for your investigations. While we were debugging, I saw the exact same thing. As you pointed out, we suspected it to be a problem, so we set the job conf object directly on Gora's query object. It goes something like this:

    query.setConf..(job.getConfig..())

And then I saw that it was no longer creating a new object at getOrCreate().

OTOH, I've not tried the job.xml thing. I should give it a try, and I shall keep the loop posted.

I would also like to hear about standard practices for debugging distributed MR tasks.

-----
Reply from a hh device. Pl. excuse typos and lack of formatting.

On 28 Jul 2012 03:30, "Harsh J" <ha...@cloudera.com> wrote:

Hi Sriram,

I suspect the following in Gora to somehow be causing this issue:

IOUtils source:
http://svn.apache.org/viewvc/gora/trunk/gora-core/src/main/java/org/apache/gora/util/IOUtils.java?view=markup
QueryBase source:
http://svn.apache.org/viewvc/gora/trunk/gora-core/src/main/java/org/apache/gora/query/impl/QueryBase.java?view=markup

Notice that the IOUtils.deserialize(…) calls expect a proper Configuration object. If not passed one (i.e., if it is null), they call the following.
    private static Configuration getOrCreateConf(Configuration conf) {
      if (conf == null) {
        if (IOUtils.conf == null) {
          IOUtils.conf = new Configuration();
        }
      }
      return conf != null ? conf : IOUtils.conf;
    }

Now QueryBase, in its readFields method, has some IOUtils.deserialize(…) calls that seem to pass null for the configuration object. IOUtils.deserialize(…) hence calls the method above and, since the passed conf is null, initializes a whole new Configuration object.

If it does that, it will not load the "job.xml" contents, which is the job's config file (something only the map task's own configuration loads; it is not a file loaded by default). Hence, custom serializers will disappear the moment Gora begins using this new Configuration object.

This is what you'll want to investigate and fix, or notify the Gora devs about (why QueryBase#readFields passes a null conf object, and whether it can reuse some already-set conf object). As a cheap hack fix, maybe doing the following will make it work in an MR environment?

    IOUtils.conf = new Configuration();
    IOUtils.conf.addResource("job.xml");

I haven't tried the above, but let us know how we can be of further assistance. An ideal fix would be to only use the MapTask's provided Configuration object everywhere, somehow, and never re-create one.

P.s. If you want a thread ref link to share with the other devs over at Gora, here it is: http://search-hadoop.com/m/BXZA4dTUFC

On Fri, Jul 27, 2012 at 1:24 PM, Sriram Ramachandrasekaran <sri.ram...@gmail.com> wrote:

Hello,
I have an MR job that talks to HBase, and I use Gora to talk to HBase. Gora also provides a couple of classes which can be extended to write Mappers and Reducers when the Mappers need input from an HBase store and the Reducers need to write out to an HBase store. This is the reason I use Gora.

Now, when I run my MR job, I get the exception below (cf. https://issues.apache.org/jira/browse/HADOOP-3093):

    java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
        at org.apache.gora.mapreduce.GoraInputFormat.setConf(GoraInputFormat.java:115)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:723)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: java.io.IOException: java.lang.NullPointerException
        at org.apache.gora.util.IOUtils.loadFromConf(IOUtils.java:483)
        at org.apache.gora.mapreduce.GoraInputFormat.getQuery(GoraInputFormat.java:125)
        at org.apache.gora.mapreduce.GoraInputFormat.setConf(GoraInputFormat.java:112)
        ... 9 more
    Caused by: java.lang.NullPointerException
        at org.apache.hadoop.io.serializer.SerializationFactory.getDeserializer(SerializationFactory.java:77)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:205)
        at org.apache.gora.query.impl.QueryBase.readFields(QueryBase.java:234)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.io.DefaultStringifier.fromString(DefaultStringifier.java:75)
        at org.apache.hadoop.io.DefaultStringifier.load(DefaultStringifier.java:133)
        at org.apache.gora.util.IOUtils.loadFromConf(IOUtils.java:480)
        ... 11 more

I tried the following things to work through this issue.
0. The stack trace indicates that, when setting up a new Mapper, it is unable to deserialize something (I could not work out where exactly it fails).
1. I looked around the forums and realized that serialization options are not getting passed, so I tried setting the io.serializations config on the job.
1.1. I am not setting "io.serializations" myself; I use GoraMapReduceUtils.setIOSerializations() to do it. I verified that the confs are getting the proper serializers.
2. I checked the job XML to see whether these confs got through; they did. But it failed again.
3. I tried starting the Hadoop job runner with debug options turned on and in suspend mode (-Xdebug, suspend=y), and I also set the VM options for mapred child tasks via mapred.child.java.opts, to see if I could debug the newly spawned VM. Although I get a message on my stdout saying "opening port X and waiting", when I try to attach a remote debugger to that port, it does not work.

I understand that when SerializationFactory tries to deserialize 'something', it does not find an appropriate unmarshaller and so it fails. But I would like to know a way to find that 'something', and I would like some idea of how (pseudo-)distributed MR jobs are generally debugged. I tried searching and did not find anything useful.

Any help/pointers would be greatly useful.

Thanks!

--
It's just about how deep your longing is!
--
Harsh J
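[Editor's note: for readers landing here via search, the remote-debugging attempt described in step 3 of the original mail is typically wired up along the following lines on a Hadoop 1.x-era cluster. The JDWP flags, the port, and the single-task cap are assumptions about a typical setup, not settings quoted from this thread; MAPREDUCE-2637, linked above, discusses tooling that automates this.]

```xml
<!-- mapred-site.xml (or a per-job configuration): have each child task JVM
     wait for a debugger. suspend=y blocks the task until a debugger attaches
     on port 8000 of the tasktracker host. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000</value>
</property>
<!-- Run at most one map task at a time on this tasktracker, so that two
     child JVMs do not race for the same debug port. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```

Remember to remove suspend=y (or the whole option) afterwards, since every task will otherwise hang waiting for a debugger; also note that the task attempt may be killed by the framework's task timeout while suspended unless mapred.task.timeout is raised.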