Thanks for the explanation, Kumar. This looks like a testcase problem. We can get into a discussion on whether we should tweak the split selection process to output more/less splits when given an input set, but for now we should fix the testcase. Makes sense?
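To make the order-dependence concrete for the archives, here is a minimal, self-contained sketch. It is not the actual getMoreSplits() code (the class, method, and block names below are invented); it only mimics the shape of the loop Kumar describes, run against the rack layout the testcase sets up:

import java.util.*;

// Minimal, self-contained sketch (NOT the real CombineFileInputFormat code).
// It mimics the shape of the rack loop: walk the racks in map-iteration
// order, form one split per rack that still contributes unassigned blocks,
// and stop as soon as every block has been assigned.
public class SplitOrderSketch {

  static int countSplits(Map<String, List<String>> rackToBlocks,
                         Collection<String> allBlocks) {
    Set<String> remaining = new HashSet<String>(allBlocks);
    int splits = 0;
    for (Map.Entry<String, List<String>> rack : rackToBlocks.entrySet()) {
      if (remaining.isEmpty()) {
        break; // every block accounted for -- later racks are never processed
      }
      boolean contributed = false;
      for (String block : rack.getValue()) {
        contributed |= remaining.remove(block);
      }
      if (contributed) {
        splits++; // one split per rack that supplied new blocks
      }
    }
    return splits;
  }

  public static void main(String[] args) {
    // One block per file, mirroring the testcase description: rack1 holds
    // file1+file2+file3, rack2 holds file2+file3, rack3 holds only file3.
    List<String> allBlocks = Arrays.asList("file1-blk0", "file2-blk0", "file3-blk0");

    // LinkedHashMap is used here only to pin the two iteration orders being
    // compared; a HashMap's order is whatever the JDK happens to give you.
    Map<String, List<String>> rack1First = new LinkedHashMap<String, List<String>>();
    rack1First.put("/rack1", Arrays.asList("file1-blk0", "file2-blk0", "file3-blk0"));
    rack1First.put("/rack2", Arrays.asList("file2-blk0", "file3-blk0"));
    rack1First.put("/rack3", Arrays.asList("file3-blk0"));

    Map<String, List<String>> rack1Last = new LinkedHashMap<String, List<String>>();
    rack1Last.put("/rack3", Arrays.asList("file3-blk0"));
    rack1Last.put("/rack2", Arrays.asList("file2-blk0", "file3-blk0"));
    rack1Last.put("/rack1", Arrays.asList("file1-blk0", "file2-blk0", "file3-blk0"));

    System.out.println(countSplits(rack1First, allBlocks)); // prints 1
    System.out.println(countSplits(rack1Last, allBlocks));  // prints 3
  }
}

Same loop and same data, yet the split count changes purely with iteration order, which is why an assertion on an exact split count is fragile across JDKs.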
On Mar 23, 2012, at 7:26 AM, Kumar Ravi wrote:

> Hi Devaraj,
>
> The issue Amir brings up has to do with the testcase scenario.
>
> We are trying to determine whether this is a design issue with the
> getMoreSplits() method in the CombineFileInputFormat class or whether the
> testcase needs modification. As I mentioned in my earlier note, the
> observation we made while debugging this issue is that the order in which
> the rackToBlocks HashMap gets populated seems to matter. From the comments
> by Robert Evans and you, it appears that by design the order should not
> matter.
>
> Amir's point is: the reason order happens to play a role here is that as
> soon as all the blocks are accounted for, getMoreSplits() stops iterating
> through the racks. Depending on which rack(s) each block is replicated on,
> and on when each rack is processed in the loop within getMoreSplits(), one
> can end up with different split counts, and as a result fail the testcase
> in some situations.
>
> Specifically for this testcase, 3 racks are simulated, each with a single
> datanode. Datanode 1 has replicas of all the blocks of all 3 files (file1,
> file2, and file3), Datanode 2 has all the blocks of file2 and file3, and
> Datanode 3 has all the blocks of only file3. As soon as Rack 1 is
> processed, getMoreSplits() exits with a split count equal to the number of
> iterations it has made through this loop. So in this scenario, if Rack1
> gets processed last, one ends up with a split count of 3; if Rack1 gets
> processed at the beginning, the split count will be 1. The testcase
> expects a return value of 3 but can get a 1 or a 2 depending on when Rack1
> gets processed.
>
> Hope this clarifies things a bit.
>
> Regards,
> Kumar
>
> Kumar Ravi
> IBM Linux Technology Center
> Austin, TX
>
> Tel.: (512)286-8179
>
> > From: Devaraj Das <d...@hortonworks.com>
> > To: common-dev@hadoop.apache.org
> > Cc: Jeffrey J Heroux/Poughkeepsie/IBM@IBMUS, John Williams/Austin/IBM@IBMUS
> > Date: 03/22/2012 04:41 PM
> > Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
> >
> > On Mar 22, 2012, at 11:45 AM, Amir Sanjar wrote:
> >
> > > Thanks for the reply, Robert. However, I believe the main design issue
> > > is: if there is a rack (listed in the rackToBlocks HashMap) that
> > > contains all the blocks (stored in the blockToNodes HashMap), then
> > > regardless of the order, the split operation terminates once that rack
> > > gets processed. That means the remaining racks (listed in the
> > > rackToBlocks HashMap) will not get processed. For more details, look at
> > > CombineFileInputFormat.java, method getMoreSplits(), while loop
> > > starting at line 344.
> >
> > I haven't looked at the code much yet. But trying to understand your
> > question - what issue are you trying to bring out? Is it overloading one
> > task with too much input (there is a min/max limit on that one though)?
> >
> > > Best Regards
> > > Amir Sanjar
> > >
> > > Linux System Management Architect and Lead
> > > IBM Senior Software Engineer
> > > Phone# 512-286-8393
> > > Fax# 512-838-8858
> > >
> > > From: Robert Evans <ev...@yahoo-inc.com>
> > > To: "common-dev@hadoop.apache.org" <common-dev@hadoop.apache.org>
> > > Date: 03/22/2012 11:57 AM
> > > Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
> > >
> > > If it really is the ordering of the hash map, I would say no, it should
> > > not, and the code should be updated. If ordering matters, we need to
> > > use a map that guarantees a given order, and HashMap is not one of
> > > them.
> > >
> > > --Bobby Evans
> > >
> > > On 3/22/12 7:24 AM, "Kumar Ravi" <gokumarr...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > We have been looking at the IBM JDK junit failures on Hadoop-1.0.1
> > > independently and have run into the same failures as reported in this
> > > JIRA. I have a question based upon what I have observed below.
> > >
> > > We started debugging the problems in the testcase
> > > org.apache.hadoop.mapred.lib.TestCombineFileInputFormat. The testcase
> > > fails because the number of splits returned from
> > > CombineFileInputFormat.getSplits() is 1 when using the IBM JDK, whereas
> > > the expected return value is 2.
> > >
> > > So far, we have found that the reason for this difference in the number
> > > of splits is that the order in which elements in the rackToBlocks
> > > hashmap get created is the reverse of the order the Sun JDK creates.
> > >
> > > The question I have at this point is: should there be a strict
> > > dependency on the order in which the rackToBlocks hashmap gets
> > > populated in order to determine the number of splits that should get
> > > created in a hadoop cluster? Is this working as designed?
> > >
> > > Regards,
> > > Kumar
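For reference on Bobby's point about ordering guarantees, here is a quick standalone demonstration (the rack names are placeholders, not taken from the testcase):

import java.util.*;

// HashMap documents no iteration order at all (and the order differs between
// the Sun and IBM JDKs), while LinkedHashMap guarantees insertion order and
// TreeMap guarantees sorted-key order. Rack names here are placeholders.
public class MapOrderDemo {
  public static void main(String[] args) {
    String[] racks = { "/rack3", "/rack1", "/rack2" };

    Map<String, Integer> plain  = new HashMap<String, Integer>();
    Map<String, Integer> linked = new LinkedHashMap<String, Integer>(); // insertion order
    Map<String, Integer> sorted = new TreeMap<String, Integer>();       // key order

    for (int i = 0; i < racks.length; i++) {
      plain.put(racks[i], i);
      linked.put(racks[i], i);
      sorted.put(racks[i], i);
    }

    System.out.println("HashMap:       " + plain.keySet());  // JDK-dependent
    System.out.println("LinkedHashMap: " + linked.keySet()); // [/rack3, /rack1, /rack2]
    System.out.println("TreeMap:       " + sorted.keySet()); // [/rack1, /rack2, /rack3]
  }
}

Swapping rackToBlocks to an order-preserving map, or making the testcase assertion order-independent, would both remove the JDK dependence; which of the two is the right fix is exactly the question in this thread.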