Amir, let's continue the discussion on jira (now that we have got into disagreements :-) ).
Thanks,
Devaraj.

On Mar 23, 2012, at 10:55 AM, Amir Sanjar wrote:

> Devaraj,
> I respectfully disagree. Fixing the testcase for the IBM JVM will break
> it on the Sun JVM, unless we remove the assertion causing the problem.
> However, that might in turn mask other problems. Before any changes we
> need to understand the logic behind the split process. Is there any
> documentation? Who owned the code?
>
> Best Regards
> Amir Sanjar
>
> Linux System Management Architect and Lead
> IBM Senior Software Engineer
> Phone# 512-286-8393
> Fax# 512-838-8858
>
> From: Devaraj Das <d...@hortonworks.com>
> To: common-dev@hadoop.apache.org
> Cc: Jeffrey J Heroux/Poughkeepsie/IBM@IBMUS, John Williams/Austin/IBM@IBMUS
> Date: 03/23/2012 12:15 PM
> Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
>
> Thanks for the explanation, Kumar.
>
> This looks like a testcase problem. We can get into a discussion on
> whether we should tweak the split selection process to output more/less
> splits when given an input set, but for now we should fix the testcase.
> Makes sense?
>
> On Mar 23, 2012, at 7:26 AM, Kumar Ravi wrote:
>
>> Hi Devaraj,
>>
>> The issue Amir brings up has to do with the testcase scenario.
>>
>> We are trying to determine if this is a design issue with the
>> getMoreSplits() method in the CombineFileInputFormat class or if the
>> testcase needs modification.
>> As I mentioned in my earlier note, the observation we made while
>> debugging this issue is that the order in which the rackToBlocks
>> HashMap gets populated seems to matter. From the comments by Robert
>> Evans and you, it appears that by design the order should not matter.
>>
>> Amir's point is: the reason order happens to play a role here is that
>> as soon as all the blocks are accounted for, getMoreSplits() stops
>> iterating through the racks. Depending upon which rack(s) each block
>> is replicated on, and upon when each rack is processed in the loop
>> within getMoreSplits(), one can end up with different split counts
>> and, as a result, fail the testcase in some situations.
>>
>> Specifically for this testcase, 3 racks are simulated, each with one
>> datanode. Datanode 1 has replicas of all the blocks of all 3 files
>> (file1, file2, and file3), while Datanode 2 has all the blocks of
>> file2 and file3, and Datanode 3 has all the blocks of only file3. As
>> soon as Rack 1 is processed, getMoreSplits() exits with a split count
>> equal to the number of times it has gone through this loop. So in this
>> scenario, if Rack 1 gets processed last, one will end up with a split
>> count of 3. If Rack 1 gets processed first, the split count will be 1.
>> The testcase expects a return value of 3 but can get a 1 or 2,
>> depending on when Rack 1 gets processed.
>>
>> Hope this clarifies things a bit.
>>
>> Regards,
>> Kumar
>>
>> Kumar Ravi
>> IBM Linux Technology Center
>> Austin, TX
>>
>> Tel.: (512)286-8179
>>
>> From: Devaraj Das <d...@hortonworks.com>
>> To: common-dev@hadoop.apache.org
>> Cc: Jeffrey J Heroux/Poughkeepsie/IBM@IBMUS, John Williams/Austin/IBM@IBMUS
>> Date: 03/22/2012 04:41 PM
>> Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
>>
>> On Mar 22, 2012, at 11:45 AM, Amir Sanjar wrote:
>>
>>> Thanks for the reply, Robert.
>>> However, I believe the main design issue is:
>>> If there is a rack (listed in the rackToBlocks hashMap) that contains
>>> all the blocks (stored in the blockToNode hashMap), then regardless of
>>> the order, the split operation terminates after that rack gets
>>> processed. That means the remaining racks (listed in the rackToBlocks
>>> hashMap) will not get processed. For more details look at file
>>> CombineFileInputFormat.java, method getMoreSplits(), while loop
>>> starting at line 344.
>>>
>>
>> I haven't looked at the code much yet. But trying to understand your
>> question: what issue are you trying to bring out? Is it overloading
>> one task with too much input (there is a min/max limit on that one
>> though)?
>>
>>> Best Regards
>>> Amir Sanjar
>>>
>>> Linux System Management Architect and Lead
>>> IBM Senior Software Engineer
>>> Phone# 512-286-8393
>>> Fax# 512-838-8858
>>>
>>> From: Robert Evans <ev...@yahoo-inc.com>
>>> To: "common-dev@hadoop.apache.org" <common-dev@hadoop.apache.org>
>>> Date: 03/22/2012 11:57 AM
>>> Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
>>>
>>> If it really is the ordering of the hash map, I would say no, it
>>> should not, and the code should be updated. If ordering matters we
>>> need to use a map that guarantees a given order, and hash map is not
>>> one of them.
>>>
>>> --Bobby Evans
>>>
>>> On 3/22/12 7:24 AM, "Kumar Ravi" <gokumarr...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> We have been looking at IBM JDK junit failures on Hadoop-1.0.1
>>> independently and have run into the same failures as reported in this
>>> JIRA. I have a question based upon what I have observed below.
>>>
>>> We started debugging the problems in the testcase
>>> org.apache.hadoop.mapred.lib.TestCombineFileInputFormat.
>>> The testcase fails because the number of splits returned from
>>> CombineFileInputFormat.getSplits() is 1 when using the IBM JDK,
>>> whereas the expected return value is 2.
>>>
>>> So far, we have found that the reason for this difference in the
>>> number of splits is that the elements of the rackToBlocks hashmap get
>>> created in the reverse of the order in which the Sun JDK creates them.
>>>
>>> The question I have at this point is: should there be a strict
>>> dependency on the order in which the rackToBlocks hashmap gets
>>> populated, to determine the number of splits that should get created
>>> in a hadoop cluster? Is this working as designed?
>>>
>>> Regards,
>>> Kumar
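
For reference, the order dependence Kumar describes can be reproduced outside Hadoop with a small sketch. This is not the getMoreSplits() code. The names RackOrderDemo, blocksOn, racks and countSplits are hypothetical, and the sketch assumes one block per file and one split per rack that still contributes an unassigned block. A LinkedHashMap stands in for rackToBlocks so that the iteration order can be pinned, which a HashMap does not guarantee.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RackOrderDemo {

    // Hypothetical layout based on the scenario in the thread, simplified
    // to one block per file: rack1 holds all files, rack2 two, rack3 one.
    static List<String> blocksOn(String rack) {
        switch (rack) {
            case "rack1": return Arrays.asList("file1", "file2", "file3");
            case "rack2": return Arrays.asList("file2", "file3");
            case "rack3": return Arrays.asList("file3");
            default: throw new IllegalArgumentException("unknown rack: " + rack);
        }
    }

    // Builds a rack -> blocks map whose iteration order is the argument order.
    static Map<String, List<String>> racks(String... order) {
        Map<String, List<String>> rackToBlocks = new LinkedHashMap<>();
        for (String rack : order) {
            rackToBlocks.put(rack, blocksOn(rack));
        }
        return rackToBlocks;
    }

    // Simplified stand-in for the rack loop (an assumption, not Hadoop code):
    // one split per rack that adds an unassigned block, and the loop stops
    // as soon as every block has been assigned.
    static int countSplits(Map<String, List<String>> rackToBlocks) {
        Set<String> allBlocks = new HashSet<>();
        for (List<String> blocks : rackToBlocks.values()) {
            allBlocks.addAll(blocks);
        }
        Set<String> assigned = new HashSet<>();
        int splits = 0;
        for (List<String> blocks : rackToBlocks.values()) {
            if (assigned.size() == allBlocks.size()) {
                break; // all blocks accounted for; remaining racks are skipped
            }
            boolean contributed = false;
            for (String block : blocks) {
                contributed |= assigned.add(block);
            }
            if (contributed) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(racks("rack1", "rack2", "rack3"))); // 1
        System.out.println(countSplits(racks("rack2", "rack1", "rack3"))); // 2
        System.out.println(countSplits(racks("rack3", "rack2", "rack1"))); // 3
    }
}
```

With the same input and only the iteration order changed, the count moves between 1 and 3, which matches the behaviour reported above. Under these assumptions, a test that asserts one fixed split count can only pass reliably if the map order is pinned or the assertion is made independent of it.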