Thanks for the explanation, Kumar.

This looks like a testcase problem. We could debate whether the split 
selection process should be tweaked to produce more or fewer splits for a 
given input set, but for now we should fix the testcase. Does that make sense?

On Mar 23, 2012, at 7:26 AM, Kumar Ravi wrote:

> Hi Devaraj,
> 
>  The issue Amir brings up has to do with the Testcase scenario.
> 
> We are trying to determine whether this is a design issue with the 
> getMoreSplits() method in the CombineFileInputFormat class, or whether the 
> testcase needs modification. As I mentioned in my earlier note, while 
> debugging this issue we observed that the order in which the rackToBlocks 
> HashMap gets populated seems to matter. From your comments and Robert 
> Evans's, it appears that by design the order should not matter. 
> 
> Amir's point is this: order plays a role here because getMoreSplits() stops 
> iterating through the racks as soon as all the blocks are accounted for. 
> Depending on which rack(s) each block is replicated on, and on when each 
> rack is processed in the loop within getMoreSplits(), one can end up with 
> different split counts and, as a result, fail the testcase in some 
> situations.
> 
> Specifically, this testcase simulates 3 racks, each with a single datanode. 
> Datanode 1 has replicas of all the blocks of all 3 files (file1, file2, and 
> file3), Datanode 2 has all the blocks of file2 and file3, and Datanode 3 
> has all the blocks of file3 only. As soon as Rack 1 is processed, 
> getMoreSplits() exits with a split count equal to the number of loop 
> iterations completed so far. So in this scenario, if Rack 1 is processed 
> last, one ends up with a split count of 3; if Rack 1 is processed first, 
> the split count is 1. The testcase expects a return value of 3 but can get 
> 1 or 2 depending on when Rack 1 gets processed.
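The scenario Kumar describes can be sketched with a simplified model of that loop (an illustration only, not the actual getMoreSplits() code; the rack and block names are made up):

```java
import java.util.*;

public class SplitOrderDemo {
    // Simplified model of the loop in getMoreSplits(): visit racks in the
    // given order, emit one split per rack visited, and stop as soon as
    // every block has been seen on some rack.
    static int countSplits(List<String> rackOrder,
                           Map<String, Set<String>> rackToBlocks,
                           int totalBlocks) {
        Set<String> seen = new HashSet<>();
        int splits = 0;
        for (String rack : rackOrder) {
            splits++;
            seen.addAll(rackToBlocks.get(rack));
            if (seen.size() == totalBlocks) break; // all blocks accounted for
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> rackToBlocks = new HashMap<>();
        rackToBlocks.put("rack1", new HashSet<>(Arrays.asList("f1", "f2", "f3"))); // all files
        rackToBlocks.put("rack2", new HashSet<>(Arrays.asList("f2", "f3")));
        rackToBlocks.put("rack3", new HashSet<>(Arrays.asList("f3")));

        // rack1 first: every block is covered immediately -> 1 split
        System.out.println(countSplits(Arrays.asList("rack1", "rack2", "rack3"), rackToBlocks, 3));
        // rack1 last: all three racks are visited -> 3 splits
        System.out.println(countSplits(Arrays.asList("rack3", "rack2", "rack1"), rackToBlocks, 3));
        // rack1 second: 2 splits
        System.out.println(countSplits(Arrays.asList("rack2", "rack1", "rack3"), rackToBlocks, 3));
    }
}
```

The only thing that differs between runs is the iteration order, yet the split count varies from 1 to 3, which is exactly the JDK-dependent behavior the testcase trips over.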
> 
> Hope this clarifies things a bit. 
> 
> Regards,
> Kumar
> 
> 
> Kumar Ravi
> IBM Linux Technology Center 
> Austin, TX
> 
> Tel.: (512)286-8179
> 
> From: Devaraj Das <d...@hortonworks.com>
> To: common-dev@hadoop.apache.org
> Cc: Jeffrey J Heroux/Poughkeepsie/IBM@IBMUS, John Williams/Austin/IBM@IBMUS
> Date: 03/22/2012 04:41 PM
> Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
> 
> 
> 
> On Mar 22, 2012, at 11:45 AM, Amir Sanjar wrote:
> 
> > Thanks for the reply, Robert.
> > However, I believe the main design issue is this: if there is a rack
> > (listed in the rackToBlocks HashMap) that contains all the blocks (stored
> > in the blockToNodes HashMap), then regardless of the order, the split
> > operation terminates once that rack gets processed. That means the
> > remaining racks (listed in the rackToBlocks HashMap) will not get
> > processed. For more details, see CombineFileInputFormat.java, method
> > getMoreSplits(), while loop starting at line 344.
> > 
> 
> I haven't looked at the code much yet. But trying to understand your question 
> - what issue are you trying to bring out? Is it overloading one task with too 
> much input (there is a min/max limit on that one though)?
> 
> > Best Regards
> > Amir Sanjar
> > 
> > Linux System Management Architect and Lead
> > IBM Senior Software Engineer
> > Phone# 512-286-8393
> > Fax#      512-838-8858
> > 
> > 
> > 
> > 
> > 
> > From: Robert Evans <ev...@yahoo-inc.com>
> > To: "common-dev@hadoop.apache.org" <common-dev@hadoop.apache.org>
> > Date: 03/22/2012 11:57 AM
> > Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
> > 
> > 
> > 
> > If it really is the ordering of the HashMap, I would say no, it should
> > not, and the code should be updated.  If ordering matters, we need to use
> > a map that guarantees a given order, and HashMap is not one of them.
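Bobby's point can be illustrated with a small sketch (purely illustrative; the key names are made up). java.util.HashMap makes no iteration-order guarantee, so its order can differ between JDK implementations, while LinkedHashMap iterates in insertion order on every JDK:

```java
import java.util.*;

public class MapOrderDemo {
    public static void main(String[] args) {
        // LinkedHashMap guarantees iteration in insertion order on any JDK.
        Map<String, List<String>> rackToBlocks = new LinkedHashMap<>();
        rackToBlocks.put("rack3", Arrays.asList("f3"));
        rackToBlocks.put("rack2", Arrays.asList("f2", "f3"));
        rackToBlocks.put("rack1", Arrays.asList("f1", "f2", "f3"));

        // Iteration order is exactly the insertion order above.
        System.out.println(rackToBlocks.keySet()); // prints [rack3, rack2, rack1]

        // A plain HashMap's iteration order is unspecified: it may differ
        // between JDK implementations (e.g. Sun vs IBM), which is why code
        // whose results depend on that order behaves inconsistently.
        Map<String, List<String>> unordered = new HashMap<>(rackToBlocks);
        // The order of unordered.keySet() is an implementation detail --
        // do not rely on it.
    }
}
```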
> > 
> > --Bobby Evans
> > 
> > On 3/22/12 7:24 AM, "Kumar Ravi" <gokumarr...@gmail.com> wrote:
> > 
> > Hello,
> > 
> > We have been independently looking at IBM JDK junit failures on
> > Hadoop-1.0.1 and have run into the same failures as reported in this JIRA.
> > I have a question based upon what I have observed below.
> > 
> > We started debugging the problems in the testcase -
> > org.apache.hadoop.mapred.lib.TestCombineFileInputFormat
> > The testcase fails because CombineFileInputFormat.getSplits() returns 1
> > split when using the IBM JDK, whereas the expected return value is 2.
> > 
> > So far, we have found that the difference in the number of splits occurs
> > because the IBM JDK populates the rackToBlocks HashMap in the reverse of
> > the order the Sun JDK produces.
> > 
> > The question I have at this point is: should the number of splits created
> > in a Hadoop cluster depend strictly on the order in which the rackToBlocks
> > HashMap gets populated? Is this working as designed?
> > 
> > Regards,
> > Kumar
> > 
> > 
> > 
> 
> 
