Re: CompositeInputFormat scalbility

jason hadoop Wed, 24 Jun 2009 21:39:21 -0700

The input split size is Long.MAX_VALUE, so a file is never divided into
multiple splits, whatever the block size.
In actual fact, the contents of each directory are sorted separately.
The number of directory entries in each has to be identical,
and all files at index position I, where I varies from 0 to one less than the
number of files in a directory, become the input to one task.
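
A rough sketch of that grouping in plain Java, with no Hadoop classes
involved. The class name SplitGrouping, the method groupByIndex and the
directory names a/ and b/ are made up for illustration; this only mimics the
pairing described above and is not the join package's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SplitGrouping {

    // Hypothetical helper (not a Hadoop API): sorts each directory listing
    // separately, then groups the files found at the same index position.
    static List<List<String>> groupByIndex(List<List<String>> directories) {
        List<List<String>> sorted = new ArrayList<>();
        for (List<String> dir : directories) {
            List<String> copy = new ArrayList<>(dir);
            Collections.sort(copy); // each directory is sorted on its own
            sorted.add(copy);
        }
        if (sorted.isEmpty()) {
            return new ArrayList<>();
        }

        // Every directory must hold the same number of files.
        int fileCount = sorted.get(0).size();
        for (List<String> dir : sorted) {
            if (dir.size() != fileCount) {
                throw new IllegalArgumentException(
                    "directories have different numbers of files");
            }
        }

        // The files at index i, one from each directory, feed one task.
        List<List<String>> tasks = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            List<String> inputs = new ArrayList<>();
            for (List<String> dir : sorted) {
                inputs.add(dir.get(i));
            }
            tasks.add(inputs);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<List<String>> tasks = groupByIndex(Arrays.asList(
            Arrays.asList("a/part-0001", "a/part-0000"),
            Arrays.asList("b/part-0000", "b/part-0001")));
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("task " + i + ": " + tasks.get(i));
        }
    }
}
```

With two directories of two files each, this yields two tasks: one reading
both part-0000 files and one reading both part-0001 files.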


My book goes into this in some detail with examples.

Without patches, the map-side join can only handle 32 directories.

On Wed, Jun 24, 2009 at 9:09 PM, pmg <parmod.me...@gmail.com> wrote:

>
> And what decides part-0000, part-0001....input split, block size?
>
> So for example for 1G of data on HDFS with 64m block size get 16 blocks
> mapped to different map tasks?
>
>
>
> jason hadoop wrote:
> >
> > The join package does a streaming merge sort between each part-X in your
> > input directories,
> > part-0000 will be handled a single task,
> > part-0001 will be handled in a single task
> > and so on
> > These jobs are essentially io bound, and hard to beat for performance.
> >
> > On Wed, Jun 24, 2009 at 2:09 PM, pmg <parmod.me...@gmail.com> wrote:
> >
> >>
> >> I have two files FileA (with 600K records) and FileB (With 2million
> >> records)
> >>
> >> FileA has a key which is same of all the records
> >>
> >> 123    724101722493
> >> 123    781676672721
> >>
> >> FileB has the same key as FileA
> >>
> >> 123    5026328101569
> >> 123    5026328001562
> >>
> >> Using hadoop join package I can create output file with tuples and cross
> >> product of FileA and FileB.
> >>
> >> 123    [724101722493,5026328101569]
> >> 123    [724101722493,5026328001562]
> >> 123    [781676672721,5026328101569]
> >> 123    [781676672721,5026328001562]
> >>
> >> How does CompositeInputFormat scale when we want to join 600K with 2
> >> millions records. Does it run on the node with single map/reduce?
> >>
> >> Also how can I not write the result into a file instead input split the
> >> result into different nodes where I can compare the tuples e.g. comparing
> >> 724101722493 with 5026328101569 using some heuristics.
> >>
> >> thanks
> >> --
> >> View this message in context:
> >> http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24192957.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> > http://www.amazon.com/dp/1430219424?tag=jewlerymall
> > www.prohadoopbook.com a community for Hadoop Professionals
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24196664.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
