And what decides part-0000, part-0001....input split, block size? So for example for 1G of data on HDFS with 64m block size get 16 blocks mapped to different map tasks?
jason hadoop wrote: > > The join package does a streaming merge sort between each part-X in your > input directories, > part-0000 will be handled a single task, > part-0001 will be handled in a single task > and so on > These jobs are essentially io bound, and hard to beat for performance. > > On Wed, Jun 24, 2009 at 2:09 PM, pmg <parmod.me...@gmail.com> wrote: > >> >> I have two files FileA (with 600K records) and FileB (With 2million >> records) >> >> FileA has a key which is same of all the records >> >> 123 724101722493 >> 123 781676672721 >> >> FileB has the same key as FileA >> >> 123 5026328101569 >> 123 5026328001562 >> >> Using hadoop join package I can create output file with tuples and cross >> product of FileA and FileB. >> >> 123 [724101722493,5026328101569] >> 123 [724101722493,5026328001562] >> 123 [781676672721,5026328101569] >> 123 [781676672721,5026328001562] >> >> How does CompositeInputFormat scale when we want to join 600K with 2 >> millions records. Does it run on the node with single map/reduce? >> >> Also how can I not write the result into a file instead input split the >> result into different nodes where I can compare the tuples e.g. comparing >> 724101722493 with 5026328101569 using some heuristics. >> >> thanks >> -- >> View this message in context: >> http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24192957.html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >> > > > -- > Pro Hadoop, a book to guide you from beginner to hadoop mastery, > http://www.amazon.com/dp/1430219424?tag=jewlerymall > www.prohadoopbook.com a community for Hadoop Professionals > > -- View this message in context: http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24196664.html Sent from the Hadoop core-user mailing list archive at Nabble.com.