If I may, I'd like to ask about that statement a little more. I think most of us agree that Hadoop handles very large datasets (tens of TB and up) exceptionally well, for several reasons. And I've heard multiple times that Hadoop does not handle small datasets well, and that traditional tools like RDBMSs and ETL are better suited for them. But what if I have a mixture of data? I work with datasets that range from 1 GB to 10 TB in size, and the work requires all of that data to be grouped and aggregated. I would think that in such an environment, with vast differences in dataset sizes, it would be better to keep everything in Hadoop and do all the work there, versus moving the small datasets out of Hadoop to do some processing, loading them back into Hadoop to group with the larger datasets, then possibly taking them back out for more processing, and back in again. I just don't see how the run times for jobs on small files in Hadoop could be so long that they wouldn't be offset by the cost of moving things back and forth. Or is the performance on small files in Hadoop really that bad? Thoughts?
-----Original Message-----
From: samir das mohapatra [mailto:samir.help...@gmail.com]
Sent: Wednesday, May 30, 2012 8:33 AM
To: common-user@hadoop.apache.org
Subject: Re: How to mapreduce in the scenario

Yes, Hadoop is only for huge-dataset computation. It may not be good for small datasets.

On Wed, May 30, 2012 at 6:53 AM, liuzhg <liu...@cernet.com> wrote:
> Hi,
>
> Mike, Nitin, Devaraj, Soumya, samir, Robert,
>
> Thank you all for your suggestions.
>
> Actually, I want to know whether Hadoop has any performance advantage
> over a conventional database for solving this kind of problem (joining data).
>
> Best Regards,
>
> Gump
>
> On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee
> <soumya.sbaner...@gmail.com> wrote:
> Hi,
>
> You can also try the Hadoop reduce-side join functionality.
> Look into contrib/datajoin/hadoop-datajoin-*.jar for the base Map and
> Reduce classes that do the same.
>
> Regards,
> Soumya.
>
> On Tue, May 29, 2012 at 4:10 PM, Devaraj k <devara...@huawei.com> wrote:
>
> > Hi Gump,
> >
> > MapReduce fits well for solving these types (joins) of problems.
> >
> > I hope this will help you solve the described problem:
> >
> > 1. Map output key and value classes: write a map output key
> > class (Text.class) and value class (CombinedValue.class). The value
> > class should be able to hold the values from both files (a.txt and
> > b.txt), as shown below:
> >
> > class CombinedValue implements Writable
> > {
> >     String name;
> >     int age;
> >     String address;
> >     boolean isLeft; // flag to identify which file the record came from
> > }
> >
> > 2. Mapper: write a map() function that can parse records from both
> > files (a.txt, b.txt) and produce a common output key and value class.
> >
> > 3. Partitioner: write the partitioner so that it sends all
> > (key, value) pairs with the same key to the same reducer.
> >
> > 4. Reducer: in the reduce() function, you will receive the records
> > from both files, and you can combine them easily.
> >
> > Thanks
> > Devaraj
> >
> > ________________________________________
> > From: liuzhg [liu...@cernet.com]
> > Sent: Tuesday, May 29, 2012 3:45 PM
> > To: common-user@hadoop.apache.org
> > Subject: How to mapreduce in the scenario
> >
> > Hi,
> >
> > I wonder whether Hadoop can effectively solve the following problem:
> >
> > ==========================================
> > input files: a.txt, b.txt
> > result: c.txt
> >
> > a.txt:
> > id1,name1,age1,...
> > id2,name2,age2,...
> > id3,name3,age3,...
> > id4,name4,age4,...
> >
> > b.txt:
> > id1,address1,...
> > id2,address2,...
> > id3,address3,...
> >
> > c.txt:
> > id1,name1,age1,address1,...
> > id2,name2,age2,address2,...
> > ========================================
> >
> > I know that it can be done well by a database,
> > but I want to handle it with Hadoop if possible.
> > Can Hadoop meet the requirement?
> >
> > Any suggestion can help me. Thank you very much!
> >
> > Best Regards,
> >
> > Gump
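For readers following Devaraj's four steps, here is a minimal sketch of the reduce-side join logic in plain Java, with no Hadoop dependencies: the tagged map output stands in for CombinedValue, and a TreeMap grouping stands in for the shuffle that the partitioner guarantees. Class and method names are illustrative, not from any Hadoop API.

```java
import java.util.*;

// Conceptual sketch of a reduce-side join: the "map" phase tags each
// record with its source file, grouping by id simulates the shuffle,
// and the "reduce" phase combines the tagged records per key.
public class ReduceSideJoinSketch {

    // Map phase: parse a line from a.txt ("id,name,age") or b.txt
    // ("id,address") and emit (id, taggedRecord). The "A:"/"B:" tag
    // plays the role of the isLeft flag in CombinedValue.
    static Map.Entry<String, String> map(String line, boolean fromA) {
        int comma = line.indexOf(',');
        String id = line.substring(0, comma);
        String rest = line.substring(comma + 1);
        return new AbstractMap.SimpleEntry<>(id, (fromA ? "A:" : "B:") + rest);
    }

    // Shuffle + reduce: group tagged values by id, then join the sides.
    static List<String> join(List<String> aLines, List<String> bLines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String l : aLines) {
            Map.Entry<String, String> kv = map(l, true);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        for (String l : bLines) {
            Map.Entry<String, String> kv = map(l, false);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String a = null, b = null;
            for (String v : e.getValue()) {
                if (v.startsWith("A:")) a = v.substring(2);
                else b = v.substring(2);
            }
            // Inner join: keep only ids present in both files,
            // matching the c.txt example (id4 has no address row).
            if (a != null && b != null)
                out.add(e.getKey() + "," + a + "," + b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("id1,name1,age1", "id2,name2,age2",
                                       "id3,name3,age3", "id4,name4,age4");
        List<String> b = Arrays.asList("id1,address1", "id2,address2",
                                       "id3,address3");
        for (String line : join(a, b)) System.out.println(line);
    }
}
```

In a real job the grouping is done by the framework between the map and reduce phases, and the value class would implement Hadoop's Writable interface for serialization; the sketch only shows the data flow.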