On Tue, Aug 19, 2014 at 5:34 AM, J. Roeleveld <jo...@antarean.org> wrote:
> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>> > Hadoop is a very specialized tool. It does what it does very well,
>> > but if you want to use it for something other than map/reduce then
>> > consider carefully whether it is the right tool for the job.
>>
>> Agreed; unless you have decent hardware and can comfortably measure
>> your data in TB, it'll be quicker to use something else once you factor
>> in the administration time and learning curve.
>
> The benefit of clustering technologies is that you don't need high-end
> hardware to start with. You can use the old hardware you found
> collecting dust in the basement.
>
> The learning curve isn't as steep as it used to be. There are plenty of
> tools to make it easier to start using Hadoop.
>
As long as you're counting words and don't mind coding everything in
Java. :) I found that if you want to avoid Java, the available
documentation plummets, and I'm fairly sure the version I was using was
buggy - it seemed to be losing records in the sort/reduce phase. Or
perhaps I was just using it incorrectly, but the exact same code worked
fine when I ran it on a single host with a smaller dataset and simply
piped map | sort | reduce without Hadoop. The documentation on getting
Hadoop to talk to non-Java code over stdin/stdout was pretty sparse, so
it's quite possible I wasn't doing things right.

In the end my problem wasn't big enough to necessitate Hadoop, and I
used GNU parallel instead.

--
Rich
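For what it's worth, the single-host map | sort | reduce pipeline I mean
is the classic shell word count - a rough sketch, with made-up sample
input; tr plays the mapper (one word per line), sort groups the keys, and
uniq -c acts as the reducer:

```shell
#!/bin/sh
# Word count as map | sort | reduce on one host.
# Sample input is invented for illustration.
printf 'the quick brown fox\nthe lazy dog\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
# prints each word with its count, most frequent first ("the" appears twice)
```

Hadoop Streaming is the mechanism for running a mapper/reducer pair like
this over stdin/stdout on a cluster (the streaming jar's location varies
by install). GNU parallel can likewise fan the map stage out over chunks
of a file on a single box via its --pipepart/--block options, leaving
sort and the reduce step as a final sequential pass, which was plenty for
my data size.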