We'd been using text input and output exclusively, but eventually realized some efficiency improvements by using slightly more sophisticated classes specific to our application.
Our main use of Hadoop is processing activity logs from a fleet of servers. We get about 6GB of compressed data per day, and we run reports along different dimensions in those logs. At first we made a separate pass through the data for each dimension. It turns out that if we include the dimension as part of the key, we can do the first MR job we need in a single pass. That slightly improved version of our reports still uses text input and output for keys and values, though.

Where we use a custom class is when we process these intermediate results into a final summary. Our Summarizer class is the OutputValueClass for our job, though the output format is still text (which calls the toString method of our Summarizer). Our final MR job operates on Summarizer objects, deciding what to do based on the dimension label in the key and on certain characteristics of the key and value from the initial MR job. This lets us keep track of 4 independent tallies in our summarizing MR job.

The OutputValueClass was fairly easy to write, though our jobs are fairly straightforward. It's easy to see how it could be extended in really interesting ways to do more.

-Colin

On Fri, Jun 20, 2008 at 1:10 PM, Mathos Marcer <[EMAIL PROTECTED]> wrote:
> Presumedly like most I've started off with mainly using "Text" based
> input and output formatters and using key and values as Text or
> IntWritable. I've been looking more into the other formatters and
> writable classes and wondering what they would do for me. To help
> spur some best practices and lessons learned conversations: What are
> the benefits of the other formatters? And benefits of MapFiles and
> SequenceFiles? What are people out there using or have found gave
> them the greatest benefits?
>
> ==
> MM
>
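
For anyone wondering what a custom value class like this involves, here is a minimal sketch of something along the lines of a Summarizer with 4 tallies. It is not our actual code: the tally indices and the add/merge/get helpers are made up for illustration. To keep it compilable without the Hadoop jars it does not declare "implements org.apache.hadoop.io.Writable", but write and readFields follow that interface's signatures, so adding the declaration should be the only change needed.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch of a value class holding four independent tallies.
// In a real job this would declare "implements org.apache.hadoop.io.Writable";
// write() and readFields() below match that interface's method signatures.
class Summarizer {
    // Number of tallies (the meaning of each slot is an assumption left to the job).
    static final int TALLY_COUNT = 4;

    private final long[] tallies = new long[TALLY_COUNT];

    // Add an amount to one tally, e.g. from the reducer as it inspects key and value.
    public void add(int index, long amount) {
        tallies[index] += amount;
    }

    // Fold another partial summary into this one, slot by slot.
    public void merge(Summarizer other) {
        for (int i = 0; i < TALLY_COUNT; i++) {
            tallies[i] += other.tallies[i];
        }
    }

    public long get(int index) {
        return tallies[index];
    }

    // Serialize the tallies in a fixed order.
    public void write(DataOutput out) throws IOException {
        for (long tally : tallies) {
            out.writeLong(tally);
        }
    }

    // Read the tallies back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        for (int i = 0; i < TALLY_COUNT; i++) {
            tallies[i] = in.readLong();
        }
    }

    // Text output calls toString, so this decides what a line of the report looks like.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < TALLY_COUNT; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(tallies[i]);
        }
        return sb.toString();
    }
}
```

The point of the fixed-order write/readFields pair is that the framework can move the object around between tasks, while toString only matters at the very end when the text output is written.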
