On Wednesday 02 July 2008 19:51:57 David J. O'Dell wrote:
> Is anyone using hadoop for any part of the ETL process?
>
> Given its ability to process large amounts of log files this seems like
> a good fit.

Well, we are doing the following data flow:

1.) Webservers upload logfiles to S3.
2.) Hadoop jobs get started with a number of logfiles each. We use streaming.jar only, with a Python "framework" and a number of driver scripts for mapping, reducing (which is usually a completely generic behaviour assigned on a per-job basis, e.g. FirstOnly, SumValues, CollectSet), and later on applying the results to MySQL.
3.) The results get written to MySQL.
4.) Inside the hadoop cluster, certain data from MySQL that is needed for efficient reducing (you cannot count persons by sex if you do not know the sex of the person) is available as a REST-style HTTP service. Each node has its own squid, the HTTP services create as much cacheable content as possible, and the squids do ICP peering against all nodes.

It works fine for the most part, although from time to time there are problems. My current one is that hadoop behaves really badly on long lines. (I know it's not exactly a trivial thing to read an arbitrarily long line without knowing a limit beforehand. OTOH, Python does manage that for me without my losing too much sleep over it. Another of these situations where slow high-level languages overwhelm the low-level optimization champions.)

Andreas
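
For illustration only, a generic per-job reducer of the kind mentioned in step 2 might look roughly like the sketch below. This is not the actual framework. It assumes the usual streaming convention of tab-separated "key<TAB>value" lines arriving sorted by key, and the function names, the BEHAVIOURS table, and the comma-joined output of CollectSet are all hypothetical choices made for the example.

```python
from itertools import groupby


def first_only(values):
    """FirstOnly: keep the first value seen for a key, ignore the rest."""
    return next(iter(values))


def sum_values(values):
    """SumValues: treat every value as an integer and add them up."""
    return sum(int(v) for v in values)


def collect_set(values):
    """CollectSet: the distinct values, sorted and comma-joined (assumed format)."""
    return ",".join(sorted(set(values)))


# Hypothetical lookup table: the behaviour name would be chosen per job.
BEHAVIOURS = {
    "FirstOnly": first_only,
    "SumValues": sum_values,
    "CollectSet": collect_set,
}


def parse(lines):
    """Split each 'key<TAB>value' line; a line without a tab gets an empty value."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_stream(lines, behaviour):
    """Group consecutive lines by key and apply the named behaviour per group.

    Relies on the input already being sorted by key, which the streaming
    shuffle phase is assumed to guarantee.
    """
    reducer = BEHAVIOURS[behaviour]
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, reducer(value for _, value in group)
```

A reducer script passed to streaming.jar could then iterate over reduce_stream(sys.stdin, "SumValues") and print each key and result separated by a tab. Iterating over sys.stdin line by line is also what lets Python cope with very long lines without a preset limit.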