Colin:

This sounds like a reasonable case for Hive or Pig on Hadoop (assuming you're right about your data growth being "big"). Hive will let you query a table using something that resembles SQL, similar to an RDBMS, but it compiles down to MapReduce jobs. Tables are made up of one or more files in a directory. Files can be text or something like Hadoop's SequenceFile format.
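As a rough sketch of what that could look like (the table, columns, and filter here are made up, not from your schema), a Hive table over tab-delimited text files and a typical report query would be something like:

  -- Hypothetical table over a directory of tab-delimited text files.
  CREATE TABLE orders (
    order_ts STRING,
    region   STRING,
    amount   DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

  -- A "simple aggregation" style report; Hive turns this into MapReduce.
  SELECT region, SUM(amount)
  FROM orders
  WHERE order_ts >= '2010-01-01'
  GROUP BY region;

The WHERE clause is the part that changes per query; the aggregation stays the same, which matches what you describe below.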
Generally, you want to build a system that batches up updates into a file and then moves that file into a directory so it can be queried by Hive (again, or Pig). Given that both tools operate on the entire directory, the data in the files will be seen as one large dataset (very common in Hadoop MR jobs).

I'll answer some other questions inline.

On Fri, Apr 16, 2010 at 2:26 PM, Colin Yates <[email protected]> wrote:

> The problem is that a user wants to generate reports off this table. To phrase it in terms of mapreduce (map-reduce or map/reduce or mapreduce? :) The reduce is usually the same (a simple aggregation) but the map phase will change with each query. This isn't a high concurrency requirement - it is likely to be one or two users doing it once a month.

This is more than fine and can be handled by Hadoop.

> I realise the hadoop and mapreduce architecture isn't designed for real-time analytics, but an execution time of minutes would be sufficient.

It will depend on the dataset size, but with this amount of data, minutes should be easy enough to achieve.

> The new rows will come in batch (maybe once a minute, maybe once a day depending on the environment) and absolutely live data isn't essential.

As I mentioned above, batch updates into larger files. Rather than one file a minute, aim for something like a few hundred megabytes (or even gigabytes) per file, or one file a day, whichever comes first. Something like that gives a reasonable batch size. (There's a rough sketch of the load step below, after the splits discussion.)

> My plan was to have lots and lots of (virtual) nodes with a small memory footprint (<1GB) so that the parallelisation of map/reduce can be utilised as much as possible.

That's going to be hard. Hadoop is (mostly) pure Java and can get memory hungry. Having machines with < 1GB of memory is going to be tough to pull off. Expect to "spend" about 4GB per node in a cluster with a few cores. See http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

Virtualization hurts Hadoop in almost all cases.

> Haven't quite thought out how to get the data into hdfs, whether to use HBase or a single, ever-growing CSV.

HBase is generally more useful when you need random access to individual records. It doesn't sound like you want / need that. Go with a directory of files with X-terminated fields and \n-terminated records and make your life easier.

> I don't quite get how hdfs figures out which bit to give to each map job. I realise the file is chunked into (by default) 64MB files but it can't be as simple as every map process gets the full 64MB file? Do I need to split the file into chunks myself? If so, that is fine.

The short answer is that files are split into blocks purely by size and spread around the cluster. When an MR job processes a bunch of files, Hadoop calculates the number of splits (roughly) based on the number of HDFS blocks that make up those files. Records are then reconstructed from these splits; records that span blocks are reconstructed for you, the way you'd expect. If you use one of the existing formats (like TextInputFormat for text or SequenceFileInputFormat for SequenceFiles) this splitting is transparent to you and you'll naturally get what you want - parallel processing of N files split into M chunks, with each line processed once and only once.
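Tying that back to the batching advice above: once a batch file is sitting in HDFS, you can fold it into the (hypothetical) orders table from the earlier sketch with something like this (the path is made up):

  -- Moves the batch file into the table's directory so it becomes
  -- part of the dataset seen by subsequent queries.
  LOAD DATA INPATH '/staging/orders-batch-001.tsv' INTO TABLE orders;

If you set the table up as an external table over a directory you control, simply moving the file into that directory (e.g. with hadoop fs -mv) accomplishes the same thing.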
> I realise this data is quite easily managed by existing RDBMs but the data will grow very very quickly, and there are other reasons I don't want to go down that route.

Honestly, 4M rows per year is tiny. The reason to use something like Hadoop is if the data is tough to squeeze into an RDBMS or - and it sounds like this might be your case - your queries are generally "full table scan" type operations over a moderate amount of data that can easily be performed in parallel, or that would otherwise drown the db. Again, I think Hive is what you're after (simple summation, variable filtering logic).

> So, am I barking up the wrong tree with this? Is there a better solution? I have also evaluated mongo-db and couchDB (both excellent for their use-cases).

Document-oriented databases are another beast. Definitely useful, but I don't know that you need those access patterns here.

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
