Hi all, I am experimenting with a number of different tools to find the best fit for my current problem. To simplify: I have a table with 12 columns (small numbers and booleans) that grows by 1M rows every 3 months. We need to keep the data for a long time (even if we archive every year, that is still at least 4M rows).
The problem is that a user wants to generate reports off this table. To phrase it in terms of MapReduce (map-reduce or map/reduce or mapreduce? :), the reduce is usually the same (a simple aggregation) but the map phase will change with each query; I've put a rough sketch of what I mean in the P.S. This isn't a high-concurrency requirement - it is likely to be one or two users running reports once a month. I realise the Hadoop and MapReduce architecture isn't designed for real-time analytics, but an execution time of minutes would be sufficient. The new rows will arrive in batches (maybe once a minute, maybe once a day, depending on the environment), and absolutely live data isn't essential.

My plan was to have lots and lots of (virtual) nodes with a small memory footprint (<1GB each) so that the map/reduce parallelism can be exploited as much as possible. I haven't quite worked out how to get the data into HDFS, or whether to use HBase or a single, ever-growing CSV file. I also don't quite understand how HDFS decides which piece of the data to hand to each map task. I realise the file is chunked into (by default) 64MB blocks, but surely it can't be as simple as every map process getting a full 64MB block? Do I need to split the file into chunks myself? If so, that is fine.

I realise this data could quite easily be managed by an existing RDBMS, but the data will grow very quickly and there are other reasons I don't want to go down that route. So, am I barking up the wrong tree with this? Is there a better solution? I have also evaluated MongoDB and CouchDB (both excellent for their use cases).

Many thanks,
Col
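
P.S. To make the map/reduce split concrete, here is a rough sketch of the kind of job I have in mind, assuming the table sits in HDFS as plain CSV lines and using the standard Hadoop 2.x org.apache.hadoop.mapreduce API. The class names, the column positions (group key in column 0, a boolean flag in column 3) and the filter itself are invented purely for illustration; only the mapper's filter would change from query to query, while the reducer (a count per group) stays the same.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReportJob {

        // Map: parse one CSV row, apply this query's (hypothetical) filter,
        // emit (groupKey, 1). This is the part that changes per query.
        public static class RowFilterMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] cols = line.toString().split(",");
                // Made-up query: column 0 is the group key, column 3 a boolean flag.
                if (cols.length >= 4 && Boolean.parseBoolean(cols[3])) {
                    outKey.set(cols[0]);
                    ctx.write(outKey, ONE);
                }
            }
        }

        // Reduce: the aggregation that stays the same across queries
        // (here simply a count per group key).
        public static class CountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable v : values) {
                    total += v.get();
                }
                ctx.write(key, new IntWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "monthly report");
            job.setJarByClass(ReportJob.class);
            job.setMapperClass(RowFilterMapper.class);
            job.setCombinerClass(CountReducer.class);
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // CSV data in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The combiner is just the reducer reused, which should be safe here because the aggregation is a plain sum of partial counts.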
