Hi all,

I am experimenting with a number of different tools to find the best fit for
my current problem.  To simplify: I have a table with 12 columns (small
numbers and booleans) that gets 1M rows every 3 months.  We need to keep the
data for a long time (even if we archive every year, that is still at least
4M rows).

The problem is that a user wants to generate reports off this table.  To
phrase it in map/reduce terms (map-reduce or map/reduce or mapreduce? :),
the reduce is usually the same (a simple aggregation), but the map phase
will change with each query.  This isn't a high-concurrency requirement -
it is likely to be one or two users running reports once a month.
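
To make that concrete, here is roughly the shape I have in mind - a minimal
sketch only, with made-up column positions and a made-up filter (the filter
condition is the bit that would be rewritten for each report):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReportJob {

    // Map: parse one CSV row, apply the per-query filter, emit (group key, metric).
    public static class ReportMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");
            // Placeholder layout: col 0 = grouping code, col 3 = a boolean
            // flag, col 7 = the number being aggregated.  The condition
            // below is the part that changes from query to query.
            if (Boolean.parseBoolean(cols[3])) {
                ctx.write(new Text(cols[0]),
                          new DoubleWritable(Double.parseDouble(cols[7])));
            }
        }
    }

    // Reduce: the same simple aggregation every time (here, a sum per key).
    public static class ReportReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            ctx.write(key, new DoubleWritable(total));
        }
    }
}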

I realise the Hadoop and MapReduce architecture isn't designed for real-time
analytics, but an execution time measured in minutes would be sufficient.

The new rows will arrive in batches (maybe once a minute, maybe once a day,
depending on the environment), and completely live data isn't essential.
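
If I go the plain-files route, I imagine each batch would simply be copied
into HDFS when it arrives - something like the sketch below, with made-up
paths, and one file per batch for now (I haven't decided between that and
a single ever-growing file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0]: the local CSV for this batch, e.g. /tmp/latest-batch.csv
        Path localBatch = new Path(args[0]);
        // Hypothetical target directory in HDFS, one file per batch.
        Path target = new Path("/data/rows/" + localBatch.getName());

        fs.copyFromLocalFile(localBatch, target);
        fs.close();
    }
}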

My plan was to have lots and lots of (virtual) nodes with a small memory
footprint (<1GB each) so that the parallelisation offered by map/reduce can
be exploited as much as possible.

I haven't quite thought out how to get the data into HDFS beyond the rough
sketch above - whether to use HBase or a single, ever-growing CSV.  I also
don't quite get how HDFS figures out which piece of the data to give to each
map task.  I realise the file is chunked into (by default) 64MB blocks, but
surely it can't be as simple as every map process getting a full 64MB block?
Do I need to split the file into chunks myself?  If so, that is fine.
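
If it really is that simple - one map task per block-sized split - then I
would guess the driver only needs to point at the input directory and the
framework does the rest.  Something like this, again with made-up paths and
reusing the mapper/reducer classes sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReportDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "monthly-report");
        job.setJarByClass(ReportDriver.class);

        // Mapper and reducer from the earlier sketch.
        job.setMapperClass(ReportJob.ReportMapper.class);
        job.setReducerClass(ReportJob.ReportReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Point at the whole directory of batch files; my assumption is that
        // the input format derives roughly one split per HDFS block and runs
        // one map task per split, so no manual chopping is needed.
        FileInputFormat.addInputPath(job, new Path("/data/rows"));
        FileOutputFormat.setOutputPath(job, new Path("/reports/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}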

I realise this data could quite easily be managed by an existing RDBMS, but
it will grow very quickly, and there are other reasons I don't want to go
down that route.

So, am I barking up the wrong tree with this?  Is there a better solution?
I have also evaluated MongoDB and CouchDB (both excellent for their
use cases).

Many thanks,

Col
