Colin:

This sounds like a reasonable case for Hive or Pig on Hadoop (assuming you're right about your data growth being "big"). Hive will let you query a table using something that resembles SQL, similar to an RDBMS, but it compiles down to MapReduce jobs. Tables are made up of one or more files in a directory. Files can be text or something like Hadoop's SequenceFile format.
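As a rough sketch of what that could look like (the table, columns, and filter here are made up, not from your schema), a Hive table over tab-delimited text files and a typical report query would be something like:

  -- Hypothetical table over a directory of tab-delimited text files.
  CREATE TABLE orders (
    order_ts STRING,
    region   STRING,
    amount   DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

  -- A "simple aggregation" style report; Hive turns this into MapReduce.
  SELECT region, SUM(amount)
  FROM orders
  WHERE order_ts >= '2010-01-01'
  GROUP BY region;

The WHERE clause is the part that changes per query; the aggregation stays the same, which matches what you describe below.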
Generally, you want to build a system that batches up updates into a file and then moves that file into a directory so it can be queried by Hive (again, or Pig). Given that both tools operate on the entire directory, the data in the files will be seen as one large dataset (very common in Hadoop MR jobs).

I'll answer some other questions inline.

On Fri, Apr 16, 2010 at 2:26 PM, Colin Yates <[email protected]> wrote:

> The problem is that a user wants to generate reports off this table. To phrase it in terms of mapreduce (map-reduce or map/reduce or mapreduce? :) The reduce is usually the same (a simple aggregation) but the map phase will change with each query. This isn't a high concurrency requirement - it is likely to be one or two users doing it once a month.

This is more than fine and can be handled by Hadoop.

> I realise the hadoop and mapreduce architecture isn't designed for real-time analytics, but an execution time of minutes would be sufficient.

It will depend on the dataset size, but with this amount of data, minutes should be easy enough to achieve.

> The new rows will come in batch (maybe once a minute, maybe once a day depending on the environment) and absolutely live data isn't essential.

As I mentioned above, batch updates into larger files. Rather than one file a minute, aim for something like a few hundred megabytes (or even gigabytes) per file, or one file a day, whichever comes first. Something like that gives a reasonable batch size. (There's a rough sketch of the load step below, after the splits discussion.)

> My plan was to have lots and lots of (virtual) nodes with a small memory footprint (<1GB) so that the parallelisation of map/reduce can be utilised as much as possible.

That's going to be hard. Hadoop is (mostly) pure Java and can get memory hungry. Having machines with < 1GB of memory is going to be tough to pull off. Expect to "spend" about 4GB per node in a cluster with a few cores. See http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

Virtualization hurts Hadoop in almost all cases.

> Haven't quite thought out how to get the data into hdfs, whether to use HBase or a single, ever-growing CSV.

HBase is generally more useful when you need random access to individual records. It doesn't sound like you want / need that. Go with a directory of files with X-terminated fields and \n-terminated records and make your life easier.

> I don't quite get how hdfs figures out which bit to give to each map job. I realise the file is chunked into (by default) 64MB files but it can't be as simple as every map process gets the full 64MB file? Do I need to split the file into chunks myself? If so, that is fine.

The short answer is that files are split into blocks purely by size and spread around the cluster. When an MR job processes a bunch of files, Hadoop calculates the number of splits (roughly) based on the number of HDFS blocks that make up those files. Records are then reconstructed from these splits; records that span blocks are reconstructed for you, the way you'd expect. If you use one of the existing formats (like TextInputFormat for text or SequenceFileInputFormat for SequenceFiles) this splitting is transparent to you and you'll naturally get what you want - parallel processing of N files split into M chunks, with each line processed once and only once.
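Tying that back to the batching advice above: once a batch file is sitting in HDFS, you can fold it into the (hypothetical) orders table from the earlier sketch with something like this (the path is made up):

  -- Moves the batch file into the table's directory so it becomes
  -- part of the dataset seen by subsequent queries.
  LOAD DATA INPATH '/staging/orders-batch-001.tsv' INTO TABLE orders;

If you set the table up as an external table over a directory you control, simply moving the file into that directory (e.g. with hadoop fs -mv) accomplishes the same thing.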
> I realise this data is quite easily managed by existing RDBMs but the data will grow very very quickly, and there are other reasons I don't want to go down that route.

Honestly, 4M rows per year is tiny. The reason to use something like Hadoop is if the data is tough to squeeze into an RDBMS or - and it sounds like this might be your case - your queries are generally "full table scan" type operations over a moderate amount of data that can easily be performed in parallel, or that would otherwise drown the db. Again, I think Hive is what you're after (simple summation, variable filtering logic).

> So, am I barking up the wrong tree with this? Is there a better solution? I have also evaluated mongo-db and couchDB (both excellent for their use-cases).

Document-oriented databases are another beast. Definitely useful, but I don't know that you need those access patterns here.

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
