Re: Caching frequently map input files

Ted Dunning Mon, 11 Feb 2008 05:20:44 -0800


You should be looking at HDFS (part of hadoop) plus hbase or code that you
write yourself.

Hadoop is built in two parts.  One part is the distributed file system that
provides replication and similar functions.  You can access this file system
pretty easily from Java.  Your requirements are pretty easy to meet using
HDFS.  HDFS provides the capabilities to write files (once without
modification after creation) and read them.  You can also tell where pieces
of a file are.  Your files are small enough so that they will fit in a
single HDFS block so their location will be simple.  HDFS is highly reliable
and can provide very high read bandwidth, especially for processes that run
on the same nodes as the storage nodes.

The second part is the map-reduce framework.  This handles the invocation of
map-reduce jobs which are batch oriented and take several seconds to start.
Hadoop's map reduce framework is not suitable for your application because
of your real-time query requirement.

Hbase is a database layer similar to Google's bigtables.  It might be
helpful to you, but it sounds like your queries are more complex than what
hbase provides.

My guess is that to satisfy your needs, you need to have a farm of query
servers that read files from HDFS, but which run essentially forever.  You
will need a front end server that accepts queries and forwards sub-queries
on to machines which read data from HDFS at startup time and then just
handle sub-queries.  Your front end machine should handle sending
sub-queries to nodes that have the data in question in local storage if
possible and should handle sending sub-queries to a different machine if an
answer doesn't come back when expected.  At 10 per second, you probably
don't need anything fancy for your front end machine, but a second machine
for fail-over would be good practice.

Look at the Hadoop source code for examples of writing simple servers using
Jetty (which is pretty sweet for embedding HTTP accessible services).  All
of the service components in Hadoop use that style.

More comments are in-line below.

On 2/11/08 4:36 AM, "Shimi K" <[EMAIL PROTECTED]> wrote:

> Here is the information based on your questions and some more information
> about what I am trying to do.
> 
>> A) first and most importantly, is your program batch oriented, or is it
>> supposed to respond quickly to random requests?  If it is batch oriented,
>> then it is likely that map-reduce will help.  If it is intended to respond
>> to random requests, then it is unlikely to be a match.
> 
> 
> My program process ad hoc real time queries.

This means you should not use map-reduce.

>> B) would do you intend to have a very large number of small files (large
>> is  1 million files, very large is greater than 10 million) or are your files
>> very small (small is less than 10MB or so, very small is less than 1MB).
> 
> 
> Most of my files will be small files (around 10 MB). The number of files
> will be around 200 k. Is HDFS not suitable for this amount of files? Is
> there a better alternative?

HDFS will work just fine for this.  There are alternative systems, but it is
unlikely that any will work better for you than HDFS.

> C) how long a program startup time can you allow?  Hadoop's map-reduce is
>> oriented mostly around the batch processing of very large data sets which
>> means that a fairly lengthy startup time is acceptable and even desirable
>> if
>> it allows faster overall throughput.  If you can't stand a startup time of
>> 10 seconds or so, then you need a non-map-reduce design.
> 
> 
> If by program startup you mean cluster startup then I don't mind if it will
> take seconds or even minutes. If you mean job startup then every millisecond
> is important. The system response time needs to be in milliseconds.

I meant jobs startup.  You need to have query processors that run forever.

> D) if you need real-time queries, can you use hbase?
>> 
> What exactly do you mean? Do you mean Hadoop map reduce with Hbase instead
> of Hadoop map reduce with HDFS?

I mean hbase with storage in HDFS and no map-reduce except possibly to load
the hbase tables.

>> If you have such a program, then keeping track of what data is already in
>> memory is pretty easy and
>> using HDFS as a file store could be really good for your application.
> 
> How can I keep track of what data is already in memory?

If you have many servers running on HDFS nodes, then these servers can read
files that relate to queries that they get and keep those files in memory
until they get queries for different files and can't keep them any more.  If
the front-end server passes requests to servers that probably have the files
on local disk, then things should move very fast.

> I have a code that does a complex search on binary files (non text files). I
> need to build a system around this code that will meet the following
> requirements:
> * The system will get requests for search against all the files in the
> system.
> * The system will have 200 k of files. File size is around 10 Mb.
> * The response time should be in milliseconds.
> * The system needs to be able to response to multiple requests at the same
> time. (I am not sure how much but I assume it will be around 10 per second).

This should be easy to implement with the front-end/sub-query architecture
that I described.  It will be hard (impossible) to implement using
map-reduce.

> I figured that I can use Hadoop for this purpose. I know that Hadoop was
> built for batch processing but I thought I can use it anyway for my purpose.

This was a good idea.

> My plan was to do the search in the map part. I can split a file in the
> cluster and then reduce the search results that came from the same file.

This is a fine idea in principle, but the startup time for map reduce will
kill you.

> About the tmpfs suggestion that was mentioned here, Is it possible to to
> upload the HDFS files from each node HDFS to the node ramfs and then to do
> the map reduce on the ramfs?

This is a non-starter because of the map-reduce startup time.  If you have
custom worker nodes on each HDFS storage node, then this is irrelevant.

> I hardly need to update the file system but
> what will happen if I will want to delete, update or add file to the system?

Make sure that your front end knows about the updates and then have it ask
the workers to access the new versions of the files.

HDFS is a write-once file system.  You can delete an old file, but you can't
update.  You just write a new copy and delete the old one when you are sure
that nobody is still using it.  That leads to nice update semantics.

> Does any of you think that Hadoop is not the right choice for this kind of
> job? Can you suggest something better?

Hadoop is great for this.  Just not all of it!

Re: Caching frequently map input files

Reply via email to