Re: Caching frequently map input files

Shimi K Mon, 11 Feb 2008 06:32:18 -0800

If I understood correctly, Hadoop already does most of what you wrote,
except for the front end and the caching at the nodes. In any case, if I
use Hadoop I will need a front end that sends sub-queries to Hadoop (I have
already written that part). To do such a thing I will need to implement
something very similar to Hadoop map-reduce, but with a faster job startup
time. Why does it take Hadoop so long to start a job?


On Feb 11, 2008 3:20 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
>
> You should be looking at HDFS (part of hadoop) plus hbase or code that you
> write yourself.
>
> Hadoop is built in two parts.  One part is the distributed file system
> that provides replication and similar functions.  You can access this
> file system pretty easily from Java.  Your requirements are pretty easy
> to meet using HDFS.  HDFS lets you write files (once, with no
> modification after creation) and read them.  You can also tell where the
> pieces of a file are.  Your files are small enough to fit in a single
> HDFS block, so their location will be simple.  HDFS is highly reliable
> and can provide very high read bandwidth, especially for processes that
> run on the same nodes as the storage nodes.
>
> The second part is the map-reduce framework.  This handles the
> invocation of map-reduce jobs, which are batch oriented and take several
> seconds to start.  Hadoop's map-reduce framework is not suitable for your
> application because of your real-time query requirement.
>
> Hbase is a database layer similar to Google's Bigtable.  It might be
> helpful to you, but it sounds like your queries are more complex than what
> hbase provides.
>
> My guess is that to satisfy your needs, you need to have a farm of query
> servers that read files from HDFS, but which run essentially forever.  You
> will need a front end server that accepts queries and forwards sub-queries
> on to machines which read data from HDFS at startup time and then just
> handle sub-queries.  Your front end machine should handle sending
> sub-queries to nodes that have the data in question in local storage if
> possible and should handle sending sub-queries to a different machine
> if an answer doesn't come back when expected.  At 10 per second, you probably
> don't need anything fancy for your front end machine, but a second machine
> for fail-over would be good practice.
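
The front-end behaviour described above (prefer a worker that has the file
on local disk, fall back to another worker if no answer arrives in time)
could be sketched roughly as follows. This is only an illustration under
stated assumptions: SubQueryDispatcher, the localFiles map and the transport
function are hypothetical names, not part of Hadoop, and the transport
stands in for whatever RPC or HTTP call actually reaches a query server.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiFunction;

// Hypothetical front-end dispatcher; none of these names come from Hadoop.
class SubQueryDispatcher {
    private final List<String> workers;                 // all query servers
    private final Map<String, Set<String>> localFiles;  // worker -> files on its local disk
    private final BiFunction<String, String, String> transport; // (worker, file) -> answer
    private final long timeoutMillis;

    SubQueryDispatcher(List<String> workers,
                       Map<String, Set<String>> localFiles,
                       BiFunction<String, String, String> transport,
                       long timeoutMillis) {
        this.workers = workers;
        this.localFiles = localFiles;
        this.transport = transport;
        this.timeoutMillis = timeoutMillis;
    }

    // Workers that hold the file locally come first; the rest follow as fallbacks.
    List<String> candidatesFor(String file) {
        List<String> local = new ArrayList<>();
        List<String> remote = new ArrayList<>();
        for (String w : workers) {
            Set<String> files = localFiles.getOrDefault(w, Collections.emptySet());
            if (files.contains(file)) {
                local.add(w);
            } else {
                remote.add(w);
            }
        }
        local.addAll(remote);
        return local;
    }

    // Try each candidate in turn; move on if it fails or does not answer in time.
    String query(String file) {
        for (String worker : candidatesFor(file)) {
            CompletableFuture<String> call =
                CompletableFuture.supplyAsync(() -> transport.apply(worker, file));
            try {
                return call.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                call.cancel(true); // stop waiting on this worker and try the next one
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for " + worker, e);
            }
        }
        throw new IllegalStateException("no worker answered for " + file);
    }
}
```

In a real deployment the localFiles map would presumably be filled from the
block locations that HDFS reports for each file; here it is just supplied by
the caller.
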
>
> Look at the Hadoop source code for examples of writing simple servers
> using Jetty (which is pretty sweet for embedding HTTP accessible
> services).  All of the service components in Hadoop use that style.
>
> More comments are in-line below.
>
>
>
>
> On 2/11/08 4:36 AM, "Shimi K" <[EMAIL PROTECTED]> wrote:
>
> > Here is the information based on your questions and some more
> > information about what I am trying to do.
> >
> >> A) first and most importantly, is your program batch oriented, or is it
> >> supposed to respond quickly to random requests?  If it is batch
> >> oriented, then it is likely that map-reduce will help.  If it is
> >> intended to respond to random requests, then it is unlikely to be a
> >> match.
> >
> >
> > My program processes ad hoc real-time queries.
>
> This means you should not use map-reduce.
>
> >> B) do you intend to have a very large number of small files (large
> >> is 1 million files, very large is greater than 10 million) or are your
> >> files very small (small is less than 10MB or so, very small is less
> >> than 1MB)?
> >
> >
> > Most of my files will be small (around 10 MB). The number of files
> > will be around 200 k. Is HDFS not suitable for this number of files? Is
> > there a better alternative?
>
> HDFS will work just fine for this.  There are alternative systems, but it
> is unlikely that any will work better for you than HDFS.
>
> >> C) how long a program startup time can you allow?  Hadoop's map-reduce
> >> is oriented mostly around the batch processing of very large data sets
> >> which means that a fairly lengthy startup time is acceptable and even
> >> desirable if it allows faster overall throughput.  If you can't stand a
> >> startup time of 10 seconds or so, then you need a non-map-reduce design.
> >
> >
> > If by program startup you mean cluster startup, then I don't mind if it
> > takes seconds or even minutes. If you mean job startup, then every
> > millisecond is important. The system response time needs to be in
> > milliseconds.
>
> I meant job startup.  You need to have query processors that run forever.
>
> >> D) if you need real-time queries, can you use hbase?
> >
> > What exactly do you mean? Do you mean Hadoop map reduce with Hbase
> > instead of Hadoop map reduce with HDFS?
>
> I mean hbase with storage in HDFS and no map-reduce except possibly to
> load the hbase tables.
>
> >> If you have such a program, then keeping track of what data is already
> >> in memory is pretty easy and using HDFS as a file store could be really
> >> good for your application.
> >
> > How can I keep track of what data is already in memory?
>
> If you have many servers running on HDFS nodes, then these servers can
> read files that relate to queries that they get and keep those files in
> memory until they get queries for different files and can't keep them any
> more.  If the front-end server passes requests to servers that probably
> have the files on local disk, then things should move very fast.
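
One way a long-running query server might keep files in memory "until it
can't keep them any more" is a size-bounded, least-recently-used cache. The
sketch below is an assumption, not something taken from Hadoop: FileCache and
its loader function are hypothetical, with the loader standing in for
whatever code reads a whole file from HDFS.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical worker-side cache; the loader stands in for a read from HDFS.
class FileCache {
    private final long maxBytes;
    private final Function<String, byte[]> loader;
    private long usedBytes = 0;
    // accessOrder=true makes iteration start at the least recently used entry.
    private final LinkedHashMap<String, byte[]> files =
        new LinkedHashMap<>(16, 0.75f, true);

    FileCache(long maxBytes, Function<String, byte[]> loader) {
        this.maxBytes = maxBytes;
        this.loader = loader;
    }

    // Return the file contents, loading them only on a cache miss.
    synchronized byte[] get(String path) {
        byte[] data = files.get(path); // a hit also marks the file as recently used
        if (data != null) {
            return data;
        }
        data = loader.apply(path);
        files.put(path, data);
        usedBytes += data.length;
        evictUntilWithinBudget(path);
        return data;
    }

    // containsKey does not change the access order, so this is a pure check.
    synchronized boolean isCached(String path) {
        return files.containsKey(path);
    }

    // Drop least recently used files until the cache fits in maxBytes again.
    private void evictUntilWithinBudget(String justLoaded) {
        Iterator<Map.Entry<String, byte[]>> it = files.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            if (eldest.getKey().equals(justLoaded)) {
                continue; // never evict the file that was just loaded
            }
            usedBytes -= eldest.getValue().length;
            it.remove();
        }
    }
}
```

Keeping track of what is in memory then amounts to asking the cache; a
worker could also report its cached paths to the front end so that
sub-queries go to a machine that already has the data loaded.
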
>
> > I have code that does a complex search on binary (non-text) files. I
> > need to build a system around this code that will meet the following
> > requirements:
> > * The system will get requests to search against all the files in the
> > system.
> > * The system will have 200 k files. File size is around 10 MB.
> > * The response time should be in milliseconds.
> > * The system needs to be able to respond to multiple requests at the
> > same time. (I am not sure how many, but I assume it will be around 10
> > per second.)
>
> This should be easy to implement with the front-end/sub-query architecture
> that I described.  It will be hard (impossible) to implement using
> map-reduce.
>
> > I figured that I can use Hadoop for this purpose. I know that Hadoop was
> > built for batch processing but I thought I can use it anyway for my
> > purpose.
>
> This was a good idea.
>
> > My plan was to do the search in the map part. I can split a file in the
> > cluster and then reduce the search results that came from the same file.
>
> This is a fine idea in principle, but the startup time for map reduce will
> kill you.
>
> > About the tmpfs suggestion that was mentioned here, is it possible to
> > copy the HDFS files on each node from HDFS to that node's ramfs and then
> > do the map reduce on the ramfs?
>
> This is a non-starter because of the map-reduce startup time.  If you have
> custom worker nodes on each HDFS storage node, then this is irrelevant.
>
> > I hardly need to update the file system, but what will happen if I want
> > to delete, update or add a file to the system?
>
> Make sure that your front end knows about the updates and then have it ask
> the workers to access the new versions of the files.
>
> HDFS is a write-once file system.  You can delete an old file, but you
> can't update.  You just write a new copy and delete the old one when you
> are sure that nobody is still using it.  That leads to nice update
> semantics.
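
The "write a new copy, delete the old one once nobody is using it" pattern
can be tracked with a small catalog on the front end. The following is a
minimal sketch under assumptions: VersionedCatalog and the example paths are
hypothetical, paths are plain strings, and the actual write and delete calls
against HDFS are left to the caller.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical catalog kept by the front end; it never touches the file system itself.
class VersionedCatalog {
    private final Map<String, String> current = new HashMap<>();  // logical name -> current path
    private final Map<String, Integer> readers = new HashMap<>(); // path -> queries reading it
    private final Set<String> retired = new HashSet<>();          // superseded paths

    // Point the logical name at a newly written copy; returns the old path or null.
    synchronized String publish(String name, String newPath) {
        String old = current.put(name, newPath);
        if (old != null) {
            retired.add(old);
        }
        return old;
    }

    // Called when a query starts reading; returns the path it should use.
    synchronized String acquire(String name) {
        String path = current.get(name);
        if (path != null) {
            readers.merge(path, 1, Integer::sum);
        }
        return path;
    }

    // Called when a query finishes; the entry is removed when the count reaches zero.
    synchronized void release(String path) {
        readers.computeIfPresent(path, (p, n) -> n > 1 ? n - 1 : null);
    }

    // A superseded copy may be deleted once no query is still reading it.
    synchronized boolean safeToDelete(String path) {
        return retired.contains(path) && !readers.containsKey(path);
    }
}
```

New queries always see the latest published copy, while queries that started
earlier keep reading the version they acquired.
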
>
> > Do any of you think that Hadoop is not the right choice for this kind
> > of job? Can you suggest something better?
>
> Hadoop is great for this.  Just not all of it!
>
>
