Re: Caching frequently map input files

Shimi K Mon, 11 Feb 2008 04:36:37 -0800

Here is the information based on your questions and some more information
about what I am trying to do.


> A) first and most importantly, is your program batch oriented, or is it
> supposed to respond quickly to random requests?  If it is batch oriented,
> then it is likely that map-reduce will help.  If it is intended to respond
> to random requests, then it is unlikely to be a match.


My program processes ad hoc real-time queries.


> B) do you intend to have a very large number of small files (large is more
> than 1 million files, very large is greater than 10 million) or are your
> files very small (small is less than 10MB or so, very small is less than
> 1MB)?  If you need a very large number of files, you need to redesign your
> problem or look for a different file store.  If you are working with very
> small files that nevertheless fit into memory, then you may need to
> concatenate files together to get larger files.


Most of my files will be small (around 10 MB each). There will be around
200 k files. Is HDFS unsuitable for this number of files? Is there a better
alternative?

> C) how long a program startup time can you allow?  Hadoop's map-reduce is
> oriented mostly around the batch processing of very large data sets which
> means that a fairly lengthy startup time is acceptable and even desirable
> if
> it allows faster overall throughput.  If you can't stand a startup time of
> 10 seconds or so, then you need a non-map-reduce design.


If by program startup you mean cluster startup, then I don't mind if it
takes seconds or even minutes. If you mean job startup, then every
millisecond is important. The system response time needs to be in
milliseconds. I can't waste seconds on a search that takes milliseconds.

> D) if you need real-time queries, can you use hbase?
>
What exactly do you mean? Do you mean Hadoop map reduce with Hbase instead
of Hadoop map reduce with HDFS?


> Based on what you have said so far, it sounds like you either have batch
> oriented input for relatively small batches of inputs (less than 100,000
> or
> so) or you have a real-time query requirement.
>
I have a real-time query requirement.


> In either case, you may need to have a program that runs semi-permanently.

What do you mean by "program that runs semi-permanently"?

> If you have such a program, then keeping track of what data is already in
> memory is pretty easy and using HDFS as a file store could be really good
> for your application.

How can I keep track of what data is already in memory?

> One way to get such a long-lived program is to simply use hbase (if you
> can).

I am not sure that I can, because my knowledge of Hbase is limited. Do you
mean Hbase instead of HDFS?


I have code that performs a complex search on binary (non-text) files. I
need to build a system around this code that meets the following
requirements:
* The system will receive search requests that run against all the files in
the system.
* The system will hold about 200 k files. Each file is around 10 MB.
* The response time should be in milliseconds.
* The system needs to be able to respond to multiple requests at the same
time. (I am not sure how many, but I assume around 10 per second.)
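
To put rough numbers on these requirements, here is a back-of-the-envelope
sizing sketch in Python. The file count and file size come from the list
above. The 16 GB of usable RAM per node and the helper names are assumptions
for illustration only, not figures from a real cluster.

```python
import math

NUM_FILES = 200_000        # from the requirements above
FILE_SIZE_MB = 10          # from the requirements above
RAM_PER_NODE_GB = 16       # ASSUMPTION: RAM per node available for file data


def total_data_gb(num_files: int, file_size_mb: float) -> float:
    """Total size of the data set in GB (decimal units, 1 GB = 1000 MB)."""
    return num_files * file_size_mb / 1000


def nodes_needed(total_gb: float, ram_per_node_gb: float, replication: int = 1) -> int:
    """Minimum number of nodes needed to hold every file in RAM."""
    return math.ceil(total_gb * replication / ram_per_node_gb)


total = total_data_gb(NUM_FILES, FILE_SIZE_MB)
print(f"total data: {total:.0f} GB")
print(f"nodes needed: {nodes_needed(total, RAM_PER_NODE_GB)}")
```

With these assumed figures the data set is about 2 TB, so roughly 125 nodes
would be needed just to hold one copy of every file in RAM.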

I figured that I could use Hadoop for this purpose. I know that Hadoop was
built for batch processing, but I thought I could use it anyway. My plan was
to do the search in the map phase. I can split a file across the cluster and
then reduce the search results that came from the same file. Or, since I
need to search all the files, I can tell Hadoop not to split my files, and
each node will do the search on a different file. If we leave the
replication factor out of this discussion, each node will always search the
same files. Since the files can fit into the machine's RAM, I do not want
Hadoop to waste time loading those files over and over again.
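
To make clearer what I mean by not loading the files over and over, here is
a minimal sketch (plain Python, not Hadoop code) of a long-lived worker
process that reads each file once and keeps it resident. The class name
ResidentFileStore and its methods are hypothetical, and the search function
passed in stands in for my real search code.

```python
from pathlib import Path
from typing import Callable, Dict, List


class ResidentFileStore:
    """Keeps file contents in RAM for the lifetime of the process."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def load(self, path: str) -> None:
        """Read a file from disk only if it is not already resident."""
        if path not in self._data:
            self._data[path] = Path(path).read_bytes()

    def reload(self, path: str) -> None:
        """Re-read a file from disk, e.g. after it was updated."""
        self._data[path] = Path(path).read_bytes()

    def evict(self, path: str) -> None:
        """Drop a file from memory, e.g. after it was deleted."""
        self._data.pop(path, None)

    def is_resident(self, path: str) -> bool:
        return path in self._data

    def resident_files(self) -> List[str]:
        """Sorted list of the paths currently held in memory."""
        return sorted(self._data)

    def search_all(self, search: Callable[[bytes], bool]) -> List[str]:
        """Run the search over every resident file; return matching paths."""
        return [p for p, blob in sorted(self._data.items()) if search(blob)]
```

Each node would run one such process over its own files and answer queries
from memory. Adding, updating or deleting a file then maps to load, reload
or evict, although how the process learns about such a change is left open
here.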

About the tmpfs suggestion that was mentioned here: is it possible to copy
the HDFS files from each node's HDFS storage to that node's ramfs and then
run the map reduce on the ramfs? I hardly ever need to update the file
system, but what will happen if I want to delete, update or add a file?

Do any of you think that Hadoop is not the right choice for this kind of
job? Can you suggest something better?

On Feb 11, 2008 10:03 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> This begins to sound like you are trying to do something that is not a very
> nice match to hadoop's map-reduce framework at all.  It may be that HDFS
> would be very helpful to you, but map-reduce may not be so much help.
>
> Here are a few questions about your application that will help determine
> whether your application is a good map-reduce candidate:
>
> A) first and most importantly, is your program batch oriented, or is it
> supposed to respond quickly to random requests?  If it is batch oriented,
> then it is likely that map-reduce will help.  If it is intended to respond
> to random requests, then it is unlikely to be a match.
>
> B) do you intend to have a very large number of small files (large is more
> than 1 million files, very large is greater than 10 million) or are your
> files very small (small is less than 10MB or so, very small is less than
> 1MB)?  If you need a very large number of files, you need to redesign your
> problem or look for a different file store.  If you are working with very
> small files that nevertheless fit into memory, then you may need to
> concatenate files together to get larger files.
>
> C) how long a program startup time can you allow?  Hadoop's map-reduce is
> oriented mostly around the batch processing of very large data sets which
> means that a fairly lengthy startup time is acceptable and even desirable
> if
> it allows faster overall throughput.  If you can't stand a startup time of
> 10 seconds or so, then you need a non-map-reduce design.
>
> D) if you need real-time queries, can you use hbase?
>
> Based on what you have said so far, it sounds like you either have batch
> oriented input for relatively small batches of inputs (less than 100,000
> or
> so) or you have a real-time query requirement.  In either case, you may
> need
> to have a program that runs semi-permanently. If you have such a program,
> then keeping track of what data is already in memory is pretty easy and
> using HDFS as a file store could be really good for your application.  One
> way to get such a long-lived program is to simply use hbase (if you can).
> If that doesn't work for you, you might try using the map-file structure
> or
> lucene to implement your own long-running distributed search system.
>
> If you can be more specific about what you are trying to do, you are
> likely
> to get better answers.
>
> On 2/10/08 10:05 PM, "Shimi K" <[EMAIL PROTECTED]> wrote:
>
> > I chose Hadoop more for the distributed calculation than for the support
> > for huge files, and my files do fit into memory.
> > I have a lot of small files and my system needs to search for something
> > in those files very fast. I figured I can distribute the files on a Hadoop
> > cluster and then use the distributed calculation to do the search in
> > parallel on as many files as possible. This way I would be able to return
> > a result faster than if I had used one machine.
> >
> > Is there a way to tell which files are in memory?
> >
> >
> > On Feb 10, 2008 10:33 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> But if your files DO fit into memory then the datanodes that have copies
> >> of the blocks of your file will probably still have them in memory and
> >> since maps are typically data local, you will benefit as much as possible.
> >>
> >>
> >> On 2/10/08 11:17 AM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:
> >>
> >>>> Does Hadoop cache frequently used/LRU/MRU map input files? Or does it
> >>>> load files from the disk each time a file is needed, no matter if it
> >>>> was the same file that was required by the last job on the same node?
> >>>>
> >>>
> >>> There is no concept of caching input files across jobs.
> >>>
> >>> Hadoop is geared towards dealing with _huge_ amounts of data which
> >>> don't fit into memory anyway... and hence doing it across jobs is moot.
> >>
> >>
>
>
