Daniel wrote:
Also HDFS might be critical since to access your data you need to close
the file

Not anymore. Since 0.16 files are readable while being written to.

Does this mean I can open the same file as map input and as reduce output, so
I can update files in place instead of creating new ones?

No, files are still write-once in HDFS: you cannot modify a file after it is
closed.
But while a file is still open you can write more data into it, and other
clients will be able to read this new data.
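
For illustration, a minimal sketch of that pattern with the Java FileSystem
API. The path is made up, and hflush() is the later name for making buffered
bytes visible to readers (the 0.16-era call was sync()):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TailWriter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/logs/stream.dat");   // hypothetical path

    FSDataOutputStream out = fs.create(p);   // keep the stream open between writes
    out.writeBytes("record 1\n");
    out.hflush();                            // make the bytes visible to readers
                                             // (sync() in 0.16-era releases)
    // Another client can open /logs/stream.dat now and read "record 1",
    // even though this writer has not closed the file yet.
    out.writeBytes("record 2\n");
    out.hflush();
    out.close();                             // after close the file is immutable
  }
}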

Also, if I want to query the records, should I use HBase instead of HDFS? Say
we have a large amount of data stored as (key, value) pairs.

HDFS has a file system API; there is no notion of a record in it, just files
and bytes.
Depending on how you define a record you can use different systems, including
HBase and Pig; these two work well for table-like data collections.
Or you can write your own MapReduce job to process a big key-value dataset.
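
As a sketch of that last option, here is a small job against the classic
org.apache.hadoop.mapred API (class names moved around a bit between
releases) that counts how many values each key has in a tab-separated
(key, value) text dataset; the class and path names are made up:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KvQuery {

  // Identity map: pass each (key, value) record through so records
  // with the same key get grouped for the reducer.
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(key, value);
    }
  }

  // Count how many values arrived for each key.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, IntWritable> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int n = 0;
      while (values.hasNext()) { values.next(); n++; }
      out.collect(key, new IntWritable(n));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(KvQuery.class);
    conf.setJobName("kv-query");
    conf.setInputFormat(KeyValueTextInputFormat.class); // tab-separated key/value lines
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setMapOutputValueClass(Text.class);     // map emits (Text, Text)
    conf.setOutputKeyClass(Text.class);          // reduce emits (Text, IntWritable)
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}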

Regards,
--Konstantin

Thanks.


I would like to use a Hadoop cluster to process it as fast as possible. I
need to be able to maintain some guaranteed max. processing time, for
example under 3 minutes.

It looks like you do not need very strict guarantees.
I think you can use HDFS as the data storage.
I don't know what kind of data processing you do, but I agree with Stefan
that MapReduce is designed for batch tasks rather than for real-time
processing.




Stefan Groschupf wrote:

Hadoop might be the wrong technology for you.
MapReduce is a batch processing mechanism. Also, HDFS might be critical since
to access your data you need to close the file, which means you might end up
with many small files, a situation where HDFS is not very strong (the
namespace is held in memory).
HBase might be an interesting tool for you, also ZooKeeper if you want to
do something home-grown...
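
To make the HBase suggestion concrete, a sketch of a keyed write and point
read with today's client API (much newer than what existed when this was
written); the "events" table and its "d" column family are assumptions and
would have to exist already:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {
      // Store one (key, value) record under row key "row-42".
      Put put = new Put(Bytes.toBytes("row-42"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"),
                    Bytes.toBytes("some payload"));
      table.put(put);

      // Read it back by key: a point lookup, no MapReduce job involved.
      Result res = table.get(new Get(Bytes.toBytes("row-42")));
      System.out.println(Bytes.toString(
          res.getValue(Bytes.toBytes("d"), Bytes.toBytes("v"))));
    }
  }
}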



On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:

Hi!
I am considering using Hadoop for (almost) realtime data processing. I
have data coming in every second and I would like to use a Hadoop cluster
to process it as fast as possible. I need to be able to maintain some
guaranteed max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such a manner? I would
appreciate it if you could share your experience or give me pointers to
some articles or pages on the subject.

Vadim


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com




