Usman,

HDFS is a distributed grid file system, as opposed to a live serving database. A simple analogy is the difference between the Linux file system and MySQL running on top of Linux to provide database functionality. Furthermore, HDFS is optimized for throughput rather than latency, so it won't be very responsive for live serving.

That said, you should take a look at HBase, a key-value database built on top of HDFS; it is part of the official Apache Hadoop project. HBase not only gives you lower request latency for live serving, but also provides basic transactional semantics (on a per-key basis) so that you can do updates/inserts/deletes. I believe StumbleUpon is now serving their live traffic directly from HBase; you can take a look at their presentation from the NoSQL event last month at:

http://blog.oskarsson.nu/2009/06/nosql-debrief.html

From that link you can also download presentations for a number of other scalable low-latency key-value stores.

Finally, you should take a look at this nice blog post from LinkedIn. It shows a good example of how to use Hadoop's raw processing muscle to prepare the data, then parallel-load the results into a live serving system:

http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/
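The pattern in that post can be sketched in miniature: a batch job summarizes the raw data, then the summary is bulk-loaded into a store that the web tier can query cheaply. Here is a toy Python sketch of that data cycle; SQLite stands in for the serving store and a simple counter for the Hadoop job, and all names and data are illustrative:

```python
# Toy sketch of the "batch summarize, then load into a serving store" cycle.
# In production the summarize step would be a MapReduce job over HDFS and the
# serving store would be something like Voldemort or HBase; here we simulate
# both with plain Python and an in-memory SQLite table.
import sqlite3
from collections import Counter

# 1. Batch step: summarize raw log records (stand-in for the Hadoop job).
raw_logs = [
    "us page_a", "us page_b", "de page_a", "us page_a", "de page_c",
]
hits_per_country = Counter(line.split()[0] for line in raw_logs)

# 2. Load step: bulk-insert the summary into the serving database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stats (country TEXT PRIMARY KEY, hits INTEGER)")
db.executemany("INSERT INTO stats VALUES (?, ?)", hits_per_country.items())
db.commit()

# 3. Serving step: the web frontend issues cheap point lookups, never
#    touching the raw data or the batch system.
(us_hits,) = db.execute(
    "SELECT hits FROM stats WHERE country = 'us'").fetchone()
print(us_hits)  # 3
```

The key property is that the expensive scan over the full data set happens offline, so the latency seen by the browser is just a keyed lookup.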

Cheers,

-- amr

Usman Waheed wrote:
Hi All,

Is there a recommended way to extract data from HDFS and perform some computations on it in order to display the results on a webpage? One thing that comes to my mind is to write simple Perl CGI scripts that extract the data from HDFS and do the computational work before sending the results to the browser.

or

Maybe run some scripts in the background that summarize the data in HDFS and insert it into a DB table. We could then write a web GUI that interacts with the DB table and displays the desired stats, with graphs drawn using ploticus. Our data set in HDFS will eventually grow, so speed will be important.

Thanks,
Usman
