Usman,
HDFS is a distributed grid file system, as opposed to a live serving
database. A simple analogy is the difference between the Linux file
system and MySQL running on top of Linux to turn it into a database.
Furthermore, HDFS is optimized for throughput rather than latency,
hence it won't be very responsive for live serving.
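To make the distinction concrete, here is a minimal sketch of reading a
file through the Java FileSystem API (the /logs/day1.txt path is made
up for illustration): HDFS hands you byte streams to scan, not rows to
query:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  public class HdfsRead {
    public static void main(String[] args) throws Exception {
      // Connects to whatever cluster core-site.xml points at.
      FileSystem fs = FileSystem.get(new Configuration());
      // Hypothetical path; you get a stream and scan the whole file,
      // there is no "look up row X" primitive.
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path("/logs/day1.txt"))));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
      in.close();
    }
  }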
That said, you should take a look at HBase, a key-value
database built on top of HDFS and part of the official Apache
Hadoop project. HBase not only lowers request latency for live
serving, but also gives you basic transactional semantics (on a
per-key basis) so that you can do updates/inserts/deletes. I think
StumbleUpon is now serving their live traffic directly from HBase; you
can take a look at their presentation from the NoSQL event last month at:
http://blog.oskarsson.nu/2009/06/nosql-debrief.html
From that link you can also download presentations for a number of
other scalable low-latency key-value stores.
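To give you a feel for HBase, here is a rough sketch against its Java
client (the table, column family and row key are placeholders I made
up): single-row writes plus low-latency point reads, which is exactly
what live serving needs:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseHello {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(HBaseConfiguration.create(), "pageviews");
      // Update a single row; writes are atomic per row key.
      Put put = new Put(Bytes.toBytes("usman.example.com"));
      put.add(Bytes.toBytes("stats"), Bytes.toBytes("hits"),
          Bytes.toBytes("42"));
      table.put(put);
      // Low-latency point read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("usman.example.com")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("hits"))));
      table.close();
    }
  }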
Finally, you should take a look at this nice blog post from LinkedIn;
it shows a good example of how to use Hadoop's raw processing muscle to
prepare the data and then load the results in parallel into a live
serving system:
http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/
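That pattern maps directly onto your second option below. As a
bare-bones sketch (the /logs input, /summaries/hits output and the log
layout are all assumptions on my part), a MapReduce job like this one
counts hits per page; its small summary output is what you would then
bulk load into HBase, Voldemort, or a plain DB table for the web GUI:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class HitSummary {
    public static class HitMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      public void map(LongWritable key, Text line, Context ctx)
          throws java.io.IOException, InterruptedException {
        // Assumes the first whitespace-separated field is the page key.
        ctx.write(new Text(line.toString().split("\\s+")[0]), ONE);
      }
    }

    public static class SumReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
          throws java.io.IOException, InterruptedException {
        long total = 0;
        for (LongWritable c : counts) total += c.get();
        ctx.write(key, new LongWritable(total));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "hit-summary");
      job.setJarByClass(HitSummary.class);
      job.setMapperClass(HitMapper.class);
      job.setCombinerClass(SumReducer.class);
      job.setReducerClass(SumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      FileInputFormat.addInputPath(job, new Path("/logs"));
      FileOutputFormat.setOutputPath(job, new Path("/summaries/hits"));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }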
Cheers,
-- amr
Usman Waheed wrote:
Hi All,
Is there a recommended way to extract data from HDFS and
perform some computations on it in order to display the results
on a webpage? One thing that comes to mind is to write simple CGI
Perl scripts that extract the data from HDFS and perform computational
work on it before sending the results to the browser.
or
Maybe run some scripts in the background that summarize the data in
HDFS and insert the results into a DB table. We can then write a web
GUI that interacts with the DB table and displays the desired stats
with graphs using Ploticus. Our data set in HDFS will eventually grow,
so speed will be important.
Thanks,
Usman