Hello, I'm working on a Hadoop project where my data set consists of many HTML files (websites). One aspect of the project involves traditional MapReduce analysis of the data set, but I would also like to use Hadoop as a sort of "cache server," i.e., I want to be able to retrieve the HTML for a website I have already visited.
My question is this: what is the best way to interact with HDFS to make simple existence queries and to retrieve specific files for reading? Ideally I would like to do this at the application level, most likely from Ruby. So far I have explored mounting HDFS in userspace with one of the FUSE packages, but I ran into quite a bit of difficulty installing either of the two popular ones. My second option seems to be Hive, but I haven't been able to find any bindings for Ruby, Python, etc.
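To make the question concrete, here is a rough sketch of the kind of access I'm after. As I understand it, some Hadoop setups expose an HTTP REST interface to HDFS (WebHDFS); assuming that's available on my cluster, an existence check and a read might look something like this in Ruby (the NameNode host, port, and paths below are made up):

```ruby
require "net/http"
require "uri"

# Hypothetical NameNode address -- adjust for the actual cluster.
NAMENODE = "namenode.example.com"
PORT     = 50070 # assumed default NameNode HTTP port

# Does a file exist in HDFS? GETFILESTATUS returns 200 if it does,
# 404 if it doesn't.
def hdfs_exists?(path)
  uri = URI("http://#{NAMENODE}:#{PORT}/webhdfs/v1#{path}?op=GETFILESTATUS")
  Net::HTTP.get_response(uri).is_a?(Net::HTTPSuccess)
end

# Fetch a file's contents. OPEN replies with a redirect to a DataNode,
# so follow the Location header once.
def hdfs_read(path)
  uri = URI("http://#{NAMENODE}:#{PORT}/webhdfs/v1#{path}?op=OPEN")
  res = Net::HTTP.get_response(uri)
  res = Net::HTTP.get_response(URI(res["location"])) if res.is_a?(Net::HTTPRedirection)
  res.body
end

# Hypothetical usage: serve cached HTML if I've already crawled the page.
if hdfs_exists?("/crawl/example.com/index.html")
  puts hdfs_read("/crawl/example.com/index.html")[0, 200]
end
```

Something along those lines (a boolean existence check plus a read) is all I need; WebHDFS is just one possibility I've read about, and any mechanism that gives equivalent calls from Ruby would work.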
Any suggestions or advice would be greatly appreciated!

Cheers,
Mike