Andy Liu wrote:
I'm exploring the possibility of using the Hadoop records framework to
store these document records on disk. Here are my questions:
1. Is this a good application of the Hadoop records framework, keeping in
mind that my goals are speed and scalability? I'm assuming the answer is
yes, especially considering Nutch uses the same approach.
For read-only access, performance should be decent. However, Hadoop's
file structures do not permit incremental updates; rather, they are
primarily designed for batch operations, like MapReduce outputs. If you
need to incrementally update your data, then you might look at something
like BDB, a relational DB, or perhaps experiment with HBase. (HBase is
designed to be a much more scalable, incrementally updateable DB than
BDB or relational DBs, but its implementation is not yet complete.)
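To make the batch-vs-incremental distinction concrete, here is a small Python sketch (not Hadoop code; the file names, record layout, and function names are invented for illustration). It mimics the constraint Hadoop's flat-file formats impose: records are written once in sorted order and never modified in place, so an "update" means merging the old file with a batch of changes into a brand-new file.

```python
# Sketch: batch-oriented storage where an "update" is a full rewrite.
# Records are written once, in sorted key order, and never touched in
# place -- the same constraint Hadoop's file structures impose. All
# names here (write_batch, merge_update, the TSV layout) are invented
# for illustration, not part of any Hadoop API.

def write_batch(path, records):
    """Write records (a dict) as sorted key<TAB>value lines in one pass."""
    with open(path, "w") as f:
        for key in sorted(records):
            f.write(f"{key}\t{records[key]}\n")

def read_all(path):
    """Read the whole file back into a dict."""
    out = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            out[key] = value
    return out

def merge_update(old_path, new_path, changes):
    """Apply a batch of changes by rewriting the entire file.

    Note the cost is proportional to the whole file, not to the number
    of changes -- which is why BDB or an RDBMS suits workloads that
    update individual records in place.
    """
    merged = read_all(old_path)
    merged.update(changes)
    write_batch(new_path, merged)

if __name__ == "__main__":
    write_batch("docs.v1.tsv", {"doc2": "world", "doc1": "hello"})
    merge_update("docs.v1.tsv", "docs.v2.tsv",
                 {"doc1": "HELLO", "doc3": "new"})
```

For occasional bulk loads (the MapReduce-output case Doug describes) this pattern is fine; it only breaks down when updates are frequent and small.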
Doug