Andy Liu wrote:
I'm exploring the possibility of using the Hadoop records framework to store
these document records on disk.  Here are my questions:

1. Is this a good application of the Hadoop records framework, keeping in
mind that my goals are speed and scalability?  I'm assuming the answer is
yes, especially considering Nutch uses the same approach.

For read-only access, performance should be decent. However, Hadoop's file structures do not permit incremental updates; they are primarily designed for batch operations, like MapReduce outputs. If you need to incrementally update your data, then you might look at something like BDB, a relational DB, or perhaps experiment with HBase. (HBase is designed to be a much more scalable, incrementally updateable DB than BDB or relational DBs, but its implementation is not yet complete.)
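To illustrate the distinction, here is a minimal stdlib-only sketch of the append-only record pattern that batch-oriented file formats like Hadoop's rely on. This is not the actual Hadoop records API; the class and method names are hypothetical. Records are length-prefixed and tightly packed, so you can append and scan efficiently, but you cannot update one record in place without rewriting the file:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch (not Hadoop's API): records written sequentially
// as (length, payload) pairs. Appending and scanning are cheap; updating
// a record in place is not possible because records are variable-length.
public class RecordFileSketch {

    // Batch write: append each record as a length prefix plus UTF-8 bytes.
    static void writeRecords(Path file, List<String> records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (String r : records) {
                byte[] bytes = r.getBytes("UTF-8");
                out.writeInt(bytes.length);   // record length prefix
                out.write(bytes);             // record payload
            }
        }
    }

    // Batch read: scan the whole file front to back.
    static List<String> readRecords(Path file) throws IOException {
        List<String> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) {
                int len;
                try {
                    len = in.readInt();       // next record's length
                } catch (EOFException e) {
                    break;                    // end of file reached
                }
                byte[] bytes = new byte[len];
                in.readFully(bytes);
                records.add(new String(bytes, "UTF-8"));
            }
        }
        return records;
    }

    public static List<String> demo() throws IOException {
        Path file = Files.createTempFile("records", ".dat");
        writeRecords(file, Arrays.asList("hello", "world"));
        List<String> back = readRecords(file);
        Files.delete(file);
        return back;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

An incrementally updateable store (BDB, a relational DB, or eventually HBase) instead maintains an index so individual records can be replaced, which is exactly what this flat layout gives up in exchange for fast sequential throughput.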

Doug

Reply via email to