Hi, I'm working at Mendeley.com and we are looking into ways to store a large collection of document metadata that we need to process and then use to update our live site. HBase is one of the systems we are considering, as it integrates with Hadoop, which we will definitely be using in the near future.
One question I have: if we start out with standalone operation on a single server while we set up and test our systems, is it possible to migrate from that to a distributed setup later without having to rebuild the data store?

A second question is more about trying to understand how best to use HBase. If we have documents with many authors, and the authors themselves have varying amounts of metadata, what is a good approach to storing this? From reading about HBase I can see it could be done using a column family on the document with columns such as author_name: and author_email:, but if there is an unknown number of author properties this probably isn't the best way. Would it be better to store the author data in a separate table? (I've put a rough sketch of the kind of layout I mean in a P.S. at the bottom of this mail.)

My last question is about using Map/Reduce on top of HBase: is the Map/Reduce code still location-aware about where the data is stored in HDFS, or does going through HBase create a larger I/O bottleneck than using HDFS directly? (There's a second sketch in a P.P.S. below showing how I'm picturing the job wiring.)

If we choose to use HBase I hope to start being more active in the community here soon!

Thanks,
Dan Harvey
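
P.S. For the author metadata question, this is roughly the layout I had in mind, written against the HBase Java client as far as I understand it. The "documents" table, the "author" column family and the index-prefixed qualifiers are all just made up for illustration, so please say if I'm abusing the data model:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AuthorLayoutSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "documents" table with a single "author" column family.
        HTable documents = new HTable(HBaseConfiguration.create(), "documents");

        Put put = new Put(Bytes.toBytes("doc-12345"));
        // Prefix each qualifier with the author's position so one row can hold
        // a variable number of authors, each with a variable set of fields.
        put.add(Bytes.toBytes("author"), Bytes.toBytes("0:name"), Bytes.toBytes("A. Author"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("0:email"), Bytes.toBytes("a.author@example.org"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("1:name"), Bytes.toBytes("B. Author"));
        // Author 1 has no email address -- the column is simply never written.
        documents.put(put);
        documents.close();
    }
}

The idea is that missing author properties are just absent columns rather than nulls, but I don't know whether very wide rows like this are a good fit, or whether a separate authors table is the more usual approach.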

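P.P.S. For the Map/Reduce question, this is how I'm picturing wiring a job up, going from the TableMapReduceUtil javadoc. The class names, the "documents" table and the author-counting logic are placeholders; what I'd like to confirm is whether the splits this produces still get scheduled close to the region servers that hold the data:

import java.io.IOException;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AuthorCountSketch {

    // Counts author columns per document row -- a stand-in for our real processing.
    static class AuthorCountMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            NavigableMap<byte[], byte[]> authors = value.getFamilyMap(Bytes.toBytes("author"));
            int count = (authors == null) ? 0 : authors.size();
            context.write(new Text(Bytes.toString(row.get())), new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "author-count-sketch");
        job.setJarByClass(AuthorCountSketch.class);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("author"));
        scan.setCaching(500);        // bigger scanner caching for a full-table job
        scan.setCacheBlocks(false);  // avoid filling the block cache from a scan

        // One map task per region of the (hypothetical) "documents" table.
        TableMapReduceUtil.initTableMapperJob(
                "documents", scan, AuthorCountMapper.class,
                Text.class, IntWritable.class, job);

        job.setNumReduceTasks(0);    // map-only is enough for this sketch
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

My (possibly wrong) understanding is that the splits come from the table's regions rather than from raw HDFS blocks, so I'd like to know how close that gets to the locality we would see reading the files directly.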