2009/10/12 stack <[email protected]>

> (Mendeley looks great).
>
> See below:
>
> On Mon, Oct 12, 2009 at 1:46 AM, Dan Harvey <[email protected]> wrote:
>
> > One question I have for this is if we start using the standalone
> > operation on a single server initially whilst we set up and test our
> > systems, is it possible to migrate from this to the distributed system
> > without having to rebuild the data store?
>
> Should just be a matter of a copy from local disk up to hdfs. It wouldn't
> be hard to confirm this for yourself.
>
> Running a single instance of hbase in anything but a test setup is not
> really recommended. Or rather, we've not spent any time on making sure
> this sort of deploy is performant.

If we think HBase is a possibility then I will look into testing this to
see how well it works, but we've chosen to stick with mysql for a little
bit longer, so we will probably go straight to a cluster when we start
using something else, and then this may not be a problem.
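
The migration Stack describes roughly amounts to copying the standalone
data directory up into HDFS and repointing hbase.rootdir at the new
location, either from the command line with "hadoop fs -copyFromLocal" or
programmatically. Below is a minimal sketch of the programmatic copy using
Hadoop's FileSystem API; the paths are made-up examples, and it assumes
HBase is shut down while the copy runs.

    // Sketch only: copy a standalone HBase data directory into HDFS.
    // The source and destination paths are hypothetical; the real ones
    // depend on hbase.rootdir in hbase-site.xml.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyLocalHBaseToHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs settings
        FileSystem hdfs = FileSystem.get(conf);     // the configured (HDFS) filesystem
        Path localRoot = new Path("/tmp/hbase-dan/hbase");  // standalone hbase.rootdir (example)
        Path hdfsRoot = new Path("/hbase");                 // new hbase.rootdir in HDFS (example)
        // copyFromLocalFile() copies the whole directory tree recursively.
        hdfs.copyFromLocalFile(localRoot, hdfsRoot);
      }
    }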
> > A second question is more trying to understand the way in which to use
> > HBase. If we have documents that have many authors, which themselves
> > have a varying amount of metadata, what is a good approach to store
> > this? From reading about HBase I see it could be done using a column
> > family on the document for, say, author_name: and author_email:, but if
> > there are an unknown number of author properties this probably isn't
> > the best way. Would using a separate table be better to store the
> > author data in?
>
> How do you think you will be accessing the data? Will you be doing
> lookups on the attribute or by author or both?

In the near future we will only need to do lookups by a document id to get
all the metadata for a document. We will also be running over all the data
offline to process and work on it. Eventually we may need to look up by
the author as well, but I think HBase may just be used to store all the
data, and then we can build indexes for the unique authors we can derive
from it. I guess that would be a good way to go about it?

> > My last question is about using Map/Reduce on top of HBase: is the
> > Map/Reduce code still location aware for where the data is stored in
> > HDFS, or does using Map/Reduce create a larger I/O bottleneck than
> > using HDFS normally?
>
> The TableInputFormat in hbase passes the mapreduce framework the address
> of running regionservers. In our experience, the mapreduce framework will
> near always run tasks on the tasktracker that is running on the same
> machine as the regionserver hosting the task source region.
>
> > If we choose to use HBase I hope to start being more active in the
> > community here soon!
>
> Let us know if there is anything we can do to help you with your
> evaluation.

One other question about performance: we are looking to use the system to
store all documents that users upload or enter, so we will also need to
continually update document metadata when a user edits it. The edits will
be a lot less frequent than new documents being added. How well does HBase
perform with updating cell values? And how well does adding rows and
updating them scale as the cluster grows?

> St.Ack

> > Thanks,
> > Dan Harvey
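
For the schema question, one layout that fits "look everything up by
document id" (a sketch, not necessarily the list's recommendation) is a
single documents table keyed by document id, with the author's position
encoded in the column qualifier so a row can carry any number of authors
and properties. The sketch below uses the 0.20-era Java client; the table,
family, and qualifier names are made up. It also touches the update
question: an edit is just another Put at the same row and column, so it
goes down the same write path as an insert.

    // Rough sketch against the 0.20-era HBase client API (details vary by
    // version). Table "documents", families "doc" and "author", and the
    // "<author index>:<property>" qualifier scheme are illustrative only.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DocumentAuthorsSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "documents");

        // Row key = document id; each author's properties hang off the
        // "author" family with the author's position in the qualifier.
        Put put = new Put(Bytes.toBytes("doc-12345"));
        put.add(Bytes.toBytes("doc"), Bytes.toBytes("title"), Bytes.toBytes("Some paper"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("0:name"), Bytes.toBytes("A. Author"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("0:email"), Bytes.toBytes("a@example.org"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("1:name"), Bytes.toBytes("B. Author"));
        table.put(put);

        // An edit is just another Put at the same row/column; readers see
        // the newest version, so updating a cell is written the same way
        // as inserting one.
        Put edit = new Put(Bytes.toBytes("doc-12345"));
        edit.add(Bytes.toBytes("author"), Bytes.toBytes("0:email"), Bytes.toBytes("new@example.org"));
        table.put(edit);
      }
    }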

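On the Map/Reduce point, TableInputFormat produces one input split per
region and reports the hosting regionserver as the split's location, which
is how the framework ends up running map tasks next to their regions, as
Stack notes above. A minimal sketch of a job wired up that way, reusing the
hypothetical "documents" table (API details vary by HBase and Hadoop
version):

    // Sketch of a MapReduce job reading directly from an HBase table via
    // TableInputFormat; here it just counts rows with a counter.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanDocumentsJob {

      // One map task per region; each task is handed the rows of a single region.
      static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context context) {
          context.getCounter("sketch", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "scan-documents-sketch");
        job.setJarByClass(ScanDocumentsJob.class);

        Scan scan = new Scan();   // full-table scan; restrict columns in practice
        scan.setCaching(500);     // fetch rows in batches to cut RPC round trips
        TableMapReduceUtil.initTableMapperJob(
            "documents", scan, RowCountMapper.class,
            NullWritable.class, NullWritable.class, job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }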