2009/10/12 stack <[email protected]>

> (Mendeley looks great).
>
> See below:
>
> On Mon, Oct 12, 2009 at 1:46 AM, Dan Harvey <[email protected]> wrote:
>
> > One question I have for this is if we start using the standalone
> > operation on a single server initially whilst we set up and test our
> > systems, is it possible to migrate from this to the distributed system
> > without having to rebuild the data store?
>
> Should just be a matter of a copy from local disk up to hdfs. It wouldn't
> be hard to confirm this for yourself.
>
> Running a single instance of hbase in anything but a test setup is not
> really recommended. Or rather, we've not spent any time on making sure
> this sort of deploy is performant.

If we think HBase is a possibility then I will look into testing this to
see how well it works, but we've chosen to stick with mysql for a little
bit longer, so we will probably go straight to a cluster when we start
using something else, and then this may not be a problem.
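
The migration Stack describes roughly amounts to copying the standalone
data directory up into HDFS and repointing hbase.rootdir at the new
location, either from the command line with "hadoop fs -copyFromLocal" or
programmatically. Below is a minimal sketch of the programmatic copy using
Hadoop's FileSystem API; the paths are made-up examples, and it assumes
HBase is shut down while the copy runs.

    // Sketch only: copy a standalone HBase data directory into HDFS.
    // The source and destination paths are hypothetical; the real ones
    // depend on hbase.rootdir in hbase-site.xml.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyLocalHBaseToHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs settings
        FileSystem hdfs = FileSystem.get(conf);     // the configured (HDFS) filesystem
        Path localRoot = new Path("/tmp/hbase-dan/hbase");  // standalone hbase.rootdir (example)
        Path hdfsRoot = new Path("/hbase");                 // new hbase.rootdir in HDFS (example)
        // copyFromLocalFile() copies the whole directory tree recursively.
        hdfs.copyFromLocalFile(localRoot, hdfsRoot);
      }
    }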
> > A second question is more trying to understand the way in which to use
> > HBase. If we have documents that have many authors, which themselves
> > have a varying amount of metadata, what is a good approach to store
> > this? From reading about HBase I see it could be done using a column
> > family on the document for, say, author_name: and author_email:, but if
> > there are an unknown number of author properties this probably isn't
> > the best way. Would using a separate table be better to store the
> > author data in?
>
> How do you think you will be accessing the data? Will you be doing
> lookups on the attribute or by author or both?

In the near future we will only need to do lookups by a document id to get
all the metadata for a document. We will also be running over all the data
offline to process and work on it. Eventually we may need to look up by
the author as well, but I think HBase may just be used to store all the
data, and then we can build indexes for the unique authors we can derive
from it. I guess that would be a good way to go about it?

> > My last question is about using Map/Reduce on top of HBase: is the
> > Map/Reduce code still location aware for where the data is stored in
> > HDFS, or does using Map/Reduce create a larger I/O bottleneck than
> > using HDFS normally?
>
> The TableInputFormat in hbase passes the mapreduce framework the address
> of running regionservers. In our experience, the mapreduce framework will
> near always run tasks on the tasktracker that is running on the same
> machine as the regionserver hosting the task source region.
>
> > If we choose to use HBase I hope to start being more active in the
> > community here soon!
>
> Let us know if there is anything we can do to help you with your
> evaluation.

One other question about performance: we are looking to use the system to
store all documents that users upload or enter, so we will also need to
continually update document metadata when a user edits it. The edits will
be a lot less frequent than new documents being added. How well does HBase
perform with updating cell values? And how well does adding rows and
updating them scale as the cluster grows?

> St.Ack

> > Thanks,
> > Dan Harvey
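
For the schema question, one layout that fits "look everything up by
document id" (a sketch, not necessarily the list's recommendation) is a
single documents table keyed by document id, with the author's position
encoded in the column qualifier so a row can carry any number of authors
and properties. The sketch below uses the 0.20-era Java client; the table,
family, and qualifier names are made up. It also touches the update
question: an edit is just another Put at the same row and column, so it
goes down the same write path as an insert.

    // Rough sketch against the 0.20-era HBase client API (details vary by
    // version). Table "documents", families "doc" and "author", and the
    // "<author index>:<property>" qualifier scheme are illustrative only.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DocumentAuthorsSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "documents");

        // Row key = document id; each author's properties hang off the
        // "author" family with the author's position in the qualifier.
        Put put = new Put(Bytes.toBytes("doc-12345"));
        put.add(Bytes.toBytes("doc"), Bytes.toBytes("title"), Bytes.toBytes("Some paper"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("0:name"), Bytes.toBytes("A. Author"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("0:email"), Bytes.toBytes("a@example.org"));
        put.add(Bytes.toBytes("author"), Bytes.toBytes("1:name"), Bytes.toBytes("B. Author"));
        table.put(put);

        // An edit is just another Put at the same row/column; readers see
        // the newest version, so updating a cell is written the same way
        // as inserting one.
        Put edit = new Put(Bytes.toBytes("doc-12345"));
        edit.add(Bytes.toBytes("author"), Bytes.toBytes("0:email"), Bytes.toBytes("new@example.org"));
        table.put(edit);
      }
    }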

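On the Map/Reduce point, TableInputFormat produces one input split per
region and reports the hosting regionserver as the split's location, which
is how the framework ends up running map tasks next to their regions, as
Stack notes above. A minimal sketch of a job wired up that way, reusing the
hypothetical "documents" table (API details vary by HBase and Hadoop
version):

    // Sketch of a MapReduce job reading directly from an HBase table via
    // TableInputFormat; here it just counts rows with a counter.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanDocumentsJob {

      // One map task per region; each task is handed the rows of a single region.
      static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context context) {
          context.getCounter("sketch", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "scan-documents-sketch");
        job.setJarByClass(ScanDocumentsJob.class);

        Scan scan = new Scan();   // full-table scan; restrict columns in practice
        scan.setCaching(500);     // fetch rows in batches to cut RPC round trips
        TableMapReduceUtil.initTableMapperJob(
            "documents", scan, RowCountMapper.class,
            NullWritable.class, NullWritable.class, job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }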