These are all good questions; let me see if I can answer them.

Blur can index data both through MapReduce and through direct updates on
the Blur shard servers.  So you are not limited to MapReduce alone for
index updates.
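
Roughly, a direct update looks like this through the Thrift client.  I'm
writing this from memory, so treat the exact class and method names as
assumptions and check the generated code for your version (the table name,
row/record ids, and controller address are just placeholders):

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.Column;
import org.apache.blur.thrift.generated.Record;
import org.apache.blur.thrift.generated.RecordMutation;
import org.apache.blur.thrift.generated.RecordMutationType;
import org.apache.blur.thrift.generated.RowMutation;
import org.apache.blur.thrift.generated.RowMutationType;

public class DirectIndexExample {
  public static void main(String[] args) throws Exception {
    // Connect to a controller (host:port is an assumption for your cluster).
    Blur.Iface client = BlurClient.getClient("controller1:40010");

    // Build a single record with one column.
    Record record = new Record();
    record.setRecordId("record-1");
    record.setFamily("fam");
    record.addToColumns(new Column("name", "value"));

    // Wrap the record in a row mutation and send it straight to the
    // shard servers -- no MapReduce job involved.
    RowMutation mutation = new RowMutation();
    mutation.setTable("mytable");
    mutation.setRowId("row-1");
    mutation.setRowMutationType(RowMutationType.REPLACE_ROW);
    mutation.addToRecordMutations(
        new RecordMutation(RecordMutationType.REPLACE_ENTIRE_RECORD, record));
    client.mutate(mutation);
  }
}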

Blur contains a file system cache to address the performance problems of
random-access reads from HDFS.  You can control how much RAM the file
system cache uses on each server.  It is an LRU cache, so the "hot" (most
commonly accessed) portions of the index files are kept in memory.
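
The LRU part is just the standard eviction policy.  As a toy illustration
of the idea (this is not Blur's actual implementation, just the policy):

import java.util.LinkedHashMap;
import java.util.Map;

// Least-recently-used file blocks are evicted once the cache is full,
// so the hot blocks stay resident in memory.
public class LruBlockCache extends LinkedHashMap<Long, byte[]> {
  private final int maxBlocks;

  public LruBlockCache(int maxBlocks) {
    super(16, 0.75f, true); // accessOrder=true gives LRU ordering
    this.maxBlocks = maxBlocks;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
    return size() > maxBlocks; // evict the coldest block
  }
}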

At this point you can define the number of shards (indexes) per table in
Blur.  The number of shards does not have to match the number of servers
you have; Blur will lay the shards out evenly across all the online
servers.  If a server goes offline, Blur will automatically move the
shards (indexes) that were on the server that went down to other online
servers.  In the future, splits or additional shards may be allowed.
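
For reference, the shard count is set at table-creation time.  A minimal
sketch through the Thrift client (again, names follow the generated Blur
API as I remember it, and the HDFS path is just a placeholder):

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.TableDescriptor;

public class CreateTableExample {
  public static void main(String[] args) throws Exception {
    Blur.Iface client = BlurClient.getClient("controller1:40010");

    TableDescriptor td = new TableDescriptor();
    td.setName("mytable");
    // 64 shards for an 8-server x 8-core cluster
    // (see the sizing rule of thumb below).
    td.setShardCount(64);
    // Where the index data lives in HDFS.
    td.setTableUri("hdfs://namenode/blur/tables/mytable");
    client.createTable(td);
  }
}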

In practice I usually run 1 to 1.5 shards per table for every CPU core
across all the machines.  So if you have 8 servers, each with 8 cores,
that is 64 cores total, and I would keep the number of shards somewhere
between 64 and 96.  As far as CPU load goes, that will vary greatly with
the data you have, the number of queries you are trying to execute
concurrently, and the types of queries you are trying to execute.
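
To make "types of queries" concrete, here is a sketch of a simple query
through the same client.  The SimpleQuery/BlurQuery names follow the
generated API of that era and may differ in your version; the query
string and fetch size are just examples:

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.BlurQuery;
import org.apache.blur.thrift.generated.BlurResults;
import org.apache.blur.thrift.generated.SimpleQuery;

public class QueryExample {
  public static void main(String[] args) throws Exception {
    Blur.Iface client = BlurClient.getClient("controller1:40010");

    // Lucene-style query string against family "fam", column "name".
    SimpleQuery simpleQuery = new SimpleQuery();
    simpleQuery.setQueryStr("fam.name:value");

    BlurQuery blurQuery = new BlurQuery();
    blurQuery.setSimpleQuery(simpleQuery);
    blurQuery.setFetch(10); // number of hits to return

    BlurResults results = client.query("mytable", blurQuery);
    System.out.println("Total hits: " + results.getTotalResults());
  }
}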

In my experience RAM is usually a more limiting factor than CPU, meaning
that if you don't have enough memory to keep the heavily used parts of
the index in memory, there will be a lot of HDFS accesses, which will be
slow.  But if you don't have enough memory to keep the index hot, it
won't matter whether you are running Solr, Lucene, or Blur; too little
RAM will make all of them perform badly.

Aaron


On Sat, Feb 2, 2013 at 8:51 PM, Li Li <[email protected]> wrote:

> Yes, that is my environment. As I understand it, Blur uses MapReduce to
> index documents and save them to HDFS, but for searching it just uses HDFS
> as a Lucene directory. Because HDFS is not suitable for random access, Blur
> uses some tricks to tackle this problem. So as far as I know, it only uses
> HDFS when searching; does that mean I need some machines to do searching,
> which is a CPU-heavy task? If I am right, Lucene is not as easy to scale to
> large data as Solr, since it cannot split data into shards. How does Blur
> deal with this problem?
> On 2013-2-2 at 11:46 PM, "Aaron McCurry" <[email protected]> wrote:
>
> > If I understand your setup correctly, you have a Hadoop cluster running
> > MapReduce and HDFS.  You have permission to read and write to HDFS;
> > however, you cannot add any new software to the machines running Hadoop.
> > You also have some other machines where you can manage the software, and
> > they have read/write access to HDFS.  So if I'm understanding your
> > question and setup correctly, then yes, you can install Blur on machines
> > that do not run Hadoop (HDFS) locally.  Blur only needs access to HDFS as
> > a service.  I believe you will get better performance by running a
> > separate HDFS instance just for Blur, but it is not required.
> >
> > Please let us know if you have any issues or questions.  Thanks!
> >
> > Aaron
> >
> >
> > > On Fri, Feb 1, 2013 at 6:08 AM, Tim Williams <[email protected]>
> > > wrote:
> >
> > > On Fri, Feb 1, 2013 at 4:14 AM, Li Li <[email protected]> wrote:
> > > > Hi all,
> > > >    I want to use Hadoop and its HDFS to provide searching
> > > > functionality.
> > > >    I can use Hadoop to run MapReduce jobs and store/retrieve data in
> > > > HDFS, but I don't have permission to manage the Hadoop cluster.
> > > >    I have a few other machines that can communicate with the Hadoop
> > > > cluster (my own machines can use HDFS, and Hadoop MapReduce jobs can
> > > > get MySQL database data residing on my own machines).
> > > >    Can I set up Blur for searching? Thanks.
> > >
> > > Hi Li,
> > > If you can run jobs and have permission to write to HDFS, you should
> > > be fine.  If you encounter problems, let us know.
> > >
> > > Thanks
> > > --tim
> > >
> >
>
