Solr on Hbase

Jay Booth Fri, 15 May 2009 12:52:15 -0700

Hey guys, I have a lot of experience with Lucene and Solr (not much of an
emailer though) and was planning on spending the weekend on a code binge,
contributing something to HBase so I can put it on my resume.  Any
suggestions for things you're really trying to get out for .20 and could use
some help with would be appreciated.  I also had the following idea for
running Solr on Hadoop:


- Initially entirely client-side, with potentially big chunks moved over to
  the cluster side in an hbase-solr.jar later for efficiency
- Client maintains a mapping of schema names to Solr schema.xml files
  - On first load of a schema, it creates a main table with rowkeys and a
    bunch of secondary tables for secondary indices, tokenizing as
    appropriate based on the config
- Client accepts update, delete, query and "edit" requests
  - The first three are handled just like they are now in Solr, although
    update (delete/re-insert all columns for a row) will likely be pretty
    inefficient on HBase's architecture, hence the introduction of "edit" to
    reduce row bloat in HBase
  - Queries are handled automatically via multiple queries to HBase: pull
    sets of row IDs, reduce them to only those that hit, then pull the
    actual documents for all hits.  This will be pretty inefficient compared
    to running inside the cluster, but I'm trying to limit what I take on in
    the first pass.
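
To make the layout above concrete, here is a rough sketch of the idea, not
real HBase or Solr code.  Plain Python dicts stand in for the main table and
the secondary index tables, and every name in it (SketchIndex, tokenize, the
field names) is made up for illustration.  It assumes a whitespace tokenizer
and AND-only term queries, which is far less than a real schema.xml defines.

```python
from collections import defaultdict


class SketchIndex:
    """In-memory stand-in: plain dicts play the role of HBase tables."""

    def __init__(self, indexed_fields):
        self.indexed_fields = set(indexed_fields)
        # "Main table": rowkey -> {field: value}
        self.main = {}
        # One "secondary table" per indexed field: token -> set of rowkeys
        self.secondary = {f: defaultdict(set) for f in self.indexed_fields}

    @staticmethod
    def tokenize(value):
        # Hypothetical tokenizer: lowercase, split on whitespace.
        return set(str(value).lower().split())

    def _unindex(self, rowkey, field):
        old = self.main.get(rowkey, {}).get(field)
        if old is None or field not in self.indexed_fields:
            return
        for token in self.tokenize(old):
            self.secondary[field][token].discard(rowkey)

    def _index(self, rowkey, field, value):
        if field in self.indexed_fields:
            for token in self.tokenize(value):
                self.secondary[field][token].add(rowkey)

    def update(self, rowkey, doc):
        """Solr-style update: delete the row, then re-insert every column."""
        self.delete(rowkey)
        self.main[rowkey] = dict(doc)
        for field, value in doc.items():
            self._index(rowkey, field, value)

    def edit(self, rowkey, field, value):
        """Touch a single column and leave the rest of the row alone."""
        self._unindex(rowkey, field)
        self.main.setdefault(rowkey, {})[field] = value
        self._index(rowkey, field, value)

    def delete(self, rowkey):
        for field in list(self.main.get(rowkey, {})):
            self._unindex(rowkey, field)
        self.main.pop(rowkey, None)

    def query(self, terms):
        """terms is {field: token}: one index lookup per term, intersect
        the rowkey sets, then fetch the documents for the survivors."""
        hits = None
        for field, token in terms.items():
            rowkeys = self.secondary.get(field, {}).get(token.lower(), set())
            hits = set(rowkeys) if hits is None else hits & rowkeys
        return {rk: self.main[rk] for rk in sorted(hits or ())}
```

For example, after idx = SketchIndex(["title", "body"]) and a couple of
idx.update(...) calls, idx.query({"title": "intro", "body": "search"}) does
two index lookups, intersects the rowkey sets, and only then reads documents
from the main table.  Against a real cluster each of those steps would be a
separate round trip, which is the inefficiency mentioned above.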

I'll probably only get a small subset of the functionality defined in a Solr
schema.xml working over the weekend, but I wanted to bounce the idea out
there.  Is this something the community's interested in?  There are lots of
Solr users out there; if they could seamlessly switch to this and I don't
create massive performance problems, it seems like it would be useful.

Otherwise, what are open bugs that could particularly use attention right
now?
