Hey guys, I have a lot of experience with Lucene and Solr (not much of an emailer though) and was planning on spending the weekend on a code binge, contributing something to HBase so I can put it on my resume. Any suggestions for things you're really trying to get out for 0.20 and could use some help with would be appreciated. I also had the following idea for running Solr on Hadoop:
- Initially entirely client-side, with potentially big chunks moved over to the cluster side in an hbase-solr.jar later for efficiency
- Client maintains a mapping of schema names to Solr schema.xml files
- On first load of a schema, the client creates a main table with rowkeys and a bunch of secondary tables for secondary indices, tokenizing as appropriate based on the config
- Client accepts update, delete, query and "edit" requests
  - The first three are handled just like they are now in Solr, although update (delete/re-insert all columns for a row) will likely be pretty inefficient on HBase's architecture, hence the introduction of "edit" to reduce row bloat in HBase
  - Queries are handled automatically via multiple queries to HBase: pull sets of row IDs, reduce to only those that hit, then pull the actual documents for all hits. This will be pretty inefficient compared to running inside the cluster, but I'm trying to limit what I take on in the first pass.

I'll probably only get a small subset of the functionality defined in a Solr schema.xml done over the weekend, but wanted to bounce the idea out there. Is this something the community's interested in? There are lots of Solr users out there; if they could seamlessly switch to this, and I don't create massive performance problems, it seems like it would be useful.

Otherwise, what open bugs could particularly use attention right now?
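
To make the client-side flow above a bit more concrete, here is a rough sketch of the update and query paths. It is only an illustration under some loud assumptions: plain Python dicts stand in for the HBase main table and secondary index tables, `tokenize` is a placeholder for whatever analyzer the schema.xml would configure, and all the names (`main_table`, `index_tables`, `update`, `delete`, `query`) are made up for this example, not taken from any existing API. Queries here are AND-only, which is just the simplest case to show.

```python
import re
from collections import defaultdict

# Assumption: in-memory stand-ins for HBase tables. A real client would
# issue gets/puts/scans against HBase instead of touching dicts.
main_table = {}                                        # rowkey -> {field: value}
index_tables = defaultdict(lambda: defaultdict(set))   # field -> token -> {rowkeys}


def tokenize(text):
    """Hypothetical stand-in for the analyzer configured in schema.xml."""
    return re.findall(r"\w+", text.lower())


def delete(rowkey):
    """Remove a document and its entries in every secondary index table."""
    old = main_table.pop(rowkey, None)
    if old is None:
        return
    for field, value in old.items():
        for token in tokenize(value):
            index_tables[field][token].discard(rowkey)


def update(rowkey, doc):
    """Solr-style update: delete the whole row, then re-insert all columns."""
    delete(rowkey)
    main_table[rowkey] = dict(doc)
    for field, value in doc.items():
        for token in tokenize(value):
            index_tables[field][token].add(rowkey)


def query(terms):
    """terms is a list of (field, token) pairs, ANDed together.

    Step 1: pull one set of row IDs per term from the index tables.
    Step 2: reduce (intersect) to the row IDs that hit every term.
    Step 3: pull the actual documents for the surviving row IDs.
    """
    hits = None
    for field, token in terms:
        rowkeys = index_tables[field].get(token.lower(), set())
        hits = set(rowkeys) if hits is None else hits & rowkeys
        if not hits:
            return {}
    return {rk: main_table[rk] for rk in sorted(hits or ())}


update("doc1", {"title": "HBase in action", "body": "secondary indices on HBase"})
update("doc2", {"title": "Solr cookbook", "body": "faceting and indices"})
print(sorted(query([("body", "indices")])))
print(sorted(query([("body", "indices"), ("title", "solr")])))
```

Each term costs one index lookup plus a final round of document fetches, which is the "multiple queries to HBase" overhead mentioned above. The sketch also shows why a full update is expensive: every index entry for the old row is removed and rewritten even if only one column changed, which is the case "edit" is meant to avoid.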
