If I understand your system (and Lucene) correctly, you obviously must feed every queried field into Lucene, and the indexes will be stored for those documents.
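To make the indexed-vs-stored distinction concrete, here's a rough sketch of building a document both ways (this uses the Lucene 2.4-era `Field` API; the field names and variables are hypothetical, not from Tim's schema):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only -- assumes Lucene 2.4-style Field flags.
Document doc = new Document();

// Indexed AND stored: searchable, and the raw value comes back with each hit.
doc.add(new Field("title", titleText, Field.Store.YES, Field.Index.ANALYZED));

// Indexed only: searchable, but the raw value must be fetched elsewhere (e.g. HBase).
doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.ANALYZED));

// Stored, untokenized row key, so a hit can be resolved with a get-by-ROWKEY.
doc.add(new Field("rowkey", hbaseRowKey, Field.Store.YES, Field.Index.NOT_ANALYZED));
```

Storing only the row key keeps the index small while still letting you join back to HBase per hit.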
Your question is about whether to also store the raw fields in Lucene, or to store only the indexes in Lucene? A few things you might consider:

- Scaling Lucene is much more difficult than scaling HBase. Storing indexes and raw content is going to grow your Lucene instance fast. Scaling HBase is easy, and you're going to have constant performance, whereas Lucene performance will degrade significantly as it grows.

- Random access to HBase currently leaves something to be desired. What kind of performance are you looking for with 1M random fetches? There is major work being done for 0.19 and 0.20 that will really help with performance, as stack mentioned.

- With 1M random reads, you might never get the performance out of HBase that you want, certainly not if you're expecting 1M fetches to be done in "realtime" (~100ms or so). However, depending on your dataset and access patterns, you might be able to get sufficient performance with caching (either the block caching that is currently available, or the record caching slated for 0.20 but likely available sooner as a patch).

We are using Lucene by way of Solr and are not storing the raw data in Lucene. We have an external Memcached-like cache so that our raw content fetches are sufficiently quick. My team is currently working on building this cache into HBase.

I'm not sure whether the highlighting features are only part of Solr or also in Lucene, but of course you lose the ability to do those things if you don't put the raw content into Lucene.

JG

> -----Original Message-----
> From: stack [mailto:[email protected]]
> Sent: Tuesday, December 16, 2008 2:37 PM
> To: [email protected]
> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>
> Interesting question.
>
> Would be grand if you didn't have to duplicate the hbase data in the
> lucene index, just store the hbase locations -- or, just store small
> stuff in the lucene index and leave big stuff back in hbase -- but
> perhaps the double hop of lucene first and then to hbase will not
> perform well enough? 0.19.0 hbase will be better than 0.18.0 if you
> can wait a week or so for the release candidate to test.
>
> Let us know how it goes Tim,
> St.Ack
>
>
> tim robertson wrote:
> > Hi All,
> >
> > I have HBase running now, building Lucene indexes on Hadoop
> > successfully, and then I will get Katta running for distributing my
> > indexes.
> >
> > I have around 15 search fields indexed, and I wish to extract and
> > return those 15 to the user in the result set - my result sets will
> > be up to millions of records...
> >
> > Should I:
> >
> > a) have the values stored in the Lucene index, which will make it
> > slower to search but returns the results immediately in pages,
> > without hitting HBase
> >
> > or
> >
> > b) not store the data in the index, but page over the Lucene index
> > and do millions of "get by ROWKEY" calls on HBase?
> >
> > Obviously this is not happening synchronously while the user waits,
> > but I'm looking forward to hearing whether people have done similar
> > scenarios and what worked out nicely...
> >
> > Lucene degrades in performance at large page numbers (100th page of
> > 1000 results), right?
> >
> > Thanks for any insights,
> >
> > Tim
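On Tim's last question: deep paging is slow because a top-N hit collector has to keep the best page*pageSize scores seen over all matching docs, so page 100 of size 1000 means maintaining a 100,000-entry priority queue. A toy model in plain Java (a simplified stand-in for Lucene's collector, not Lucene itself; the hit counts are illustrative):

```java
import java.util.PriorityQueue;
import java.util.Random;

public class DeepPagingCost {
    // Simulates a top-N collector: to serve page `page` of size `pageSize`,
    // it must keep the best page*pageSize scores across ALL matching docs.
    static int heapOpsForPage(int page, int pageSize, int totalHits) {
        int keep = page * pageSize;                        // queue grows with page depth
        PriorityQueue<Float> best = new PriorityQueue<>(); // min-heap of scores
        Random rnd = new Random(42);                       // fixed seed: deterministic demo
        int ops = 0;
        for (int i = 0; i < totalHits; i++) {
            float score = rnd.nextFloat();
            if (best.size() < keep) {
                best.add(score);
                ops++;
            } else if (score > best.peek()) {
                best.poll();                               // evict current worst
                best.add(score);
                ops += 2;
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        int hits = 1_000_000, pageSize = 1000;
        System.out.println("page 1:   " + heapOpsForPage(1, pageSize, hits) + " heap ops");
        System.out.println("page 100: " + heapOpsForPage(100, pageSize, hits) + " heap ops");
    }
}
```

The work per query grows with page depth even though the user only sees 1000 rows, which is why paging via row keys held in the index (and fetching content from HBase or a cache) is attractive.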
