Interesting question.
Would be grand if you didn't have to duplicate the hbase data in the
lucene index, just store the hbase locations -- or, just store small
stuff in the lucene index and leave big-stuff back in hbase -- but
perhaps the double hop of lucene first and then to hbase will not
perform well enough? HBase 0.19.0 will be better than 0.18.0 if you can
wait a week or so for the release candidate to test.
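The "store the hbase locations" idea above amounts to a two-hop lookup: the Lucene index keeps only the small search fields plus the HBase row key, and the full record is fetched from HBase by that key. A minimal sketch of the pattern, using plain maps to stand in for the index and the table (the class and method names are illustrative, not real Lucene or HBase API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HybridLookup {
    // Stand-in for the Lucene index: search term -> matching row keys.
    static Map<String, List<String>> index = new HashMap<>();
    // Stand-in for the HBase table: row key -> full record.
    static Map<String, String> table = new HashMap<>();

    static {
        index.put("puma", List.of("row-001", "row-002"));
        table.put("row-001", "full record 1");
        table.put("row-002", "full record 2");
    }

    // First hop: the index returns only row keys.
    // Second hop: each key is resolved against the table, like a Get by row key.
    static List<String> lookup(String term) {
        List<String> records = new ArrayList<>();
        for (String rowKey : index.getOrDefault(term, List.of())) {
            records.add(table.get(rowKey));
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(lookup("puma"));
    }
}
```

The index stays small because it never duplicates the big payload; the cost is one extra fetch per hit, which is exactly the double hop whose performance is in question.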
Let us know how it goes Tim,
St.Ack
tim robertson wrote:
Hi All,
I have HBase running now, building Lucene indexes on Hadoop
successfully and then I will get Katta running for distributing my
indexes.
I have around 15 search fields indexed, and I wish to return all 15 to
the user in the result set - my result sets will be up to millions of
records...
Should I:
a) store the values in the Lucene index, which will make searching
slower but returns the results immediately in pages without hitting
HBase
or
b) not store the data in the index, but page over the Lucene index
and do millions of "get by ROWKEY" calls on HBase
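One way to soften option (b) is to issue the gets lazily, per page: page over the index hits and fetch only the rows on the page being served, so the millions of gets are spread across page requests rather than done up front. A small sketch of that slicing, with an illustrative pageOf helper (not a real Lucene or HBase call):

```java
import java.util.List;

public class Paging {
    // Return the slice of row keys belonging to the given 0-based page,
    // so only those keys need a "get by ROWKEY" when the page is served.
    static List<String> pageOf(List<String> hits, int page, int pageSize) {
        int from = page * pageSize;
        int to = Math.min(from + pageSize, hits.size());
        return from >= to ? List.of() : hits.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> hits = List.of("r0", "r1", "r2", "r3", "r4");
        System.out.println(pageOf(hits, 1, 2)); // second page: [r2, r3]
    }
}
```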
Obviously this is not happening synchronously while the user waits,
but I am looking forward to hearing whether people have done similar
scenarios and what worked out nicely...
Lucene degrades in performance at large page numbers (e.g. the 100th
page of 1000 results), right?
Thanks for any insights,
Tim