Thanks Jonathan,

Your feedback really is appreciated - I will just keep pestering the list and try and get to more specific problems as I come across them!
Cheers,
Tim

On Wed, Dec 17, 2008 at 5:46 PM, Jonathan Gray <[email protected]> wrote:
> Tim,
>
> Your scenario is certainly a familiar one; distribution of your mysql solution will be cumbersome at best and performance will degrade quickly.
>
> Since ranking is not so important, you really won't have any issues with sharding your index in Solr, or Katta and Lucene. Distributed Solr queries are really just performing the query on each index and merging the results; this should scale fairly well for you it seems. And yes, by relevancy ranking I mean search ranking (relevancy of the results to your original query).
>
> Given the wide range of potential result sizes you are dealing with (most you say are small, but some could be 10M+), you'll need to either design a principled approach that will give you the best average case, or start thinking about dealing with them differently. It certainly does seem clear that HBase is a great persistent storage solution for you regardless of how you handle querying.
>
> Tokyo Cabinet is a very simple piece of software: C implementations of disk-based key/value hash tables and B-trees. You can get constant-time access to a very large set of data with surprisingly efficient use of disk space and memory (a small index is kept in memory, so fetches generally take a single seek).
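>
> To give you a flavor, the Java binding looks roughly like this (a from-memory sketch, so double-check the Tokyo Cabinet docs for exact signatures; the file name is made up):
>
>   import tokyocabinet.HDB;
>
>   public class TokyoSketch {
>     public static void main(String[] args) {
>       HDB hdb = new HDB();
>       // open (or create) a disk-based hash database file
>       if (!hdb.open("records.tch", HDB.OWRITER | HDB.OCREAT)) {
>         System.err.println("open error: " + hdb.errmsg(hdb.ecode()));
>         return;
>       }
>       hdb.put("rowkey-123", "the raw record");
>       String value = hdb.get("rowkey-123"); // generally a single disk seek
>       System.out.println(value);
>       hdb.close();
>     }
>   }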
>
> My suggestion to you would be to start (or continue) experimenting. For one, again, it seems clear that you should insert/persist your data in HBase and query from there to build your indexes. After that, you should look at the tradeoffs between keeping the raw data in Lucene versus keeping it external. Rather than focusing on the time to fetch the records when keeping them external (lots can be done to speed that up), let's see how it affects the size of your index and the time to execute the queries. I still think the hardest piece to scale is the sharded index, because performance can degrade quickly as it grows, but I'm unsure what effect storing the raw data will have on index size and query time.
>
> As far as getting involved, this is about the extent to which I'm able to. Keep sending your findings and questions to the list and I or others will be more than happy to respond. If you need someone to review your code, scope out your cluster, answer a quick question, etc., you might also jump into the IRC channel #hbase on freenode where many of us hang out.
>
> Good luck with everything. Looking forward to seeing your results.
>
> JG
>
>> -----Original Message-----
>> From: tim robertson [mailto:[email protected]]
>> Sent: Wednesday, December 17, 2008 8:11 AM
>> To: [email protected]
>> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>>
>> Hey Jonathan,
>>
>> Firstly, thank you for your considered response and offer of help... this is much appreciated.
>>
>> History:
>> So, we have a mysql solution now, and a big fat server (64G ram), with 150M or so records in the biggest table. It acts as an index of several thousand resources that are harvested from around the world (all biodiversity data, and all of it open source / 'open access with citation'). Because it is an index, and because of the nature of the data (150M specimens tied to hierarchical trees - "taxonomy"), it does not partition nicely (geospatially, taxonomically, temporally), since all partitions would need to be hit for one query or another. As we have grown we have had to limit the fields we offer as search fields. We expect and aim to grow to billions in the coming months. This is probably a familiar scenario, huh?
>>
>> I am working alone and in my 'spare time' to try to produce a prototype HDFS-based version as I described, learning as fast as I can - I must admit that Hadoop, HBase and Lucene are all new to me in recent months. I blog things (I am not an eloquent blogger as it is normally late on Sunday night) here: http://biodivertido.blogspot.com/
>>
>> To specifically answer your questions:
>> - ranking is largely irrelevant for us, since all records are considered equal
>> - I am not entirely sure what "relevancy ranking" means and assume it is the search ranking?
>> - regarding the percentage - well, this really depends:
>>   - most searches will return small results - so an index makes sense
>>   - also, I look to generate maps quickly, so might need latitude and longitude in indexes
>>   - we need to offer slicing of data
>>   - we are an international network, and countries / thematic networks want data dumps for their own analysis
>>   - therefore the USA is 50%+ of the data, but a specialist butterfly site is small.... so perhaps use an index to do a count and then pick a slice strategy?
>>
>> So you have now added Tokyo to the long list of things I need to learn - but I will look into it...
>>
>> My work so far is in my spare time and on EC2 and S3, and I am working alone - I am not sure how interested you are in getting involved? I would greatly appreciate a knowledgeable pair of eyes to sanity-check things I am doing (e.g. scanning HBase, parsing XML to raw fields, then scanning again and parsing to interpreted values, etc.), but would welcome anyone who is interested in contributing. It might take some set-up time though, as I have all the pieces working in isolation and don't keep EC2 running all the time.
>>
>> Thanks again, I'll continue to post progress.
>>
>> Tim
>>
>> On Wed, Dec 17, 2008 at 4:20 PM, Jonathan Gray <[email protected]> wrote:
>> > Hey Tim,
>> >
>> > I have dabbled with sharding of a Solr index. We applied a consistent hashing algorithm to our IDs to determine which node to insert to; a sketch of the idea is below.
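>> >
>> > Roughly like this (a toy illustration of the idea, not our actual code; the hash choice and replica count are arbitrary):
>> >
>> >   import java.security.MessageDigest;
>> >   import java.util.SortedMap;
>> >   import java.util.TreeMap;
>> >
>> >   /** Toy consistent-hash ring: maps a document ID to an index shard/node. */
>> >   public class ShardRing {
>> >     private final TreeMap<Long, String> ring = new TreeMap<Long, String>();
>> >
>> >     public void addNode(String node, int virtualNodes) {
>> >       // virtual nodes smooth out the key distribution across shards
>> >       for (int i = 0; i < virtualNodes; i++)
>> >         ring.put(hash(node + "#" + i), node);
>> >     }
>> >
>> >     /** First node clockwise from the ID's position on the ring. */
>> >     public String nodeFor(String docId) {
>> >       SortedMap<Long, String> tail = ring.tailMap(hash(docId));
>> >       return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
>> >     }
>> >
>> >     private long hash(String s) {
>> >       try {
>> >         byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
>> >         return ((long) (d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
>> >             | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
>> >       } catch (Exception e) {
>> >         throw new RuntimeException(e);
>> >       }
>> >     }
>> >   }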
>> >
>> > One downside, and I am not sure if this exists with Katta, is that you don't have good relevancy across indexes. For example, distributed querying is really just querying each shard. Unfortunately the relevancy ranking is only relevant within each individual index; there is no global rank. One idea is to shard based on another parameter, which might allow you to apply relative-relevancy ;) given any domain-specific information.
>> >
>> > I'm very interested in your problem. Right now our indexes are small enough that we will be able to get by with 1 or 2 well-equipped nodes, but soon enough we will outgrow that and be looking at sharding across 5-10 nodes. Our results are usually "page" size (10-20), so we don't have the same issue with how to efficiently fetch them.
>> >
>> > In these cases where you might be looking for 10M records, what percentage of the total dataset is that? 100M, 1B? If you turn to a full scan as your solution, you're going to seriously limit how fast you can go, even with good caching and faster IO. But if you're returning a significant fraction of the total rows, then this would definitely make sense.
>> >
>> > If your data is relatively static, you might look at a very simple disk-based key/value store like Berkeley DB or my favorite, Tokyo Cabinet. These can handle high numbers of records, stored on disk but accessible in sub-ms time. I have C and Java code to work with Tokyo and HBase together. With such a high number of records, it's probably not feasible to keep them in memory, so a solution like this could be your best bet. Also, stay tuned to this issue, as using Direct IO would create a situation similar to running a disk-based key/value store (preliminary testing shows a 10X random-read improvement): https://issues.apache.org/jira/browse/HADOOP-4801
>> >
>> > And this is an old issue that will have new life soon: https://issues.apache.org/jira/browse/HBASE-80
>> >
>> > Like I said, I have an interest in seeing how to solve this problem, so let me know if you have any other questions or if we can help in any way.
>> >
>> > Jonathan Gray
>> >
>> >> -----Original Message-----
>> >> From: tim robertson [mailto:[email protected]]
>> >> Sent: Tuesday, December 16, 2008 11:42 PM
>> >> To: [email protected]
>> >> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>> >>
>> >> Hi,
>> >>
>> >> Thanks for the help.
>> >>
>> >> My Lucene indexes are for sure going to be too large for one machine, so I plan to put the indexes on HDFS and then let Katta distribute them around a few machines. Because of Katta's ability to do this, I went for Lucene and not SOLR, which requires me to do all the sharding myself, if I understand distributed SOLR correctly - though I would much prefer SOLR's handling of primitives, as right now I convert all dates and ints manually. If someone has distributed SOLR (where it really is too big for one machine, since indexes are >50G), I'd love to hear how they sharded nicely and manage it.
>> >>
>> >> Regarding performance... well, for "reports" that will return 10M records, I will be quite happy with minutes as a response time, as this is typically data download for scientific analysis, and therefore people are happy to wait. The results get put onto Amazon S3 gzipped for download. What worries me is that if I have 10-100 reports running at one time, there is an awful lot of single-record requests on HBase. I guess I will try and blog the findings.
>> >>
>> >> I am following the HBase, Katta and Hadoop code trunks, so will also try to always use the latest, as this is a research project and not production right now (production is still mysql-based).
>> >>
>> >> The alternative of course is to always open a scanner and then do a full table scan for each report...
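>> >>
>> >> In code, the two access patterns look something like this (an untested sketch against the 0.18-era client API as best I understand it - the API moves between releases, and the table/column names are just placeholders):
>> >>
>> >>   import java.io.IOException;
>> >>   import org.apache.hadoop.hbase.HBaseConfiguration;
>> >>   import org.apache.hadoop.hbase.HConstants;
>> >>   import org.apache.hadoop.hbase.client.HTable;
>> >>   import org.apache.hadoop.hbase.client.Scanner;
>> >>   import org.apache.hadoop.hbase.io.Cell;
>> >>   import org.apache.hadoop.hbase.io.RowResult;
>> >>
>> >>   public class ReportSketch {
>> >>     public static void main(String[] args) throws IOException {
>> >>       HTable table = new HTable(new HBaseConfiguration(), "occurrence");
>> >>       byte[] col = "record:raw".getBytes(); // "family:qualifier", made up here
>> >>
>> >>       // Pattern 1: one random fetch per Lucene hit
>> >>       RowResult row = table.getRow("some-row-key".getBytes());
>> >>       Cell cell = row.get(col);
>> >>       byte[] value = cell.getValue();
>> >>
>> >>       // Pattern 2: one full table scan per report
>> >>       Scanner scanner = table.getScanner(new byte[][] { col }, HConstants.EMPTY_START_ROW);
>> >>       RowResult r;
>> >>       while ((r = scanner.next()) != null) {
>> >>         // filter rows and write matches out to the report here
>> >>       }
>> >>       scanner.close();
>> >>     }
>> >>   }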
>> >>
>> >> Thanks
>> >>
>> >> Tim
>> >>
>> >> On Wed, Dec 17, 2008 at 12:22 AM, Jonathan Gray <[email protected]> wrote:
>> >> > If I understand your system (and Lucene) correctly, you obviously must feed all queried fields to Lucene, and the indexes will be stored for the documents.
>> >> >
>> >> > Your question is about whether to also store the raw fields in Lucene, or just store the indexes in Lucene?
>> >> >
>> >> > A few things you might consider...
>> >> >
>> >> > - Scaling Lucene is much more difficult than scaling HBase. Storing indexes and raw content is going to grow your Lucene instance fast. Scaling HBase is easy, and you're going to have constant performance, whereas Lucene performance will degrade significantly as it grows.
>> >> >
>> >> > - Random access to HBase currently leaves something to be desired. What kind of performance are you looking for with 1M random fetches? There is major work being done for 0.19 and 0.20 that will really help with performance, as stack mentioned.
>> >> >
>> >> > - With 1M random reads, you might never get the performance out of HBase that you want, certainly not if you're expecting 1M fetches to be done in "realtime" (~100ms or so). However, depending on your dataset and access patterns, you might be able to get sufficient performance with caching (either block caching, which is currently available, or record caching, slated for 0.20 but likely with a patch available soon).
>> >> >
>> >> > We are using Lucene by way of Solr and are not storing the raw data in Lucene. We have an external Memcached-like cache so that our raw content fetches are sufficiently quick. My team is currently working on building this cache into HBase.
>> >> >
>> >> > I'm not sure if the highlighting features in Solr are only part of Solr or also in Lucene, but of course you lose the ability to do those things if you don't put the raw content into Lucene.
>> >> >
>> >> > JG
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: stack [mailto:[email protected]]
>> >> >> Sent: Tuesday, December 16, 2008 2:37 PM
>> >> >> To: [email protected]
>> >> >> Subject: Re: Lucene from HBase - raw values in Lucene index or not?
>> >> >>
>> >> >> Interesting question.
>> >> >>
>> >> >> Would be grand if you didn't have to duplicate the hbase data in the lucene index, just store the hbase locations -- or, just store small stuff in the lucene index and leave big stuff back in hbase -- but perhaps the double hop of lucene first and then to hbase will not perform well enough? 0.19.0 hbase will be better than 0.18.0 if you can wait a week or so for the release candidate to test.
>> >> >>
>> >> >> Let us know how it goes Tim,
>> >> >> St.Ack
>> >> >>
>> >> >> tim robertson wrote:
>> >> >> > Hi All,
>> >> >> >
>> >> >> > I have HBase running now, am building Lucene indexes on Hadoop successfully, and next I will get Katta running to distribute my indexes.
>> >> >> >
>> >> >> > I have around 15 search fields indexed, which I wish to extract and return to the user in the result set - my result sets will be up to millions of records...
>> >> >> >
>> >> >> > Should I:
>> >> >> >
>> >> >> > a) have the values stored in the Lucene index, which will make it slower to search but returns the results immediately in pages without hitting HBase
>> >> >> >
>> >> >> > or
>> >> >> >
>> >> >> > b) not store the data in the index, but page over the Lucene index and do millions of "get by ROWKEY" calls on HBase?
>> >> >> >
>> >> >> > Obviously this is not happening synchronously while the user waits, but I am looking forward to hearing whether people have done similar scenarios and what worked out nicely...
>> >> >> >
>> >> >> > Lucene degrades in performance at large page numbers (100th page of 1000 results), right?
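>> >> >> >
>> >> >> > To make (a)/(b) and the paging cost concrete, here is roughly what I mean against the Lucene 2.x API (just a sketch - the field names are invented and I have not run this):
>> >> >> >
>> >> >> >   import java.io.IOException;
>> >> >> >   import org.apache.lucene.document.Document;
>> >> >> >   import org.apache.lucene.document.Field;
>> >> >> >   import org.apache.lucene.search.IndexSearcher;
>> >> >> >   import org.apache.lucene.search.Query;
>> >> >> >   import org.apache.lucene.search.TopDocs;
>> >> >> >
>> >> >> >   public class PagingSketch {
>> >> >> >
>> >> >> >     static Document makeDoc(String genus, String rowKey) {
>> >> >> >       Document doc = new Document();
>> >> >> >       // a) value is searchable AND stored, so results come straight from the index
>> >> >> >       doc.add(new Field("genus", genus, Field.Store.YES, Field.Index.TOKENIZED));
>> >> >> >       // b) value is searchable only; store just the HBase row key for later fetching
>> >> >> >       // doc.add(new Field("genus", genus, Field.Store.NO, Field.Index.TOKENIZED));
>> >> >> >       doc.add(new Field("rowkey", rowKey, Field.Store.YES, Field.Index.NO));
>> >> >> >       return doc;
>> >> >> >     }
>> >> >> >
>> >> >> >     // Page p means collecting (p + 1) * size hits and discarding the first
>> >> >> >     // p * size, which is why deep pages get slow.
>> >> >> >     static void page(IndexSearcher searcher, Query query, int p, int size) throws IOException {
>> >> >> >       TopDocs top = searcher.search(query, null, (p + 1) * size);
>> >> >> >       for (int i = p * size; i < top.scoreDocs.length; i++) {
>> >> >> >         Document hit = searcher.doc(top.scoreDocs[i].doc);
>> >> >> >         String rowKey = hit.get("rowkey"); // with b), now getRow(rowKey) against HBase
>> >> >> >       }
>> >> >> >     }
>> >> >> >   }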
>> >> >> >
>> >> >> > Thanks for any insights,
>> >> >> >
>> >> >> > Tim
