Thanks, Alex, for your help.
Cheers
TuX

Alex Baranov wrote:
_If searching is your only use case_, then you might think about Solr. It can
handle "a few tens of millions" of documents gracefully. Solr also has
solutions for splitting the index and for replication (for load balancing).
Another thing I'd suggest considering is Lucandra (
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
).
There is also some movement around building a Lucene index in HBase (
http://www.search-hadoop.com/m?id=201003091546.40151.tho...@koch.ro).

Check the http://wiki.apache.org/hadoop/DistributedLucene page for similar
solutions.

Alex Baranau

http://sematext.com
http://en.wordpress.com/tag/hadoop-ecosystem-digest/

On Thu, Mar 11, 2010 at 11:40 PM, TuX RaceR <tuxrace...@gmail.com> wrote:

Hi Alex

Thanks again for your detailed answer.


Alex Baranov wrote:

 So, 2 to 50 columns in each row. If a single row is not large (in bytes)
 and the request load (the number of concurrent clients performing the
described queries) is heavy, then you should probably consider simple data
duplication, i.e. rows with composite keys (which you've put in the "Indexes"
table) will contain all the data you need.

Yes, that is what I thought too after reading about typical read performance
at:


http://www.search-hadoop.com/m?id=7c962aed1001141446v467a295ctd86f0e8a3ef77...@mail.gmail.com
Having to do 100 random accesses to generate just one web page would be too
costly.
All the more so since the list pages pointing to the documents do not need to
show the whole document content, just the information necessary to generate a
link and maybe a short summary.
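
Just to check that I understand the duplication idea, here is a rough sketch
of what I have in mind, using the plain HBase 0.20 client API (the "Indexes"
table name, the "info" family and the <value>\0<docId> key layout are only
placeholders I made up):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexDuplicationSketch {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable indexes = new HTable(conf, "Indexes"); // placeholder table name

    // Example values; in reality they come from the document being indexed.
    String tag = "some-tag";
    String docId = "doc-42";
    String link = "/docs/doc-42";
    String summary = "short summary shown on the list page";

    // Write: composite row key = <tag>\0<docId>; the columns duplicate only
    // the fields a list page needs (link, short summary), not the whole doc.
    Put put = new Put(Bytes.toBytes(tag + "\0" + docId));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("link"), Bytes.toBytes(link));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("summary"), Bytes.toBytes(summary));
    indexes.put(put);

    // Read: one sequential scan over the <tag> prefix returns a whole list
    // page, instead of ~100 random gets against the document table.
    Scan scan = new Scan(Bytes.toBytes(tag + "\0"), Bytes.toBytes(tag + "\1"));
    scan.setCaching(100);
    ResultScanner results = indexes.getScanner(scan);
    for (Result r : results) {
      byte[] l = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("link"));
      byte[] s = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("summary"));
      System.out.println(Bytes.toString(l) + " - " + Bytes.toString(s));
    }
    results.close();
  }
}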



 Given that the total count of all rows would be 1-10 billion, this might
work well for you. Of course, this works best if your data doesn't change
over time (is immutable).

Yes, the data may change over time, although not very often. This is the
biggest headache I have when designing a solution using HBase ;) : updating
indexes.
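
For what it's worth, here is roughly what I picture each update looking like
with such hand-made index rows (again only a sketch, reusing the made-up
"Indexes" table from above plus a "Documents" table; since there is no
transaction across the two tables, a reader could briefly see a stale or
missing index entry):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexUpdateSketch {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable indexes = new HTable(conf, "Indexes");     // placeholder names
    HTable documents = new HTable(conf, "Documents");

    String docId = "doc-42";
    String oldTag = "old-tag";   // value the existing index row was built from
    String newTag = "new-tag";   // new value after the document changed
    String summary = "updated short summary";

    // 1. Remove the index row keyed by the old value.
    indexes.delete(new Delete(Bytes.toBytes(oldTag + "\0" + docId)));

    // 2. Insert the index row keyed by the new value.
    Put indexPut = new Put(Bytes.toBytes(newTag + "\0" + docId));
    indexPut.add(Bytes.toBytes("info"), Bytes.toBytes("summary"), Bytes.toBytes(summary));
    indexes.put(indexPut);

    // 3. Update the document row itself. No transaction spans the two puts
    // and the delete, so readers may briefly see inconsistent data.
    Put docPut = new Put(Bytes.toBytes(docId));
    docPut.add(Bytes.toBytes("content"), Bytes.toBytes("tag"), Bytes.toBytes(newTag));
    documents.put(docPut);
  }
}

That delete + re-insert step is exactly the part I would like a secondary
index implementation to handle for me.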


 Also, have you considered IHBase and related secondary index
implementations: the one from the transactional contrib, lucene-hbase?


I have already looked closely at Solr and a bit at ElasticSearch.
I was interested in HBase because of its scaling capabilities.
My current system is beginning to show the limits of PostgreSQL. I could
move the slow requests that rely on a SQL index to a Lucene-based index, and
then move to HBase when the site gets bigger. Or I could invest the time now
in an HBase solution and skip the intermediary (Lucene-based) stage.
That's not decided yet.

I do not understand HBase secondary indexes very well, nor their advantages
over hand-made indexes. Does using HBase secondary indexes help when the data
is mutable? Or is the benefit that the indexes are created in a transaction,
making data consistency stronger?

Thanks
TuX


