Re: Index size and performance degradation

Itamar Syn-Hershko Sun, 12 Jun 2011 13:48:30 -0700

Thanks for your detailed answer. We'll have to tackle this and see what's more important to us then. I'd definitely love to hear that Zoie has overcome all that...

Any pointers to Michael Busch's approach? I take it this has something to do with the core itself or the index format, probably using the Flex version?



Itamar.


On 12/06/2011 23:12, Michael McCandless wrote:

From what I understand of Zoie (and it's been some time since I last
looked... so this could be wrong now), the biggest difference vs NRT
is that Zoie aims for "immediate consistency", ie index changes are
always made visible to the very next query, vs NRT which is
"controlled consistency", a blend between immediate and eventual
consistency where your app decides when the changes must become
visible.

But in exchange for that, Zoie pays a price: each search has a higher
cost per collected hit, since it must post-filter for deleted docs.
And since Zoie necessarily adds complexity, there's more risk; eg
there were some nasty Zoie bugs that took quite some time to track
down (under https://issues.apache.org/jira/browse/LUCENE-2729).

Anyway, I don't think that's a good tradeoff, in general, for our
users, because very few apps truly require immediate consistency from
Lucene (can anyone give an example where their app depends on
immediate consistency...?).  I think it's better to spend time during
reopen so that searches aren't slower.
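
The visibility difference described above can be illustrated with a toy model. This is only a sketch of the semantics: the class `ToyIndex` and its methods are hypothetical names invented for illustration, and nothing here reflects how Lucene's NRT or Zoie is actually implemented. Python is used purely to keep the sketch short.

```python
class ToyIndex:
    """Toy model of change visibility; not Lucene's or Zoie's implementation."""

    def __init__(self):
        self._docs = []      # every document added so far
        self._visible = 0    # how many of them searches are allowed to see

    def add(self, doc):
        """Buffer a change; it is not searchable until reopen() is called."""
        self._docs.append(doc)

    def reopen(self):
        """Publish all pending changes to subsequent searches."""
        self._visible = len(self._docs)

    def search(self, term):
        """Search only the documents that were visible at the last reopen."""
        return [d for d in self._docs[:self._visible] if term in d]
```

In this model, "controlled consistency" means the application calls reopen() only when a query really must see the latest changes, while "immediate consistency" would correspond to every search seeing every prior add().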

That said, Lucene has already incorporated one big part of Zoie
(caching small segments in RAM) via the new NRTCachingDirectory (in
contrib/misc).  Also, the upcoming NRTManager
(https://issues.apache.org/jira/browse/LUCENE-2955) adds control over
visibility of specific indexing changes to queries that need to see
the changes.

Finally, even better would be to not have to make any tradeoff
whatsoever ;)  Twitter's approach (created by Michael Busch) seems to
bring immediate consistency with no search performance hit, so if we
do anything here likely it'll be similar to what Michael has done
(though, those changes are not simple either!).

Mike McCandless

http://blog.mikemccandless.com

On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:

Mike,


Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently
isn't fast enough if Zoie was needed, and now that Zoie is around, are there
any plans to make it Lucene's default? Or: why would one still use NRT when
Zoie seems to work much better?


Itamar.


On 12/06/2011 13:16, Michael McCandless wrote:

Remember that memory-mapping is not a panacea: at the end of the day,
if there just isn't enough RAM on the machine to keep your full
"working set" hot, then the OS will have to hit the disk, regardless
of whether the access is through MMap or a "traditional" IO request.
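
As a rough illustration of the two access paths mentioned above, the following sketch reads the same byte range once through a memory mapping and once through an ordinary seek-and-read. It uses only Python's standard library rather than Lucene's MMapDirectory or NIOFSDirectory, and the helper names (`read_with_mmap`, `read_with_io`, `make_sample_file`) are made up for this example. Both paths return identical bytes, and in both cases the OS decides whether the data comes from RAM or from disk.

```python
import mmap
import os
import tempfile


def read_with_mmap(path, offset, length):
    """Read `length` bytes at `offset` through a read-only memory mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing the mapping touches the pages; the OS faults them in
            # from disk if they are not already cached in RAM.
            return m[offset:offset + length]


def read_with_io(path, offset, length):
    """Read the same range with an explicit seek followed by a read call."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def make_sample_file(size_bytes):
    """Create a temporary file with a repeating byte pattern; returns its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 251 for i in range(size_bytes)))
    return path
```

A fair comparison of the two would time many random reads over a file larger than available RAM, since a small file ends up fully cached either way.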

That said, on Fedora Linux anyway, I generally see better performance
from MMap than from NIOFSDir; eg see the 2nd chart here:


http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html

Mike McCandless

http://blog.mikemccandless.com

On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko <ita...@code972.com> wrote:

Thanks.


The whole point of my question was to find out whether and how to do the
balancing on the SAME machine. Apparently that's not going to help, and at
a certain point we will just have to prompt the user to buy more hardware...


Out of curiosity, isn't there anything that we can do to avoid that? For
instance, using memory-mapped files for the indexes? Anything that would
help us overcome OS limitations of that sort...


Also, you mention a scheduled job to check for performance degradation;
any idea how serious such a drop should be for sharding to be really
beneficial? Or is it application-specific too?


Itamar.


On 12/06/2011 06:43, Shai Erera wrote:

I agree w/ Erick, there is no cutoff point (index size for that matter)
above which you start sharding.

What you can do is create a scheduled job in your system that runs a
select list of queries and monitors their performance. Once it degrades,
it shards the index by either splitting it (you can use IndexSplitter
under contrib) or creating a new shard and directing new documents to it.
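
A minimal sketch of such a monitoring job follows, assuming the application supplies a `run_query` callable that executes one query against the index. Every name here (`median_latency_ms`, `should_shard`, `max_slowdown`) is hypothetical, and the default slowdown factor of 2.0 is an arbitrary placeholder: the thread gives no threshold, and the right value depends on the hardware and the application.

```python
import statistics
import time


def median_latency_ms(run_query, queries, repeats=3, clock=time.perf_counter):
    """Run each query `repeats` times and return the median latency in ms.

    `run_query` is supplied by the caller; `clock` is injectable so the
    function can be exercised without real timing.
    """
    samples = []
    for q in queries:
        for _ in range(repeats):
            start = clock()
            run_query(q)
            samples.append((clock() - start) * 1000.0)
    return statistics.median(samples)


def should_shard(baseline_ms, current_ms, max_slowdown=2.0):
    """Return True when latency exceeds the baseline by the chosen factor.

    max_slowdown=2.0 is an assumption for illustration, not a recommendation.
    """
    return current_ms > baseline_ms * max_slowdown
```

A scheduler would record a baseline with median_latency_ms() while the index is still small, rerun the same query list periodically, and trigger the split or the new shard when should_shard() returns True.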

I think I read somewhere, not sure if it was in Solr or ElasticSearch
documentation, about a Balancer object, which moves shards around in
order to balance the load on the cluster. You can implement something
similar which tries to balance the index sizes, creates new shards
on-the-fly, and even merges shards if a whole source is suddenly removed
from the system, etc.

Also, note that the 'largest index size' threshold is really a machine
constraint and not Lucene's. So if you decide that 10 GB is your cutoff,
it is pointless to create 10x10GB shards on the same machine -- searching
them is just like searching a 100GB index w/ 10x10GB segments. Perhaps
it's even worse because you consume more RAM when the indexes are split
(e.g., terms index, field infos etc.).

Shai

On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:

<<<We can't assume anything about the machine running it,
so testing won't really tell us much>>>

Hmmm, then it's pretty hopeless I think. Problem is that
anything you say about running on a machine with
2G available memory on a single processor is completely
incomparable to running on a machine with 64G of
memory available for Lucene and 16 processors.

There's really no such thing as an "optimum" Lucene index
size, it always relates to the characteristics of the
underlying hardware.

I think the best you can do is actually test on various
configurations, then at least you can say "on configuration
X this is the tipping point".

Sorry there isn't a better answer that I know of, but...

Best
Erick

On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:

Hi all,

I know Lucene indexes to be at their optimum up to a certain size - said
to be around several GBs. I haven't found a good discussion of this, but
it's my understanding that at some point it's better to split an index
into parts (a la sharding) than to continue searching on a huge-size
index. I assume this has to do with OS and IO configurations. Can anyone
point me to more info on this?

We have a product that is using Lucene for various searches, and at the
moment each type of search is using its own Lucene index. We plan on
refactoring the way it works and to combine all indexes into one - making
the whole system more robust and with a smaller memory footprint, among
other things.

Assuming the above is true, we are interested in knowing how to do this
correctly. Initially all our indexes will be run in one big index, but if
at some index size there is a severe performance degradation we would like
to handle that correctly by starting a new FSDirectory index to flush
into, or by re-indexing and moving large indexes into their own Lucene
index.

Are there any guidelines for measuring or estimating this correctly?
What should we be aware of while considering all that? We can't assume
anything about the machine running it, so testing won't really tell us
much...

Thanks in advance for any input on this,

Itamar.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
