Re: benchmark drop for PrimaryKey

2018-08-24 Thread Michael Sokolov
In fact I see a pronounced effect even with the smallish (10k) index! And I
should correct my earlier statement about FST50: my earlier test was
flawed. I was confused about how these benchmarks work and updated
nightlyBench.py rather than my localrun.py. After correcting that and
comparing FST50 with Memory, I see that it does recover the lost perf in
this benchmark. In three runs it even seems to be a consistent
improvement over Memory, although these test results are quite noisy, so
that may not be accurate.
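
Since the results are noisy, one rough way to judge whether a difference between two postings formats is real is to compare the gap between the mean QPS values against the run-to-run spread. The sketch below is only an illustration: the helper name and the QPS numbers are made up for the example (they are not measurements from these runs), and this is not how luceneutil itself reports significance.

```python
import statistics


def summarize_runs(baseline_qps, candidate_qps):
    """Compare per-run QPS values for two configurations (hypothetical helper)."""
    base_mean = statistics.mean(baseline_qps)
    cand_mean = statistics.mean(candidate_qps)
    # Sample standard deviation; needs at least two runs per side.
    base_sd = statistics.stdev(baseline_qps)
    cand_sd = statistics.stdev(candidate_qps)
    pct_change = 100.0 * (cand_mean - base_mean) / base_mean
    # Crude rule of thumb (an assumption, not a formal test): only call the
    # change real if the gap between means exceeds the combined spread.
    beyond_noise = abs(cand_mean - base_mean) > (base_sd + cand_sd)
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "pct_change": pct_change,
        "beyond_noise": beyond_noise,
    }


# Placeholder numbers for three runs each, NOT real benchmark output.
memory_runs = [100.0, 110.0, 90.0]
fst50_runs = [105.0, 115.0, 95.0]
result = summarize_runs(memory_runs, fst50_runs)
print(result)
```

With only three runs per side, a 5% gap like the one in these placeholder numbers sits well inside the spread, so more runs or a larger index would be needed before drawing a conclusion.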

Maybe we ought to update nightlyBench.py to use the FST50 codec for this
test? I'm not sure what it is trying to demonstrate though: would that be a
"fair" test? At least it would be more faithful to the original version of
the chart. Also, please let me know if these benchmarking discussions
belong elsewhere; I see that luceneutil is not really part of the apache
package per se, but I doubt it has its own mailing list :)

On Fri, Aug 24, 2018 at 3:17 AM Adrien Grand wrote:

> I don't think you need an index that is so large that the terms dictionary
> doesn't fit in the OS cache to reproduce the difference, but you might need
> a larger index indeed. On my end I use wikimedium10M or wikimediumall (and
> wikibigall if I need to test phrases) most of the time as I get more noise
> with smaller indices. I added an annotation; it should be picked up the
> next time the benchmarks run.
>
> I also pushed a change to take into account the fact that the default
> codec changed. However, I did not add backward-codecs.jar to the classpath,
> so you should rebuild the index that you use for benchmarking; it will then
> use the Lucene80 codec instead of Lucene70.
>
> On Fri, Aug 24, 2018 at 2:03 AM Michael Sokolov wrote:
>
>> I think the benchmarks need updating after LUCENE-8461. I got them
>> working again by replacing lucene70 with lucene80 everywhere except for the
>> DocValues formats, and adding the backward-codecs.jar to the benchmarks
>> build. I'm not sure that was really the right way to go about this? After
>> that I did try switching to use FST50 for this PKLookup benchmark (see
>> below), but it did not recover the lost perf.
>>
>> diff --git a/src/python/nightlyBench.py b/src/python/nightlyBench.py
>> index b42fe84..5807e49 100644
>> --- a/src/python/nightlyBench.py
>> +++ b/src/python/nightlyBench.py
>> @@ -699,7 +699,7 @@ def run():
>> -  idFieldPostingsFormat='Lucene50',
>> +  idFieldPostingsFormat='FST50',
>>
>>
>> On Thu, Aug 23, 2018 at 5:52 PM Michael Sokolov 
>> wrote:
>>
>>> OK thanks. I guess this benchmark must be run on a large-enough index
>>> that it doesn't fit entirely in RAM already anyway? When I ran it locally
>>> using the vanilla benchmark instructions, I believe the generated index was
>>> quite small (wikimedium10k).  At any rate, I don't have any specific use
>>> case yet, just thinking about some possibilities related to primary key
>>> lookup and came across this anomaly. Perhaps at least it deserves an
>>> annotation on the benchmark graph.
>>>
>>


Re: benchmark drop for PrimaryKey

2018-08-24 Thread Adrien Grand
I don't think you need an index that is so large that the terms dictionary
doesn't fit in the OS cache to reproduce the difference, but you might need
a larger index indeed. On my end I use wikimedium10M or wikimediumall (and
wikibigall if I need to test phrases) most of the time as I get more noise
with smaller indices. I added an annotation; it should be picked up the
next time the benchmarks run.

I also pushed a change to take into account the fact that the default codec
changed. However, I did not add backward-codecs.jar to the classpath, so you
should rebuild the index that you use for benchmarking; it will then use the
Lucene80 codec instead of Lucene70.

On Fri, Aug 24, 2018 at 2:03 AM Michael Sokolov wrote:

> I think the benchmarks need updating after LUCENE-8461. I got them working
> again by replacing lucene70 with lucene80 everywhere except for the
> DocValues formats, and adding the backward-codecs.jar to the benchmarks
> build. I'm not sure that was really the right way to go about this? After
> that I did try switching to use FST50 for this PKLookup benchmark (see
> below), but it did not recover the lost perf.
>
> diff --git a/src/python/nightlyBench.py b/src/python/nightlyBench.py
> index b42fe84..5807e49 100644
> --- a/src/python/nightlyBench.py
> +++ b/src/python/nightlyBench.py
> @@ -699,7 +699,7 @@ def run():
> -  idFieldPostingsFormat='Lucene50',
> +  idFieldPostingsFormat='FST50',
>
>
> On Thu, Aug 23, 2018 at 5:52 PM Michael Sokolov 
> wrote:
>
>> OK thanks. I guess this benchmark must be run on a large-enough index
>> that it doesn't fit entirely in RAM already anyway? When I ran it locally
>> using the vanilla benchmark instructions, I believe the generated index was
>> quite small (wikimedium10k).  At any rate, I don't have any specific use
>> case yet, just thinking about some possibilities related to primary key
>> lookup and came across this anomaly. Perhaps at least it deserves an
>> annotation on the benchmark graph.
>>
>


Re: benchmark drop for PrimaryKey

2018-08-23 Thread Michael Sokolov
I think the benchmarks need updating after LUCENE-8461. I got them working
again by replacing lucene70 with lucene80 everywhere except for the
DocValues formats, and adding the backward-codecs.jar to the benchmarks
build. I'm not sure that was really the right way to go about this? After
that I did try switching to use FST50 for this PKLookup benchmark (see
below), but it did not recover the lost perf.

diff --git a/src/python/nightlyBench.py b/src/python/nightlyBench.py
index b42fe84..5807e49 100644
--- a/src/python/nightlyBench.py
+++ b/src/python/nightlyBench.py
@@ -699,7 +699,7 @@ def run():
-  idFieldPostingsFormat='Lucene50',
+  idFieldPostingsFormat='FST50',


On Thu, Aug 23, 2018 at 5:52 PM Michael Sokolov wrote:

> OK thanks. I guess this benchmark must be run on a large-enough index that
> it doesn't fit entirely in RAM already anyway? When I ran it locally using
> the vanilla benchmark instructions, I believe the generated index was quite
> small (wikimedium10k).  At any rate, I don't have any specific use case
> yet, just thinking about some possibilities related to primary key lookup
> and came across this anomaly. Perhaps at least it deserves an annotation on
> the benchmark graph.
>


Re: benchmark drop for PrimaryKey

2018-08-23 Thread Michael Sokolov
OK thanks. I guess this benchmark must be run on a large-enough index that
it doesn't fit entirely in RAM already anyway? When I ran it locally using
the vanilla benchmark instructions, I believe the generated index was quite
small (wikimedium10k).  At any rate, I don't have any specific use case
yet, just thinking about some possibilities related to primary key lookup
and came across this anomaly. Perhaps at least it deserves an annotation on
the benchmark graph.


Re: benchmark drop for PrimaryKey

2018-08-23 Thread David Smiley
Switching to "FST50" ought to bring back much of the benefit of "Memory".

On Thu, Aug 23, 2018 at 5:15 PM Adrien Grand wrote:

> The commit that caused this slowdown might be
> https://github.com/mikemccand/luceneutil/commit/1d8460f342f269c98047def9f9eb76213acae5d9
> .
>
> We indeed no longer have anything that performs as well, but I'm not
> sure this is a big deal. I suspect that postings format had few users,
> one reason being that it was not supported in terms of backward
> compatibility (like any codec but the default one) and another being
> that it used a lot of RAM. In a number of cases, we try to fold the
> benefits of alternative codecs into the default codec. For instance, we
> used to have a "pulsing" postings format that could record postings in the
> terms dictionary in order to save one disk seek, and we ended up folding
> this feature into the default postings format by only enabling it on terms
> that have a document frequency of 1 and index_options=DOCS_ONLY, so that it
> would always be used with primary keys. For the Memory postings format, this
> didn't really make sense, as it managed to be so much faster by
> loading much more information into RAM, which we don't want to do with the
> default codec.
>
> On Thu, Aug 23, 2018 at 10:40 PM Michael Sokolov wrote:
>
>> I happened to stumble across this chart
>> https://home.apache.org/~mikemccand/lucenebench/PKLookup.html showing a
>> pretty drastic drop in this benchmark on 5/13. I looked at the commits
>> between the previous run and this one and did some investigation, trying to
>> do some git bisect to find the problem using benchmarks as a test, but it
>> proved to be quite difficult due to a breaking change re: MemoryCodec that
>> also required corresponding changes in the benchmark code.
>>
>> In the end, I think removing MemoryCodec is what caused the drop in perf
>> here, based on this comment in benchmark code:
>>
>> '2011-06-26'
>>Switched to MemoryCodec for the primary-key 'id' field so that lookups
>> (either for PKLookup test or for deletions during reopen in the NRT test)
>> are fast, with no IO.  Also switched to NRTCachingDirectory for the NRT
>> test, so that small new segments are written only in RAM.
>>
>> I don't really understand the implications here beyond benchmarks, but it
>> does seem that perhaps some essential high-performing capability has been
>> lost?  Is there some equivalent thing remaining after MemoryCodec's removal
>> that can be used for primary keys?
>>
>> -Mike
>>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: benchmark drop for PrimaryKey

2018-08-23 Thread Adrien Grand
The commit that caused this slowdown might be
https://github.com/mikemccand/luceneutil/commit/1d8460f342f269c98047def9f9eb76213acae5d9
.

We indeed no longer have anything that performs as well, but I'm not
sure this is a big deal. I suspect that postings format had few users,
one reason being that it was not supported in terms of backward
compatibility (like any codec but the default one) and another being
that it used a lot of RAM. In a number of cases, we try to fold the
benefits of alternative codecs into the default codec. For instance, we
used to have a "pulsing" postings format that could record postings in the
terms dictionary in order to save one disk seek, and we ended up folding
this feature into the default postings format by only enabling it on terms
that have a document frequency of 1 and index_options=DOCS_ONLY, so that it
would always be used with primary keys. For the Memory postings format, this
didn't really make sense, as it managed to be so much faster by
loading much more information into RAM, which we don't want to do with the
default codec.
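
To make the "pulsing" idea above concrete, here is a toy model of a terms dictionary that stores the single doc ID inline when a term has a document frequency of 1 and the field only indexes docs, and otherwise points into a separate postings store. This is a minimal sketch under my own assumptions: the class and field names are hypothetical, and it does not reflect Lucene's actual on-disk format or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TermEntry:
    doc_freq: int
    inlined_doc_id: Optional[int] = None   # set only for singleton terms
    postings_offset: Optional[int] = None  # otherwise an index into the postings store


class ToyTermsDict:
    """Hypothetical terms dictionary; not Lucene's implementation."""

    def __init__(self, docs_only: bool):
        self.docs_only = docs_only
        self.entries: Dict[str, TermEntry] = {}
        # Stands in for a separate postings file that costs an extra seek.
        self.postings_store: List[List[int]] = []
        self.postings_reads = 0

    def add_term(self, term: str, doc_ids: List[int]) -> None:
        if self.docs_only and len(doc_ids) == 1:
            # Singleton term (e.g. a primary key): keep the doc ID in the entry.
            self.entries[term] = TermEntry(doc_freq=1, inlined_doc_id=doc_ids[0])
        else:
            self.postings_store.append(list(doc_ids))
            self.entries[term] = TermEntry(
                doc_freq=len(doc_ids),
                postings_offset=len(self.postings_store) - 1,
            )

    def lookup(self, term: str) -> List[int]:
        entry = self.entries.get(term)
        if entry is None:
            return []
        if entry.inlined_doc_id is not None:
            return [entry.inlined_doc_id]  # answered from the terms dictionary alone
        self.postings_reads += 1           # models the extra seek into postings
        return self.postings_store[entry.postings_offset]


terms = ToyTermsDict(docs_only=True)
terms.add_term("id-42", [7])          # primary-key style term
terms.add_term("lucene", [1, 2, 3])   # ordinary term
print(terms.lookup("id-42"), terms.postings_reads)   # [7] 0
print(terms.lookup("lucene"), terms.postings_reads)  # [1, 2, 3] 1
```

The point of the toy is that a primary-key lookup never touches the postings store, which is the saved seek, without keeping the whole dictionary in RAM the way the Memory format did.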

On Thu, Aug 23, 2018 at 10:40 PM Michael Sokolov wrote:

> I happened to stumble across this chart
> https://home.apache.org/~mikemccand/lucenebench/PKLookup.html showing a
> pretty drastic drop in this benchmark on 5/13. I looked at the commits
> between the previous run and this one and did some investigation, trying to
> do some git bisect to find the problem using benchmarks as a test, but it
> proved to be quite difficult due to a breaking change re: MemoryCodec that
> also required corresponding changes in the benchmark code.
>
> In the end, I think removing MemoryCodec is what caused the drop in perf
> here, based on this comment in benchmark code:
>
> '2011-06-26'
>Switched to MemoryCodec for the primary-key 'id' field so that lookups
> (either for PKLookup test or for deletions during reopen in the NRT test)
> are fast, with no IO.  Also switched to NRTCachingDirectory for the NRT
> test, so that small new segments are written only in RAM.
>
> I don't really understand the implications here beyond benchmarks, but it
> does seem that perhaps some essential high-performing capability has been
> lost?  Is there some equivalent thing remaining after MemoryCodec's removal
> that can be used for primary keys?
>
> -Mike
>