On Mon, 27 Jul 2020 at 19:24, Adrien Grand <jpou...@gmail.com> wrote:
>
> It's interesting that you're not seeing the same slowdown on the other field.
> How hard would it be for you to test the performance if you lowercase the
> names of the digest algorithms, i.e. "md5;[md5 value in hex]", etc.? The
> reason I'm asking is that the compression logic is optimized for lowercase
> ASCII, so removing uppercase letters would avoid the need to encode
> exceptions, which is one reason I think the slowdown might be smaller on
> your other field.

It took me a while to find the free time to build a new version of the
test that doesn't contain any of our own code, so that I could add the
new field without rewriting a large chunk of our system... but it looks
like the timings for lowercase prefixes are about the same as for
uppercase.

The particular test I ended up with is a pathological case, though: it
turned out to have 0 hits in the index despite searching for 29 million
digests.
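
In case it helps to picture the setup, the three fields were indexed
roughly along these lines (a sketch only: the StringField usage, the
"MD5;" prefix and the helper name are my reconstruction for
illustration, not our real indexing code; the field names match the
timings below):

    private static void addDigestDoc(IndexWriter writer, String md5Hex)
            throws IOException {
        Document doc = new Document();
        // Same digest indexed three ways, one field per test below.
        doc.add(new StringField("digest-upper", "MD5;" + md5Hex, Field.Store.NO));
        doc.add(new StringField("digest-lower", "md5;" + md5Hex, Field.Store.NO));
        doc.add(new StringField("digest-no-prefix", md5Hex, Field.Store.NO));
        writer.addDocument(doc);
    }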

-------------------------
Times for just reading the digest list
Count = 29459432, time = 1946 ms
Count = 29459432, time = 1752 ms
Count = 29459432, time = 1752 ms
-------------------------
Times for digest-upper
Count = 0, time = 40570 ms
Count = 0, time = 42574 ms
Count = 0, time = 40121 ms
-------------------------
Times for digest-lower
Count = 0, time = 40462 ms
Count = 0, time = 40319 ms
Count = 0, time = 39938 ms
-------------------------
Times for digest-no-prefix
Count = 0, time = 10936 ms
Count = 0, time = 10857 ms
Count = 0, time = 10628 ms
-------------------------

So it's about 4 times faster on the field with no term prefix.
The code for all three tests is shared:

    private static void timeDigest(Path md5sFile, IndexReader reader,
            String field, String termPrefix) throws IOException {
        try (BufferedReader md5sReader = Files.newBufferedReader(md5sFile)) {
            TermsEnum termsEnum = MultiTerms.getTerms(reader, field).iterator();
            PostingsEnum postingsEnum = null;

            long t0 = System.currentTimeMillis();
            int hitCount = 0;

            // One seekExact per digest read from the file; count every matching doc.
            while (true) {
                String md5 = md5sReader.readLine();
                if (md5 == null) {
                    break;
                }

                if (termsEnum.seekExact(new BytesRef(termPrefix + md5))) {
                    postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
                    while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        hitCount++;
                    }
                }
            }

            long t1 = System.currentTimeMillis();
            System.out.println("Count = " + hitCount + ", time = " + (t1 - t0) + " ms");
        }
    }
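
For completeness, each block of three timings above comes from a driver
along these lines (again a sketch: the run count, prefix strings and
helper name are placeholders rather than the exact harness):

    private static void timeAllFields(Path md5sFile, IndexReader reader)
            throws IOException {
        String[][] fieldsAndPrefixes = {
            {"digest-upper", "MD5;"},
            {"digest-lower", "md5;"},
            {"digest-no-prefix", ""},
        };
        for (String[] fp : fieldsAndPrefixes) {
            System.out.println("-------------------------");
            System.out.println("Times for " + fp[0]);
            // Three runs per field, matching the three lines per block above.
            for (int run = 0; run < 3; run++) {
                timeDigest(md5sFile, reader, fp[0], fp[1]);
            }
        }
    }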

> In case you're using an old JRE, you might want to try with JRE 13 or newer.
> Some of the logic in this lowercase ASCII compression only gets vectorized
> on JDK 13+.

Times for JDK 14.0.2:

-------------------------
Times for just reading the digest list
Count = 29459432, time = 2050 ms
Count = 29459432, time = 2156 ms
Count = 29459432, time = 1905 ms
-------------------------
Times for digest-upper
Count = 0, time = 24336 ms
Count = 0, time = 24236 ms
Count = 0, time = 23986 ms
-------------------------
Times for digest-lower
Count = 0, time = 24440 ms
Count = 0, time = 23960 ms
Count = 0, time = 23956 ms
-------------------------
Times for digest-no-prefix
Count = 0, time = 13177 ms
Count = 0, time = 13095 ms
Count = 0, time = 13081 ms
-------------------------


Almost a 2:1 speed boost for prefixed timings just by updating the JDK...

The non-prefixed timings seem to be roughly 20-25% slower than on JDK 8
(WTF?), but they still beat the prefixed timings comfortably.

TX
