On Mon, 27 Jul 2020 at 19:24, Adrien Grand <jpou...@gmail.com> wrote:
>
> It's interesting you're not seeing the same slowdown on the other field.
> How hard would it be for you to test what the performance is if you
> lowercase the name of the digest algorithms, ie. "md5;[md5 value in hex]",
> etc. The reason I'm asking is because the compression logic is optimized
> for lowercase ASCII so removing uppercase letters would help remove the
> need to encode exceptions, which is one reason I'm thinking why the
> slowdown might be less on your other field.

It took me a while to find some free time to write a new version of the
test that doesn't have our own code in it, so that I could add the new
field without rewriting a large chunk of our system... but it looks like
the timing for lowercase prefixes is about the same as for uppercase.

The particular test I ended up with is a pathological case, though, as
it turned out to have 0 hits in the index despite searching for 29
million digests.

-------------------------
Times for just reading the digest list

Count = 29459432, time = 1946 ms
Count = 29459432, time = 1752 ms
Count = 29459432, time = 1752 ms
-------------------------
Times for digest-upper

Count = 0, time = 40570 ms
Count = 0, time = 42574 ms
Count = 0, time = 40121 ms
-------------------------
Times for digest-lower

Count = 0, time = 40462 ms
Count = 0, time = 40319 ms
Count = 0, time = 39938 ms
-------------------------
Times for digest-no-prefix

Count = 0, time = 10936 ms
Count = 0, time = 10857 ms
Count = 0, time = 10628 ms
-------------------------

So about 4 times faster on the field with no term prefixes.

The code for all 3 tests is shared:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiTerms;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    private static void timeDigest(Path md5sFile, IndexReader reader,
                                   String field, String termPrefix)
            throws IOException {
        try (BufferedReader md5sReader = Files.newBufferedReader(md5sFile)) {
            // One TermsEnum over the whole index, reused for every lookup.
            TermsEnum termsEnum = MultiTerms.getTerms(reader, field).iterator();
            PostingsEnum postingsEnum = null;

            long t0 = System.currentTimeMillis();
            int hitCount = 0;
            while (true) {
                // The digest file has one hex MD5 per line.
                String md5 = md5sReader.readLine();
                if (md5 == null) {
                    break;
                }
                // Exact term lookup; termPrefix is empty for the no-prefix field.
                if (termsEnum.seekExact(new BytesRef(termPrefix + md5))) {
                    // Only doc IDs are needed, so ask for no freqs or positions.
                    postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
                    while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        hitCount++;
                    }
                }
            }
            long t1 = System.currentTimeMillis();
            System.out.println("Count = " + hitCount + ", time = " + (t1 - t0) + " ms");
        }
    }

> In case you're using an old JRE, you might want to try out with a JRE 13 or
> more recent.
> Some of the logic in this lowercase ASCII compression only
> gets vectorized on JDK13+.

Times for JDK 14.0.2:

-------------------------
Times for just reading the digest list

Count = 29459432, time = 2050 ms
Count = 29459432, time = 2156 ms
Count = 29459432, time = 1905 ms
-------------------------
Times for digest-upper

Count = 0, time = 24336 ms
Count = 0, time = 24236 ms
Count = 0, time = 23986 ms
-------------------------
Times for digest-lower

Count = 0, time = 24440 ms
Count = 0, time = 23960 ms
Count = 0, time = 23956 ms
-------------------------
Times for digest-no-prefix

Count = 0, time = 13177 ms
Count = 0, time = 13095 ms
Count = 0, time = 13081 ms
-------------------------

Roughly a 1.7x speed boost for the prefixed timings just by updating the
JDK... The non-prefixed timings seem to be about 20% slower than on JDK 8
(WTF?) but still beat the prefixed timings by a wide margin.

TX
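
P.S. For anyone who wants to double-check the ratios, here is a minimal
sketch that averages the three runs of each test and prints the ratios
between them. The class and method names (TimingSummary, mean) are made up
for illustration and are not part of the test code; the numbers are the
timings reported in this mail, taking the first set of results as the
JDK 8 run.

```java
import java.util.Locale;

public class TimingSummary {

    /** Arithmetic mean of the given run times, in milliseconds. */
    public static double mean(long... timesMs) {
        long sum = 0;
        for (long t : timesMs) {
            sum += t;
        }
        return (double) sum / timesMs.length;
    }

    public static void main(String[] args) {
        // Timings copied from the results in this mail (ms).
        double jdk8Upper     = mean(40570, 42574, 40121);
        double jdk8NoPrefix  = mean(10936, 10857, 10628);
        double jdk14Upper    = mean(24336, 24236, 23986);
        double jdk14NoPrefix = mean(13177, 13095, 13081);

        // Prefixed vs. non-prefixed field on the same JDK.
        print("JDK 8:  upper / no-prefix", jdk8Upper / jdk8NoPrefix);
        print("JDK 14: upper / no-prefix", jdk14Upper / jdk14NoPrefix);

        // Same field, old JDK vs. new JDK.
        print("upper:     JDK 8 / JDK 14", jdk8Upper / jdk14Upper);
        print("no-prefix: JDK 14 / JDK 8", jdk14NoPrefix / jdk8NoPrefix);
    }

    private static void print(String label, double ratio) {
        System.out.println(String.format(Locale.ROOT, "%s = %.2f", label, ratio));
    }
}
```

Averaging three runs is crude, but the run-to-run spread here is small
compared to the differences between the fields, so it should be good
enough for a sanity check.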