tyronecai commented on PR #15779: URL: https://github.com/apache/lucene/pull/15779#issuecomment-3996863128
Since all the term information is already stored in `bytesStart` + `pool`, `ids` is simply a rearrangement of the ids in `bytesStart`, ordered by their hash codes. It is therefore not needed as input for rehashing or for the subsequent compaction. So, in https://github.com/apache/lucene/pull/15772, I modified the compaction process to discard the `ids` information. Following the same idea, I reworked the `rehash` mechanism, improving performance and reducing memory consumption.

Please review it again @mikemccand @dweiss

I retested it on all the hardware I could find. As before, I extracted terms from some application logs by splitting on newlines and spaces, resulting in 2,282,163 unique terms.

Test devices (timings are after vs. before the change):

- My own Apple M1 Pro laptop (259.36 ms vs. 465.09 ms)
- My AMD Ryzen 7 9700X desktop, with similar results (166.57 ms vs. 335.15 ms)
- A server equipped with an Arm Kunpeng 960 CPU (731.54 ms vs. 1310.54 ms)
- A server equipped with an older Intel CPU, Intel(R) Xeon(R) Silver 4110 (980.57 ms vs. 1611.37 ms)

The results were similar on every machine: `hash.add` inserted the 2,282,163 terms in roughly 50-60% of the time it took before optimizing the rehash code.

```
List<BytesRef> testData = loadUniqueTermsFromFile(filename);
for (int i = 0; i < round; i++) {
  insert(testData);
}

private static void insert(List<BytesRef> testData) {
  BytesRefHash hash = new BytesRefHash();
  long start = System.nanoTime();
  int uniqueCount = 0;
  for (BytesRef ref : testData) {
    // add() returns a non-negative id only when the term was not present yet
    int pos = hash.add(ref);
    if (pos >= 0) {
      uniqueCount += 1;
    }
  }
  long insertTimeNs = System.nanoTime() - start;
  System.out.printf(
      "Inserted %d strings in %.2f ms, uniqueCount %d%n",
      testData.size(), insertTimeNs / 1_000_000.0, uniqueCount);
}
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
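
To illustrate the point in the comment above, that `ids` can be rebuilt from the stored terms alone, here is a minimal toy sketch. It is not Lucene code and not the code in the PR: `TinyTermHash`, its `terms` list (a stand-in for `bytesStart` + `pool`), the linear probing, and the load factor are all assumptions made for illustration. The only behavior borrowed from `BytesRefHash` is the return convention of `add` (a non-negative id for a new term, `-(id + 1)` for an existing one).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical, not Lucene): `terms` plays the role of
// bytesStart + pool, `ids` is the open-addressing table of term ids.
final class TinyTermHash {
  private final List<byte[]> terms = new ArrayList<>(); // term id -> bytes
  private int[] ids;                                    // slot -> term id, or -1
  private int mask;

  TinyTermHash(int capacity) { // capacity must be a power of two
    ids = new int[capacity];
    Arrays.fill(ids, -1);
    mask = capacity - 1;
  }

  /** Returns the new term id, or -(existingId + 1) if the term is already present. */
  int add(byte[] term) {
    int slot = findSlot(term);
    if (ids[slot] != -1) {
      return -(ids[slot] + 1);
    }
    int id = terms.size();
    terms.add(term.clone());
    ids[slot] = id;
    if (terms.size() * 2 > ids.length) { // keep the table at most half full
      rehash(ids.length * 2);
    }
    return id;
  }

  /** Returns the id of the term, or -1 if it is absent. */
  int find(byte[] term) {
    return ids[findSlot(term)];
  }

  private int findSlot(byte[] term) {
    int slot = Arrays.hashCode(term) & mask;
    while (ids[slot] != -1 && !Arrays.equals(terms.get(ids[slot]), term)) {
      slot = (slot + 1) & mask; // linear probing
    }
    return slot;
  }

  private void rehash(int newCapacity) {
    // The old table is never read here, so it can be released
    // before the larger one is allocated.
    ids = null;
    int[] newIds = new int[newCapacity];
    Arrays.fill(newIds, -1);
    int newMask = newCapacity - 1;
    // Walk the term ids in order and re-derive every slot from the term bytes.
    for (int id = 0; id < terms.size(); id++) {
      int slot = Arrays.hashCode(terms.get(id)) & newMask;
      while (newIds[slot] != -1) {
        slot = (slot + 1) & newMask;
      }
      newIds[slot] = id;
    }
    ids = newIds;
    mask = newMask;
  }

  public static void main(String[] args) {
    TinyTermHash hash = new TinyTermHash(4);
    for (String s : new String[] {"foo", "bar", "foo", "baz", "qux"}) {
      int pos = hash.add(s.getBytes(StandardCharsets.UTF_8));
      System.out.println(s + " -> " + pos);
    }
  }
}
```

In this sketch `rehash` iterates over term ids sequentially instead of scanning the old table, which is one plausible reading of why the old `ids` array contributes nothing during a rehash; whether the PR does exactly this is not stated in the comment.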
