tyronecai commented on PR #15779:
URL: https://github.com/apache/lucene/pull/15779#issuecomment-3996863128

   All of the term data is already stored in `bytesStart` + `pool`; the `ids`
   array is simply the term ids from `bytesStart` rearranged by hash code.
   
   Therefore `ids` carries no extra information, and it is not needed as input
   for rehashing or for the subsequent compaction.
   
   So, in https://github.com/apache/lucene/pull/15772, I modified the
   compaction process to discard the `ids` information.
   
   Following the same idea, I reworked the `rehash` mechanism, which improves
   performance and reduces memory consumption. Please review it again,
   @mikemccand @dweiss
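
To illustrate the idea, here is a minimal, self-contained sketch. It is not the code from this PR and not Lucene's `BytesRefHash`: the class `IdsRebuildSketch`, its `terms` list (a stand-in for `bytesStart` + `pool`), the linear probing, and the 50% load factor are all assumptions made for the example. The only point it shows is that `rehash` can fill the new table by walking term ids `0..n-1` and never reading the old `ids` array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy hash set of byte strings. Hypothetical illustration only, not Lucene code. */
class IdsRebuildSketch {
  // Stand-in for bytesStart + pool: the term bytes, indexed by term id.
  private final List<byte[]> terms = new ArrayList<>();
  // Open-addressing table: each slot holds a term id, or -1 when empty.
  private int[] ids = newTable(4);

  private static int[] newTable(int size) {
    int[] table = new int[size];
    Arrays.fill(table, -1);
    return table;
  }

  /** Returns the new term id, or -(existingId + 1) if the term is already present. */
  int add(byte[] term) {
    int mask = ids.length - 1;
    int slot = Arrays.hashCode(term) & mask;
    while (ids[slot] != -1) {
      if (Arrays.equals(terms.get(ids[slot]), term)) {
        return -(ids[slot] + 1);
      }
      slot = (slot + 1) & mask; // linear probing
    }
    int id = terms.size();
    terms.add(term.clone());
    ids[slot] = id;
    if (terms.size() * 2 > ids.length) { // assumed load factor of 50%
      rehash(ids.length * 2);
    }
    return id;
  }

  /** Rebuilds the table from term ids 0..n-1; the old ids array is never read. */
  private void rehash(int newSize) {
    int[] newIds = newTable(newSize);
    int mask = newSize - 1;
    for (int id = 0; id < terms.size(); id++) {
      int slot = Arrays.hashCode(terms.get(id)) & mask;
      while (newIds[slot] != -1) {
        slot = (slot + 1) & mask;
      }
      newIds[slot] = id;
    }
    ids = newIds;
  }

  int size() {
    return terms.size();
  }

  public static void main(String[] args) {
    IdsRebuildSketch hash = new IdsRebuildSketch();
    for (String s : new String[] {"foo", "bar", "foo"}) {
      System.out.println(s + " -> " + hash.add(s.getBytes(StandardCharsets.UTF_8)));
    }
  }
}
```

Because the old table is not an input, the sketch could in principle release it before allocating the larger one; whether the actual change does this is not shown here.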
   
   
   @mikemccand
   I retested it on all the hardware I could find.
   I am still extracting terms from some application logs by splitting on
   newlines and spaces, which yields 2,282,163 unique terms. Results per
   device (with the rehash change vs. before it):
   - My own Apple M1 Pro laptop: 259.36 ms vs. 465.09 ms
   - My AMD Ryzen 7 9700X desktop: 166.57 ms vs. 335.15 ms
   - A server with an Arm Kunpeng 960 CPU: 731.54 ms vs. 1310.54 ms
   - A server with an older Intel CPU (Intel(R) Xeon(R) Silver 4110): 980.57 ms vs. 1611.37 ms
   
   The results were similar on every machine: `hash.add` inserted the
   2,282,163 terms in roughly half the time it took before the rehash code
   was optimized.
   
   ```
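   import java.util.List;
   import org.apache.lucene.util.BytesRef;
   import org.apache.lucene.util.BytesRefHash;

   // Driver: filename, round, and loadUniqueTermsFromFile are defined elsewhere (not shown).
   List<BytesRef> testData = loadUniqueTermsFromFile(filename);
   for (int i = 0; i < round; i++) {
     insert(testData);
   }

   // Inserts every term into a fresh BytesRefHash and reports the elapsed time.
   private static void insert(List<BytesRef> testData) {
     BytesRefHash hash = new BytesRefHash();
     long start = System.nanoTime();
     int uniqueCount = 0;
     for (BytesRef ref : testData) {
       // add() returns a non-negative id for a new term, negative if already present.
       int pos = hash.add(ref);
       if (pos >= 0) {
         uniqueCount += 1;
       }
     }

     long insertTimeNs = System.nanoTime() - start;
     System.out.printf(
         "Inserted %d strings in %.2f ms, uniqueCount %d%n",
         testData.size(), insertTimeNs / 1_000_000.0, uniqueCount);
   }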
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

