flyrain opened a new pull request #3530:
URL: https://github.com/apache/iceberg/pull/3530


   `CharSeqComparator` is overkill for checking whether two file paths are 
different, and it also performs poorly for that purpose. This PR introduces a 
new method for this use case with better performance.
   
   Perf tests were run comparing the new method against the 
`CharSeqComparator` class:
   1. Comparing two identical paths: the new method performs about the same as 
`CharSeqComparator.compare`.
   2. Comparing two paths with different lengths: the new method is about 17 
times faster.
   3. Comparing two paths with the same length but different characters: the 
new method is about 13 times faster.
   
   FYI, here is the code I used for perf tests.
   ```
     @Test
     public void testCharSeqEqualsVsComparatorWithSamePath() {
       // the same CharSeq
       CharSequence s5 = "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
       CharSequence s6 = "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
       perfCompare(s5, s6);
     }
   
     @Test
     public void testCharSeqEqualsVsComparatorWithDiffChar() {
       // different chars with the same length
       CharSequence s1 = "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
       CharSequence s2 = "s3:/bucket/db/table/data/partition/00000-1-uuid-00001.parquet";
       perfCompare(s1, s2);
     }
   
     @Test
     public void testCharSeqEqualsVsComparatorWithDiffLength() {
       // different length
       CharSequence s3 = "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
       CharSequence s4 = "s3:/bucket/db/table/data/partition/00000-0-auuid-00001.parquet";
       perfCompare(s3, s4);
     }
   
     private void perfCompare(CharSequence str1, CharSequence str2) {
       int count = 1_000_000;

       // time the new length-first equality check
       long start = System.nanoTime();
       for (int i = 0; i < count; i++) {
         equals(str1, str2);
       }
       long stop = System.nanoTime();
       long duration1 = stop - start;
       System.out.println("equals time: " + duration1 / 1000000.0 + " msec");

       // time the comparator on the same inputs
       Comparator<CharSequence> charSequenceComparator = Comparators.charSequences();
       start = System.nanoTime();
       for (int i = 0; i < count; i++) {
         charSequenceComparator.compare(str1, str2);
       }
       stop = System.nanoTime();
       long duration2 = stop - start;
       System.out.println("comparator time: " + duration2 / 1000000.0 + " msec");

       // a ratio above 1 means the comparator took longer than equals
       System.out.println("Duration compare: " + (double) duration2 / duration1 + " times");
     }
   
     boolean equals(CharSequence str1, CharSequence str2) {
       if(str1 == str2) {
         return true;
       }
   
       int count = str1.length();
       if (count != str2.length()) {
         return false;
       }
   
       return str1.toString().equals(str2.toString());
     }
   ```
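
The snippet above measures speed only. As a rough sanity check that a length-first equality test gives the same answer as "comparator returns 0" on the three cases, something like the following sketch could be used. The class name `PathEqualsCheck`, the `pathEquals` and `agree` helpers, and the `STAND_IN` comparator are all hypothetical names introduced for illustration; `STAND_IN` is a plain lexicographic comparator written here so the example is self-contained, and it is not the Iceberg `Comparators.charSequences()` implementation.

```java
import java.util.Comparator;

public class PathEqualsCheck {
  // Length-first equality, mirroring the equals(...) helper in the perf snippet.
  static boolean pathEquals(CharSequence a, CharSequence b) {
    if (a == b) {
      return true;
    }
    if (a.length() != b.length()) {
      return false; // different lengths can never be equal
    }
    return a.toString().equals(b.toString());
  }

  // Hypothetical stand-in ordering comparator (an assumption, not Iceberg code):
  // compares char by char, then falls back to comparing lengths.
  static final Comparator<CharSequence> STAND_IN = (a, b) -> {
    int len = Math.min(a.length(), b.length());
    for (int i = 0; i < len; i++) {
      int cmp = Character.compare(a.charAt(i), b.charAt(i));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length(), b.length());
  };

  // True when both approaches give the same equal / not-equal answer.
  static boolean agree(CharSequence a, CharSequence b) {
    return pathEquals(a, b) == (STAND_IN.compare(a, b) == 0);
  }

  public static void main(String[] args) {
    String base = "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
    String diffChar = "s3:/bucket/db/table/data/partition/00000-1-uuid-00001.parquet";
    String diffLen = "s3:/bucket/db/table/data/partition/00000-0-auuid-00001.parquet";

    // A StringBuilder copy avoids the identity shortcut for the "same path" case.
    System.out.println("same path agrees:   " + agree(base, new StringBuilder(base)));
    System.out.println("diff char agrees:   " + agree(base, diffChar));
    System.out.println("diff length agrees: " + agree(base, diffLen));
  }
}
```

Using a `StringBuilder` copy for the identical-path case is a deliberate choice in this sketch: two identical string literals are interned to the same object in Java, so comparing them would only exercise the `a == b` shortcut.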
   
   cc @aokolnychyi @RussellSpitzer @szehon-ho @karuppayya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


