flyrain opened a new pull request #3530:
URL: https://github.com/apache/iceberg/pull/3530
`CharSeqComparator` is overkill for checking whether two file paths are
different, and it performs poorly for that purpose. This PR introduces a new
method for this use case with better performance.
I ran perf tests comparing the new method with `CharSeqComparator`:
1. Comparing two identical paths: the new method performs about the same as
`CharSeqComparator.compare`.
2. Comparing two paths with different lengths: the new method is about 17
times faster.
3. Comparing two paths with the same length but different characters: the new
method is about 13 times faster.
FYI, here is the code I used for perf tests.
```
@Test
public void testCharSeqEqualsVsComparatorWithSamePath() {
  // the same CharSeq
  CharSequence s5 =
      "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
  CharSequence s6 =
      "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
  perfCompare(s5, s6);
}

@Test
public void testCharSeqEqualsVsComparatorWithDiffChar() {
  // different chars with the same length
  CharSequence s1 =
      "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
  CharSequence s2 =
      "s3:/bucket/db/table/data/partition/00000-1-uuid-00001.parquet";
  perfCompare(s1, s2);
}

@Test
public void testCharSeqEqualsVsComparatorWithDiffLength() {
  // different length
  CharSequence s3 =
      "s3:/bucket/db/table/data/partition/00000-0-uuid-00001.parquet";
  CharSequence s4 =
      "s3:/bucket/db/table/data/partition/00000-0-auuid-00001.parquet";
  perfCompare(s3, s4);
}

private void perfCompare(CharSequence str1, CharSequence str2) {
  int count = 1_000_000;
  long start = System.nanoTime();
  for (int i = 0; i < count; i++) {
    equals(str1, str2);
  }
  long stop = System.nanoTime();
  long duration1 = stop - start;
  System.out.println("Time: " + (stop - start) / 1000000.0 + " msec");
  Comparator<CharSequence> charSequenceComparator =
      Comparators.charSequences();
  start = System.nanoTime();
  for (int i = 0; i < count; i++) {
    charSequenceComparator.compare(str1, str2);
  }
  stop = System.nanoTime();
  System.out.println("Time: " + (stop - start) / 1000000.0 + " msec");
  System.out.println("Duration compare: " + (double) (stop - start) / duration1
      + " times");
}

boolean equals(CharSequence str1, CharSequence str2) {
  if (str1 == str2) {
    return true;
  }
  int count = str1.length();
  if (count != str2.length()) {
    return false;
  }
  return str1.toString().equals(str2.toString());
}
```
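For comparison, here is a minimal sketch of a variant that avoids the two `toString()` calls by comparing characters directly. It is not the code in this PR: the class name `CharSeqEquality` and the method name `contentEquals` are hypothetical, and scanning from the end is only a guess that paths within one table tend to share a long common prefix and differ near the file name.

```java
/** Hypothetical helper, for illustration only; not part of the PR. */
public final class CharSeqEquality {
  private CharSeqEquality() {
  }

  /** Returns true if both sequences hold the same chars in the same order. */
  public static boolean contentEquals(CharSequence s1, CharSequence s2) {
    if (s1 == s2) {
      return true; // same reference (also covers both being null)
    }
    if (s1 == null || s2 == null) {
      return false;
    }
    int length = s1.length();
    if (length != s2.length()) {
      return false; // different lengths can never be equal
    }
    // Assumption: paths usually differ near the end, so scan backwards.
    for (int i = length - 1; i >= 0; i--) {
      if (s1.charAt(i) != s2.charAt(i)) {
        return false;
      }
    }
    return true;
  }
}
```

Whether this beats `toString().equals(...)` depends on the concrete `CharSequence` types involved (for plain `String` inputs `toString()` returns the same object), so it would need the same kind of measurement as above before drawing any conclusion.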
cc @aokolnychyi @RussellSpitzer @szehon-ho @karuppayya