RussellSpitzer opened a new pull request, #13947:
URL: https://github.com/apache/iceberg/pull/13947

   One of our slowest test suites is TestRewrite*, and each version we add to 
test increases that burden. So I looked for some quick changes we could make 
to reduce it.
   
   I noticed that most of our slowdown comes from collecting the data to check 
for integrity. We do this with a Spark-side sort, which is extremely 
expensive because of the number of tasks we are reading. We have a lot of tasks 
because we forcibly change the split size of the table to trigger 
different file combinations in the suite, but this behavior isn't important when 
checking for integrity. I ran the following experiments.
   
   ----
   ## Original Performance
   <img width="661" height="144" alt="Pasted Graphic" src="https://github.com/user-attachments/assets/6b16c5e9-f0ec-4fac-9bd2-e6e870fbfc78" />
   
   
   ## Using Single Partition Sort

   ```java
   spark.read().format("iceberg").load(tableLocation).coalesce(1).sort("c1", "c2", "c3").collectAsList();
   ```
   <img width="661" height="144" alt="Pasted Graphic 1" src="https://github.com/user-attachments/assets/fda70f34-dcb5-43e1-b627-e1e35b8e88b5" />
   
   ## Using a local sort instead of Spark sort
   ```java
   List<Row> rows = spark.read().format("iceberg").load(tableLocation).collectAsList();
   rows.sort(Comparator.comparingInt((Row r) -> r.getAs("c1"))
       .thenComparing(r -> r.getAs("c2"))
       .thenComparing(r -> r.getAs("c3")));
   return rowsToJava(rows);
   ```
   
   <img width="661" height="144" alt="Pasted Graphic 2" src="https://github.com/user-attachments/assets/23e776cd-f7e5-470a-b669-acd5509f2e20" />
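
   For readers who want to try the local-sort idea without a Spark session, here is a minimal sketch. It assumes rows have already been collected as `Object[]` values with an `int` in position 0 and strings in positions 1 and 2 (standing in for `c1`, `c2`, `c3`). The class and method names are hypothetical and not part of this PR.

   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.List;

   // Hypothetical helper, not from the PR: sorts already-collected rows on the driver.
   final class LocalSortSketch {

     // Assumes each row is {Integer c1, String c2, String c3}.
     static List<Object[]> sortLocally(List<Object[]> collected) {
       // Copy first so the caller's list is left untouched.
       List<Object[]> rows = new ArrayList<>(collected);
       rows.sort(
           Comparator.comparingInt((Object[] r) -> (Integer) r[0])
               .thenComparing(r -> (String) r[1])
               .thenComparing(r -> (String) r[2]));
       return rows;
     }
   }
   ```

   The sort runs in a single JVM thread over data that is already in memory, so no Spark shuffle or extra stage is scheduled for the ordering step.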
   
   
   ## Minimizing splits used to read
   ```java
   protected List<Object[]> currentData() {
     return rowsToJava(
       spark
         .read()
         .option(SparkReadOptions.SPLIT_SIZE, 1024 * 1024 * 32)
         .option(SparkReadOptions.FILE_OPEN_COST, 0)
         .format("iceberg").load(tableLocation)
         .coalesce(1)
         .sort("c1", "c2", "c3").collectAsList()
     );
   }
   ```
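
   To illustrate why these two options reduce the task count, here is a rough back-of-the-envelope model. It is a simplification and not Iceberg's actual planner: it assumes each file is cut into chunks of at most the split size, each chunk weighs at least the open cost, and chunks are packed greedily in order. The class name, method name, and the example sizes are all hypothetical.

   ```java
   // Hypothetical estimator, not Iceberg code: a simplified model of split packing.
   final class SplitEstimateSketch {

     static int estimateSplits(long[] fileSizes, long splitSize, long openCost) {
       int bins = 0;
       long current = 0;      // weight already placed in the open bin
       boolean open = false;  // whether any bin has been opened yet
       for (long size : fileSizes) {
         long remaining = size; // empty files contribute nothing in this model
         while (remaining > 0) {
           long chunk = Math.min(remaining, splitSize);
           remaining -= chunk;
           // A chunk never weighs less than the open cost.
           long weight = Math.max(chunk, openCost);
           if (!open || current + weight > splitSize) {
             bins++;
             current = 0;
             open = true;
           }
           current += weight;
         }
       }
       return bins;
     }
   }
   ```

   Under this model, 100 files of 1 KB each with a 32 MB split size need 13 splits when the open cost is 4 MB, but only 1 split when the open cost is 0, because the tiny files no longer each claim a fixed share of the split.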
   
   
   ------
   
   ## Final Suite Timings
   ### Before
   <img width="636" height="66" alt="Pasted Graphic 5" src="https://github.com/user-attachments/assets/25138b12-2143-4e8d-b928-9d55e155603a" />
   ### After
   <img width="636" height="66" alt="Pasted Graphic 4" src="https://github.com/user-attachments/assets/34cfcd0a-3ec4-40a9-a28a-dceb4399b73e" />
   
   
   ### Before
   <img width="636" height="66" alt="Pasted Graphic 7" src="https://github.com/user-attachments/assets/27e7bc3d-bf8f-453e-8056-197f4a1c4b3d" />
   
   ### After
   <img width="636" height="66" alt="Pasted Graphic 6" src="https://github.com/user-attachments/assets/ca9b3a4b-3e95-4f03-9395-a94b60bc7fe3" />
   
   ------
   
   The improvements to the delete suite aren't as good, but I figured I'd make the 
same changes there as well.
   
   



