aokolnychyi commented on a change in pull request #2452:
URL: https://github.com/apache/iceberg/pull/2452#discussion_r614593663



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java
##########
@@ -173,7 +173,7 @@ protected Table newStaticTable(TableMetadata metadata, FileIO io) {
 
   protected Dataset<Row> buildOtherMetadataFileDF(TableOperations ops) {
     List<String> otherMetadataFiles = getOtherMetadataFilePaths(ops);
-    return spark.createDataset(otherMetadataFiles, Encoders.STRING()).toDF("file_path");
+    return spark.createDataset(otherMetadataFiles, Encoders.STRING()).toDF("file_path").distinct();

Review comment:
       Instead of doing a shuffle here, I think we should refine the approach we use to build the list of JSON files. What happens now, I believe, is that we take the previous 100 version files referenced in every version file and add all of them to the list, even though each new version file contributes only one new entry. Will a table with 2000 snapshots and 100 previous metadata files per version produce a list with 200,000 elements?
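       To make the suggestion concrete, here is a minimal sketch of a driver-side alternative, assuming getOtherMetadataFilePaths(ops) still returns the raw (possibly duplicated) list of paths and that the surrounding class exposes a `spark` session as in the snippet above; this is an illustrative sketch, not the actual Iceberg implementation:

       import java.util.ArrayList;
       import java.util.LinkedHashSet;
       import java.util.List;
       import java.util.Set;
       import org.apache.iceberg.TableOperations;
       import org.apache.spark.sql.Dataset;
       import org.apache.spark.sql.Encoders;
       import org.apache.spark.sql.Row;

       // Sketch: deduplicate on the driver before creating the Dataset, so no
       // .distinct() shuffle is needed on the executors. The metadata file paths
       // are few and already sit on the driver, so a Set is cheap here.
       protected Dataset<Row> buildOtherMetadataFileDF(TableOperations ops) {
         List<String> otherMetadataFiles = getOtherMetadataFilePaths(ops);

         // LinkedHashSet drops duplicates while preserving insertion order.
         Set<String> uniquePaths = new LinkedHashSet<>(otherMetadataFiles);

         return spark
             .createDataset(new ArrayList<>(uniquePaths), Encoders.STRING())
             .toDF("file_path");
       }

       Since deduplication happens before Spark ever sees the data, the resulting DataFrame is already distinct and no extra stage is added to the action's plan.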




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
