[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2779: Spark: Add duplicate file check in add_files

GitBox Thu, 22 Jul 2021 08:55:11 -0700


RussellSpitzer commented on a change in pull request #2779:
URL: https://github.com/apache/iceberg/pull/2779#discussion_r674941963




##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -95,11 +102,16 @@
  */
 public class SparkTableUtil {
 
+  private static final Logger LOG = LoggerFactory.getLogger(SparkTableUtil.class);
+
   private static final Joiner.MapJoiner MAP_JOINER = Joiner.on(",").withKeyValueSeparator("=");
 
   private static final PathFilter HIDDEN_PATH_FILTER =
       p -> !p.getName().startsWith("_") && !p.getName().startsWith(".");
 
+  private static final String duplicateFileMessage = "Duplicate data files will be added to this table: %s.  " +

Review comment:
       I think this should be reworded a bit:
   "Cannot complete import because data files to be imported already exist
within the target table. Iceberg is not designed to have multiple references to
the same file within the same table, so this type of import is disabled by
default. If you are sure this is what you would like to do, set
'$doAVariableReferenceHere' to true to force the import."
   
   This is just to make sure folks know that by doubly importing things they
are not necessarily doing something that will work or be safe in the long run.
For example, duplicate file entries will ... have odd effects on MergeInto :)
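
   A rough sketch of how the reworded message and an override flag might fit
together. This is plain Java with no Iceberg or Spark APIs. The option name
"force_duplicate_import", the class, and the helper methods are all
hypothetical stand-ins, not what the PR actually uses:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class DuplicateFileCheckSketch {

  // Hypothetical option name standing in for '$doAVariableReferenceHere'.
  static final String FORCE_OPTION = "force_duplicate_import";

  // Message wording follows the suggestion in the review comment.
  static final String DUPLICATE_FILE_MESSAGE =
      "Cannot complete import because data files to be imported already exist " +
      "within the target table: %s.  Iceberg is not designed to have multiple " +
      "references to the same file within the same table so this type of import " +
      "is disabled by default.  If you are sure this is what you would like to do " +
      "set '%s' to true to force the import";

  // Returns the incoming paths that are already present in the table.
  static List<String> findDuplicates(List<String> existing, List<String> incoming) {
    Set<String> existingSet = new HashSet<>(existing);
    return incoming.stream()
        .filter(existingSet::contains)
        .distinct()
        .collect(Collectors.toList());
  }

  // Fails the import when duplicates are found, unless the caller forces it.
  static void validateImport(List<String> existing, List<String> incoming, boolean force) {
    List<String> duplicates = findDuplicates(existing, incoming);
    if (!duplicates.isEmpty() && !force) {
      throw new IllegalStateException(
          String.format(DUPLICATE_FILE_MESSAGE, String.join(",", duplicates), FORCE_OPTION));
    }
  }

  public static void main(String[] args) {
    List<String> existing = Arrays.asList("s3://bucket/a.parquet", "s3://bucket/b.parquet");
    List<String> incoming = Arrays.asList("s3://bucket/b.parquet", "s3://bucket/c.parquet");
    try {
      validateImport(existing, incoming, false);
    } catch (IllegalStateException e) {
      // Prints the full error message, including the duplicate path and the option name.
      System.out.println(e.getMessage());
    }
  }
}
```

   Referencing the option through a constant keeps the message in sync if the
option is ever renamed, which is the point of the variable reference.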




