kbendick commented on issue #2068: URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759107248
I'm in support of this. I have several Airflow DAGs that write parquet files with Python that need to be added to tables, as @aokolnychyi mentioned:

> This command would basically go through all of the files under the provided directory, and add them all directly to the table. The files would not be read or scanned, so it would be up to the user to only add files with correct columns or to adjust the table column mapping to match.

I do somewhat worry about not having the ability to perform checks on the files to ensure that they have the correct columns and abide by the partition spec. In particular, I would worry about people importing data that is partitioned by, say, `date` into a table whose spec partitions on multiple or different columns.

I don't currently have a strong opinion about this either way, but it would seem beneficial to have something similar to Spark's `spark.sql.parquet.mergeSchema`: an option that reads the file footers and either updates the table schema or errors out if the files are incompatible (both behaviors are sketched at the end of this comment). Though I guess that's already offered via a full import, and most likely the use case for this command would be importing from a directory that's already partitioned by date etc.

I know parquet schema merging is a rather expensive option and not typically used, but on some clusters I've helped administer, enabling it by default was reasonable because users had historically run into issues after changing schemas in their Java / Python processes without updating the metastore. But I suppose this command would either run on a regular schedule, such as when writing parquet files from other tools (where the schema doesn't typically change very often), or be a one-off operation where hopefully the users know what they're doing. At the least, logging the appropriate warnings would be important.

As long as users can roll back, I don't have a strong opinion about supporting an option to verify schema compatibility. Anybody who is truly concerned about that should retain a long enough snapshot history to roll back in that case.

I'm also +1 on not deleting files in the same operation, as that does seem likely to cause somebody data loss.
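To make the `mergeSchema` comparison concrete, here is a minimal PySpark sketch of the behavior I'm referring to. The path and session setup are placeholders, not anything from this issue:

```python
# Minimal sketch of Spark's parquet schema merging: footers from all files
# under the path are read and unioned into a single schema. The path below
# is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-schema-demo").getOrCreate()

# Per-read option; the cluster-wide default is controlled by the
# spark.sql.parquet.mergeSchema configuration mentioned above.
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/data/staging/events/")
)
df.printSchema()  # union of the schemas found across the file footers
```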
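For the column check I'm worried about, something along these lines could run before files are added. It reads only the parquet footers (no row data), so it stays cheap; the expected column set and path are hypothetical, not part of any proposed API:

```python
# Hedged sketch of a pre-import check: compare each file's footer schema
# against the columns the table expects and refuse mismatches. The column
# set and path here are hypothetical.
import pyarrow.parquet as pq

EXPECTED_COLUMNS = {"id", "event_ts", "date"}  # assumed table columns

def check_footer(path: str) -> None:
    footer_schema = pq.read_schema(path)  # reads only file metadata
    file_columns = set(footer_schema.names)
    missing = EXPECTED_COLUMNS - file_columns
    extra = file_columns - EXPECTED_COLUMNS
    if missing or extra:
        raise ValueError(
            f"{path}: missing columns {sorted(missing)}, "
            f"unexpected columns {sorted(extra)}"
        )

check_footer("/data/staging/events/part-00000.parquet")
```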
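On the rollback point: a bad import should be recoverable as long as the prior snapshot is still retained. Below is a hedged sketch using Iceberg's snapshots metadata table and the `rollback_to_snapshot` stored procedure; the catalog, table, and snapshot id are hypothetical, and the procedure ships with current Iceberg Spark runtimes rather than necessarily being available at the time of this proposal:

```python
# Sketch of rolling a table back past a bad file import. The catalog name
# (demo), table (db.events), and snapshot id are all hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect snapshot history to find the snapshot before the import.
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM demo.db.events.snapshots ORDER BY committed_at"
).show(truncate=False)

# Roll back; the imported data files are no longer referenced by the table.
spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 1234567890)")
```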
