kbendick commented on issue #2068: URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759107248
I'm in support of this. I have several Airflow DAGs that write parquet files with Python that need to be added to tables, as @aokolnychyi mentioned:

> This command would basically go through all of the files under the provided directory, and add them all directly to the table. The files would not be read or scanned, so it would be up to the user to only add files with correct columns or to adjust the table column mapping to match.

I do somewhat worry about not having the ability to perform checks on the files to ensure that they have the correct columns and abide by the partition spec. In particular, I would worry about people importing data that is partitioned by, say, `date` into a table whose spec partitions on multiple or different columns.

I don't currently have a strong opinion about this either way, but it would seem beneficial to have something similar to Spark's `spark.sql.parquet.mergeSchema`: an option that reads the file footers and either updates the table schema or errors out if the files are incompatible (both behaviors are sketched at the end of this comment). Though I guess that's already offered via a full import, and most likely the use case for this command would be importing from a directory that's already partitioned by date etc.

I know parquet schema merging is a rather expensive option and not typically used, but on some clusters I've helped administer, enabling it by default was reasonable because users had historically run into issues after changing schemas in their Java / Python processes without updating the metastore. But I suppose this command would either run on a regular schedule, such as when writing parquet files from other tools (where the schema doesn't typically change very often), or be a one-off operation where hopefully the users know what they're doing. At the least, logging the appropriate warnings would be important.

As long as users can roll back, I don't have a strong opinion about supporting an option to verify schema compatibility. Anybody who is truly concerned about that should retain a long enough snapshot history to roll back in that case.

I'm also +1 on not deleting files in the same operation, as that does seem likely to cause somebody data loss.
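To make the `mergeSchema` comparison concrete, here is a minimal PySpark sketch of the behavior I'm referring to. The path and session setup are placeholders, not anything from this issue:

```python
# Minimal sketch of Spark's parquet schema merging: footers from all files
# under the path are read and unioned into a single schema. The path below
# is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-schema-demo").getOrCreate()

# Per-read option; the cluster-wide default is controlled by the
# spark.sql.parquet.mergeSchema configuration mentioned above.
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/data/staging/events/")
)
df.printSchema()  # union of the schemas found across the file footers
```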
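For the column check I'm worried about, something along these lines could run before files are added. It reads only the parquet footers (no row data), so it stays cheap; the expected column set and path are hypothetical, not part of any proposed API:

```python
# Hedged sketch of a pre-import check: compare each file's footer schema
# against the columns the table expects and refuse mismatches. The column
# set and path here are hypothetical.
import pyarrow.parquet as pq

EXPECTED_COLUMNS = {"id", "event_ts", "date"}  # assumed table columns

def check_footer(path: str) -> None:
    footer_schema = pq.read_schema(path)  # reads only file metadata
    file_columns = set(footer_schema.names)
    missing = EXPECTED_COLUMNS - file_columns
    extra = file_columns - EXPECTED_COLUMNS
    if missing or extra:
        raise ValueError(
            f"{path}: missing columns {sorted(missing)}, "
            f"unexpected columns {sorted(extra)}"
        )

check_footer("/data/staging/events/part-00000.parquet")
```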
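On the rollback point: a bad import should be recoverable as long as the prior snapshot is still retained. Below is a hedged sketch using Iceberg's snapshots metadata table and the `rollback_to_snapshot` stored procedure; the catalog, table, and snapshot id are hypothetical, and the procedure ships with current Iceberg Spark runtimes rather than necessarily being available at the time of this proposal:

```python
# Sketch of rolling a table back past a bad file import. The catalog name
# (demo), table (db.events), and snapshot id are all hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect snapshot history to find the snapshot before the import.
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM demo.db.events.snapshots ORDER BY committed_at"
).show(truncate=False)

# Roll back; the imported data files are no longer referenced by the table.
spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 1234567890)")
```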
