jackye1995 commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759207669
> One of the things I love about Iceberg is that it has a strong specification for metadata and tools to enforce it. Not having all sorts of random software writing raw files to disk in different ways is a feature.

+1 for the comment. Internally we have an operation called bootstrap that serves this purpose, and it is only ever done by a very small set of people who know exactly what is going on. The data quality guaranteed by Iceberg is a very valuable aspect, but it is always tempting to add features that are fast and useful by taking a shortcut.

On the other hand, I do see the value of this, especially for people who want to quickly try Iceberg with an existing set of files. There are also some legit use cases where a set of data files cannot be rewritten and has to be added in this way.

So if we think this is a valuable thing to add, it will definitely need a lot of explicit documentation and warnings. Maybe a post-operation verification can be performed by reading a certain amount of the imported data to verify that the table continues to work. If the verification fails, it is at least a simple command to roll back to the previous version without going too far.

@RussellSpitzer something I do not fully understand is how the procedure knows the partition of each file. Does it read a line of the file to figure it out? Does it assume a Hive layout and try to map the path name into partition tuples?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
