jackye1995 commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759207669


   > One of the things I love about Iceberg is that it has a strong 
specification for metadata and tools to enforce it. Not having all sorts of 
random software writing raw files to disk in different ways is a feature.
   
   +1 for the comment. Internally we have an operation called bootstrap that 
serves this purpose, and it is only ever run by a very small set of people 
who know exactly what is going on. The data quality that Iceberg guarantees is 
a very valuable aspect, but it is always tempting to add features like this 
that are fast and useful because they take a shortcut. 
   
   On the other hand, I do see the value of this, especially for people who 
want to quickly try Iceberg with an existing set of files. There are also some 
legitimate use cases where a set of data files cannot be rewritten and has to 
be added in this way. So if we think this is valuable to add, it will 
definitely need a lot of explicit documentation and warnings. Maybe a 
post-operation verification could read a certain amount of the imported data 
to check that the table continues to work. If the verification fails, 
it is at least a simple command to roll back to the previous version without 
going too far.
   
   @RussellSpitzer something I do not fully understand is how the procedure 
knows the partition of each file. Does it read a line of the file to figure 
it out? Does it assume a Hive layout and try to map the path name into 
partition tuples?
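
   To make the second option concrete, here is a minimal sketch of mapping a 
Hive-style path to a partition tuple. It only illustrates the idea and is not 
a description of how the procedure actually works. The function name and the 
example path are hypothetical, and it assumes that `key=value` directory names 
are the only source of partition information.

```python
from typing import Dict, Optional
from urllib.parse import unquote

# Hive writes this directory value when the partition column is NULL.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def hive_partition_from_path(path: str) -> Dict[str, Optional[str]]:
    """Collect key=value directory segments from a Hive-style file path.

    Hypothetical helper for illustration only: values stay as strings, so a
    real import would still have to convert them to the partition field types.
    """
    partition: Dict[str, Optional[str]] = {}
    # Skip the last segment (the file name) and look only at directories.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if not sep or not key:
            continue  # not a key=value directory
        if value == HIVE_DEFAULT_PARTITION:
            partition[key] = None
        else:
            # Hive percent-encodes special characters in directory names.
            partition[key] = unquote(value)
    return partition


example = hive_partition_from_path(
    "s3://bucket/db/tbl/year=2021/month=01/part-0000.parquet"
)
```

   A path-based approach like this never opens the data files, so it cannot 
catch a file whose contents disagree with the directory it sits in.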

