RussellSpitzer commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759616650


   I think one thing to note is that it is always possible for files not to 
match the Iceberg spec, which makes checking the validity of added files a 
bit difficult. I definitely think we should end up doing at least a footer 
read to get metadata, but I'm not sure we can/should require that the 
schemas match exactly. For example, if a file has additional columns not in 
the spec, should it be rejected?
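
   For reference, the footer read could look something like the rough sketch 
below. This uses parquet-mr directly; the file path and the FooterCheck class 
name are made up for illustration, and the validation policy is left open:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class FooterCheck {
  public static void main(String[] args) throws IOException {
    Path path = new Path("/warehouse/tbl/file-0001.parquet"); // hypothetical path
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      // The footer gives us the file schema and row count without scanning data pages.
      MessageType fileSchema = reader.getFooter().getFileMetaData().getSchema();
      long recordCount = reader.getRecordCount();
      System.out.println("schema: " + fileSchema + ", records: " + recordCount);
      // Comparing fileSchema against the Iceberg table schema would happen here,
      // including whatever policy we pick for extra columns not in the spec.
    }
  }
}
```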
   
   In our internal code we just make a mapping based on the Iceberg table 
(MappingUtil.create(icebergTable.schema())) and then apply that.
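
   Roughly, applying it looks like the following minimal sketch (the 
ApplyNameMapping class is just for illustration; icebergTable stands in for an 
already-loaded org.apache.iceberg.Table):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.mapping.MappingUtil;
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;

public class ApplyNameMapping {
  // Derive a name-based mapping from the table schema and store it as the
  // table's default name mapping, so data files written without Iceberg
  // field IDs can still be resolved by column name at read time.
  static void applyMapping(Table icebergTable) {
    NameMapping mapping = MappingUtil.create(icebergTable.schema());
    icebergTable.updateProperties()
        .set(TableProperties.DEFAULT_NAME_MAPPING, NameMappingParser.toJson(mapping))
        .commit();
  }
}
```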
   
   @karuppayya 
   I was actually discussing a similar problem with our current 
Migrate/Snapshot code. For example, if you create a file with a column "iD" 
and then create an external Hive table that refers to that column as "id", 
the table can be read by Spark, but when converting it to Iceberg we would 
look for "id" rather than "iD", so the column would not be mapped correctly.
   
   @jackye1995 
   We currently use SparkTableUtil's listPartition function. It has custom 
code for Parquet, ORC, and Avro, and it uses the directory structure to 
determine which files belong to which partitions. After that it's just a 
matter of building a DataFile for each file and committing it with 
table.newAppend().appendFile(...). Internally we also provide an option to 
import only a specific partition.
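
   The append step is roughly the sketch below (not our exact code; the 
ImportFile class is hypothetical, and the path, partition path, and metrics 
are placeholders you would fill in from the listing and the footer read):

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;

public class ImportFile {
  // Build a DataFile from metadata gathered during the listing, then
  // commit it to the table in a single append.
  static void addFile(Table table, String path, String partitionPath,
      long sizeInBytes, long recordCount) {
    DataFile dataFile = DataFiles.builder(table.spec())
        .withPath(path)                    // e.g. "/warehouse/tbl/part=1/f0.parquet"
        .withFormat(FileFormat.PARQUET)
        .withPartitionPath(partitionPath)  // e.g. "part=1", must match the spec
        .withFileSizeInBytes(sizeInBytes)
        .withRecordCount(recordCount)
        .build();

    table.newAppend()
        .appendFile(dataFile)
        .commit();
  }
}
```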
   
   I agree that this is a bit of a low-level method, but for systems where a 
rewrite is impossible or expensive I think it may be a good way to move files 
(or remake metadata for files).

