RussellSpitzer commented on issue #2068: URL: https://github.com/apache/iceberg/issues/2068#issuecomment-759616650
One thing to note is that it is always possible for files not to match the Iceberg spec, which makes checking the validity of added files a bit difficult. I definitely think we should end up doing at least a footer read to get metadata, but I'm not sure we can or should require that the schemas match exactly. For example, if a file has additional columns not in the spec, should it be rejected? In our internal code we just make a mapping based on the Iceberg table (MappingUtil.create(icebergTable.schema)) and then apply that.

@karuppayya was actually discussing a similar problem in our current Migrate/Snapshot code. For example, if you create a file with a column "iD" and create an external Hive table referring to that column as "id", that table can be read by Spark, but when converting such a table to Iceberg we would look for "id" and not "iD", so the column would not be mapped correctly.

@jackye1995 We currently use SparkTableUtil's listPartition function. It has custom code for Parquet, ORC, and Avro and uses the directory structure to determine which files belong to which partitions. After that it's just a getFile followed by table.append.appendFile. Internally we also provide an option for specifying a specific partition to import.

I agree that this is a bit of a low-level method, but for systems where a rewrite is impossible or expensive I think it may be a good way to move files (or remake metadata for files).
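To make the "iD" versus "id" case concrete, here is a minimal, self-contained sketch of name-based column matching. It is not Iceberg's MappingUtil or the Migrate/Snapshot code; `map_columns` and its `case_sensitive` flag are hypothetical names made up for illustration. It only shows why an exact-name lookup misses the column while a case-insensitive lookup finds it, and that extra file columns can simply be ignored rather than rejected.

```python
from typing import Dict, List, Optional


def map_columns(
    table_columns: List[str],
    file_columns: List[str],
    case_sensitive: bool = True,
) -> Dict[str, Optional[str]]:
    """Map each table column name to the matching column name in a data file.

    Hypothetical helper for illustration only. Table columns with no match
    map to None. File columns that the table does not know about are ignored.
    """
    if case_sensitive:
        # Exact-name lookup: "id" will not find "iD".
        lookup = {name: name for name in file_columns}
        return {col: lookup.get(col) for col in table_columns}

    # Case-insensitive lookup: fold names to lower case before matching.
    lookup = {}
    for name in file_columns:
        folded = name.lower()
        if folded in lookup:
            # Two file columns differ only by case, so the match is ambiguous.
            raise ValueError(
                f"ambiguous columns {lookup[folded]!r} and {name!r}"
            )
        lookup[folded] = name
    return {col: lookup.get(col.lower()) for col in table_columns}


table_columns = ["id", "data"]
file_columns = ["iD", "data", "extra"]

print(map_columns(table_columns, file_columns))
# {'id': None, 'data': 'data'}
print(map_columns(table_columns, file_columns, case_sensitive=False))
# {'id': 'iD', 'data': 'data'}
```

Whether an import should fall back to case-insensitive matching, or fail when a table column has no match, is a policy choice this sketch leaves open.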
