dennishuo commented on issue #5512: URL: https://github.com/apache/iceberg/issues/5512#issuecomment-1275083691
@asheeshgarg Right, unfortunately, as I understand it, mutations on an existing Iceberg table require catalog integration, so the low-level dataframe `load` approach only works for reads.

When I was using this myself, the missing version-hint error appeared to be just a "warning", and I could still use the dataframe successfully by ignoring the message.

Under the hood, `version-hint.text` (note that the spelling is indeed `.text`, not `.txt`: https://github.com/apache/iceberg/blob/dc5f5c38f871f119b79ba167f8c075fc825797b8/core/src/main/java/org/apache/iceberg/hadoop/Util.java#L44) is used by the default "HadoopCatalog" as a pointer to the "latest/official version" of table metadata. When the file is missing, Spark/Hadoop fall back to "listing" all the `*.metadata.json` files. You can see where the "warning" for the missing version-hint is caught, and how it falls through to attempting a listing, here: https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L325

As long as your `v*.metadata.json` filenames follow that naming convention, with version numbers that increase monotonically and fit in an int, the file-listing approach technically works **in the absence of concurrent attempted writes from other engines**. If you have tons (i.e., many thousands) of versioned metadata files in the metadata directories, this will be slow.

If you do need to worry about transactionality, with lots of writers trying to "commit" new metadata.json files, you at the very least need those writers to correctly populate `version-hint.text` so that it serves as the "atomic commit" of the correct write. Ideally, you'd use another Catalog implementation; one of the main reasons for having separate Catalog implementations is precisely to overcome the shortcomings of the default HadoopCatalog-based approach.

What system were you using to write the Iceberg tables in the first place?
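
To make the fallback described above concrete, here is a rough Python sketch of the idea: read `version-hint.text` if it is there, otherwise list the directory and take the highest `v<N>.metadata.json`. This is not Iceberg's actual code (the real logic lives in the Java `HadoopTableOperations` linked above); the function name `resolve_current_version` is made up for illustration, it assumes a local filesystem path, and it ignores compressed metadata file names.

```python
import os
import re
from typing import Optional

# Hypothetical, simplified pattern: only plain v<N>.metadata.json files.
VERSION_FILE = re.compile(r"^v(\d+)\.metadata\.json$")
HINT_FILE = "version-hint.text"


def resolve_current_version(metadata_dir: str) -> Optional[int]:
    """Return the current metadata version number, or None if none is found."""
    hint_path = os.path.join(metadata_dir, HINT_FILE)
    try:
        # Preferred path: the hint file holds the current version number.
        with open(hint_path, "r", encoding="utf-8") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable hint: treat it as a warning and fall through.
        pass

    # Fallback: list the directory and take the highest version number.
    # This costs one listing over every metadata file, and it is only safe
    # when no other writer is committing at the same time.
    versions = []
    for name in os.listdir(metadata_dir):
        match = VERSION_FILE.match(name)
        if match:
            versions.append(int(match.group(1)))
    return max(versions) if versions else None
```

The listing branch is where both caveats show up: it scans every file in the metadata directory, and nothing stops another writer from adding a newer `v<N>.metadata.json` between the listing and the read.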
