dennishuo commented on issue #5512:
URL: https://github.com/apache/iceberg/issues/5512#issuecomment-1275083691

   @asheeshgarg Right, unfortunately, as I understand it, mutations to an 
existing Iceberg table would require catalog integration, so the low-level 
dataframe `load` approach is only suitable for reads.
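   
   For reference, a minimal read-only sketch (the table path and app name 
here are hypothetical; this assumes the `iceberg-spark-runtime` jar matching 
your Spark version is already on the classpath):
   
   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("iceberg-path-read").getOrCreate()

   # Catalog-less, path-based access: point `load` directly at the table's
   # root directory (hypothetical path). No catalog is consulted, so this
   # dataframe is read-only.
   df = spark.read.format("iceberg").load("hdfs://namenode/warehouse/db/my_table")
   df.show()
   ```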
   
   When I was using this myself, the missing version-hint error appeared to 
be just a "warning", and I was still able to use the dataframe successfully 
by ignoring the error message.
   
   Under the hood, the `version-hint.text` file (note that the spelling is 
indeed `.text`, not `.txt`: 
https://github.com/apache/iceberg/blob/dc5f5c38f871f119b79ba167f8c075fc825797b8/core/src/main/java/org/apache/iceberg/hadoop/Util.java#L44)
 is used by the default "HadoopCatalog" as a pointer to the "latest/official 
version" of the table metadata. When the file is missing, Spark/Hadoop falls 
back to "listing" all the `*.metadata.json` files. You can see where the 
"warning" for a missing version hint is caught, and how it falls through to 
attempting the listing, here: 
https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L325
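   
   In pseudocode, the happy-path resolution is roughly the following (a 
Python sketch against a local filesystem purely for illustration; the 
function name is hypothetical, and the real logic lives in 
`HadoopTableOperations` and goes through Hadoop's FileSystem API):
   
   ```python
   import os

   # version-hint.text holds a single integer: the latest committed
   # table version.
   def metadata_from_hint(table_location: str) -> str:
       metadata_dir = os.path.join(table_location, "metadata")
       with open(os.path.join(metadata_dir, "version-hint.text")) as f:
           version = int(f.read().strip())
       return os.path.join(metadata_dir, f"v{version}.metadata.json")

   # e.g. metadata_from_hint("/warehouse/db/my_table")
   #   -> "/warehouse/db/my_table/metadata/v12.metadata.json" when the hint is "12"
   ```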
   
   As long as your `v*.metadata.json` filenames follow that naming convention 
-- monotonically increasing version numbers that fit in an int -- the 
file-listing approach technically works **in the absence of concurrent write 
attempts from other engines**. If you have tons (i.e., many thousands) of 
versioned metadata files in the metadata directory, this listing will be slow.
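   
   The fallback amounts to parsing the version number out of each filename 
and taking the max, which is what makes it O(n) in the number of versions. A 
hypothetical sketch of that parsing:
   
   ```python
   import re

   METADATA_RE = re.compile(r"^v(\d+)\.metadata\.json$")

   # Hypothetical helper mirroring the listing fallback: parse the version
   # out of each v*.metadata.json filename and take the highest.
   def latest_metadata(filenames: list[str]) -> str:
       versions = [int(m.group(1)) for name in filenames
                   if (m := METADATA_RE.match(name))]
       if not versions:
           raise FileNotFoundError("no v*.metadata.json files found")
       return f"v{max(versions)}.metadata.json"

   print(latest_metadata(["v1.metadata.json", "v12.metadata.json", "v3.metadata.json"]))
   # -> v12.metadata.json
   ```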
   
   If you do need to worry about transactionality with lots of writers trying 
to "commit" new metadata.json files, you at the very least need those writers 
to correctly populate `version-hint.text` to serve as an "atomic commit" of the 
correct write.
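   
   One common pattern for that (a hedged sketch, not the actual Iceberg 
commit protocol) is to write the new hint to a temp file and rename it into 
place, relying on the filesystem's atomic rename:
   
   ```python
   import os
   import uuid

   # Sketch only: os.replace is atomic on POSIX, and HDFS rename is atomic
   # too. Note this alone does not arbitrate between two concurrent
   # committers -- which is exactly the gap a real catalog closes.
   def commit_version_hint(metadata_dir: str, new_version: int) -> None:
       tmp_path = os.path.join(metadata_dir, f".version-hint.{uuid.uuid4().hex}.tmp")
       with open(tmp_path, "w") as f:
           f.write(str(new_version))
       os.replace(tmp_path, os.path.join(metadata_dir, "version-hint.text"))
   ```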
   
   Ideally, you'd use another Catalog implementation -- one of the main 
reasons for having separate Catalog implementations is precisely to overcome 
the shortcomings of the default HadoopCatalog-based approach.
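   
   For example, wiring up a Hive-metastore-backed Iceberg catalog in Spark 
looks roughly like this (the catalog name `my_catalog` and the thrift URI 
are placeholders):
   
   ```python
   from pyspark.sql import SparkSession

   spark = (
       SparkSession.builder
       .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
       .config("spark.sql.catalog.my_catalog.type", "hive")
       .config("spark.sql.catalog.my_catalog.uri", "thrift://metastore-host:9083")
       .getOrCreate()
   )

   # With a real catalog, commits work too, e.g.:
   # spark.sql("INSERT INTO my_catalog.db.my_table SELECT * FROM staging")
   ```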
   
   What system were you using to write the Iceberg tables in the first place?

