fbocse commented on a change in pull request #1465:
URL: https://github.com/apache/iceberg/pull/1465#discussion_r489704855
##########
File path:
core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java
##########
@@ -289,19 +311,24 @@ int readVersionHint() {
return Integer.parseInt(in.readLine().replace("\n", ""));
} catch (Exception e) {
- LOG.warn("Error reading version hint file {}", versionHintFile, e);
+ LOG.debug("Error reading version hint file {}", versionHintFile, e);
Review comment:
@rdblue we've seen this happening with hadoop implementations that
implement `create(overwrite=true)` by a server-side `delete` and `create`
operation - so for a split micro-second we get a 404 on the version-hint.text
file - so the API isn't consistent - however they do provide read-after-write
guarantees on directory listing - so we've looked at solving this in a pretty
similar fashion, but instead of a fallback mechanism we went with the approach
that a table configuration (or version?) indicates the version resolver
implementation, either file based or directory listing based, but I think this
approach is actually better...
@jacques-n I don't think that this approach to fallback to directory listing
in case of failing to resolve the version hint file can lead to silent
"corruption" of the table. Can you think of such a scenario? I'm asking cause
we were considering doing something similar so I'm interested in any thoughts
that may question this approach.
But to your point if there so much is a possibility that this approach can
lead to table corruption then we should probably not provide this feature at
all, let alone say support it by a "turn-this-feature-at-own-risk" approach -
it will still corrupt the table :)
@pvary Big difference in the listing directory vs file approach is the
constant time in loading data from a file (constant time) vs using hadoop
listing directory (probably not constant time).
I would assume it would take longer to resolve the version and probably it
would be advised that older versions are dropped so the list API provides a
decent latency.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]