fbocse commented on a change in pull request #1465:
URL: https://github.com/apache/iceberg/pull/1465#discussion_r489704855



##########
File path: 
core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java
##########
@@ -289,19 +311,24 @@ int readVersionHint() {
       return Integer.parseInt(in.readLine().replace("\n", ""));
 
     } catch (Exception e) {
-      LOG.warn("Error reading version hint file {}", versionHintFile, e);
+      LOG.debug("Error reading version hint file {}", versionHintFile, e);

Review comment:
       @rdblue we've seen this happening with hadoop implementations that 
implement `create(overwrite=true)` by a server-side `delete` and `create` 
operation - so the API isn't consistent - for a split micro-second we get a 404 
on the version-hint.text file - we're covering for those with retries - however 
they do provide read-after-write guarantees on directory listing - so we've 
looked at solving this in a pretty similar fashion, but instead of a fallback 
mechanism we went with the approach that a table configuration (or version?) 
indicates the version resolver implementation, either file based or directory 
listing based, but I think this approach is actually better... 
   
   @jacques-n I don't think that this approach to fallback to directory listing 
in case of failing to resolve the version hint file can lead to silent 
"corruption" of the table. Can you think of such a scenario? I'm asking cause 
we were considering doing something similar so I'm interested in any thoughts 
that may question this approach.
   But to your point if there so much is a possibility that this approach can 
lead to table corruption then we should probably not provide this feature at 
all, let alone say support it by a "turn-this-feature-at-own-risk" approach - 
it will still corrupt the table :)
   
   @pvary Big difference in the listing directory vs file approach is the 
constant time in loading data from a file (constant time) vs using hadoop 
listing directory (probably not constant time).
   I would assume it would take longer to resolve the version and probably it 
would be advised that older versions are dropped so the list API provides a 
decent latency.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to