[GitHub] [spark] LuciferYang edited a comment on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

GitBox Thu, 19 Aug 2021 19:31:15 -0700


LuciferYang edited a comment on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-902383334



   > In Hive it's common that the same file name (e.g., 000000_0) gets used 
when doing insert overwrite. Even if we check file size and other stuff, it 
can't completely prevent us from hitting a stale cache.
   
   Can we add the file's `ctime` or `mtime` to `PartitionedFile` and use it 
to check whether a cached entry is still valid? 
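
To make the idea concrete, here is a minimal, language-neutral sketch (in Python, not Spark code) of a metadata cache whose key includes the modification time and length in addition to the path. `FileKey`, `FooterCache` and `read_footer` are hypothetical names invented for this illustration; they are not part of `PartitionedFile` or of this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FileKey:
    """Hypothetical cache key: path plus attributes that change on overwrite."""
    path: str
    mtime_ms: int  # modification time as reported by the file system
    length: int    # file size in bytes


class FooterCache:
    """Toy in-memory cache of file footers (metadata), keyed by FileKey."""

    def __init__(self, read_footer: Callable[[str], str]) -> None:
        self._read_footer = read_footer          # assumed loader: path -> footer
        self._entries: Dict[FileKey, str] = {}
        self.loads = 0                           # number of real footer reads

    def get(self, key: FileKey) -> str:
        # A rewritten file with the same name gets a different key as long as
        # its mtime or length differs, so the old entry is never returned.
        if key not in self._entries:
            self.loads += 1
            self._entries[key] = self._read_footer(key.path)
        return self._entries[key]
```

With such a key, an `insert overwrite` that reuses the name `000000_0` misses the cache as long as the new file's `mtime` or length differs. Two caveats: an overwrite that lands on the same timestamp with the same size would still be indistinguishable, and old entries stay in the map until evicted, so a real cache would also need a size or time bound.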
   
   Similarly, how do we ensure that the `FileStatus` cache 
(`SharedInMemoryCache`) stays correct when the user overwrites a file without 
sending a `refreshTable` command to the current `SparkApp`? The same-file-name 
problem exists there as well. I feel that solving this consistency problem 
may become an independent topic :)
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



