sunchao commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-902435389


   > Can we add ctime or mtime of the file to the PartitionedFile and use this
information for the check?
   
   Yeah, file path + modification time seems like a good way to validate the cache.
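   For example, here is a minimal sketch of using (path, length, modification
   time) as the cache key; `FileMetaKey` and `keyFor` are illustrative names,
   not existing Spark classes:

   ```scala
   import org.apache.hadoop.fs.FileStatus

   object FileMetaCacheKey {
     // Hypothetical cache key: an entry cached under (path, length, modificationTime)
     // is naturally invalidated when the file is overwritten, because the new
     // FileStatus yields a different key.
     case class FileMetaKey(path: String, length: Long, modificationTime: Long)

     def keyFor(status: FileStatus): FileMetaKey =
       FileMetaKey(status.getPath.toString, status.getLen, status.getModificationTime)
   }
   ```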
   
   > Similarly, how do we ensure that the FileStatus cache (SharedInMemoryCache)
is correct when the user overwrites the file and does not send the refreshTable
command to the current SparkApp? There is also the problem of files keeping the
same name. I feel that solving this consistency problem may become an
independent topic :)
   
   I think they are not quite the same: the FileStatus cache operates on a
per-table basis, so in the worst case you only get stale data. Here, however,
the cache is per-file, so one could end up with some files in a partition
cached while the rest are not. In addition, at least in Parquet, metadata such
as row counts, stats, and bloom filters is used to filter row groups or data
pages. Task failures or correctness issues could happen if we apply stale
metadata to a newer file, or if the metadata is used in aggregation pushdown
(ongoing work).
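   To make that concrete, here is a small sketch using parquet-hadoop directly
   (e.g. pasted into spark-shell; the path is made up) that prints the
   per-row-group metadata a reader relies on for filtering. If this footer came
   from a cache populated before the file was overwritten, the row counts and
   offsets no longer describe the bytes on disk:

   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   import org.apache.parquet.hadoop.ParquetFileReader
   import org.apache.parquet.hadoop.util.HadoopInputFile

   val file = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), new Configuration())
   val reader = ParquetFileReader.open(file)
   try {
     // Row-group metadata drives row-group / data-page filtering; stale values
     // here can skip live data or point past the end of the rewritten file.
     reader.getFooter.getBlocks.forEach { block =>
       println(s"rows=${block.getRowCount} start=${block.getStartingPos} bytes=${block.getTotalByteSize}")
     }
   } finally {
     reader.close()
   }
   ```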
   
   > Everything depends on the data lifecycle. For safety, we can control it by
reducing spark.sql.fileMetaCache.ttlSinceLastAccessSec to 10 secs or less, which
is still effective at avoiding the duplicate footer lookups within a single task.
   
   I understand you want to avoid the duplicate footer lookup. In Parquet at
least, we can just pass the footer from either `ParquetFileFormat` or
`ParquetPartitionReaderFactory` to `SpecificParquetRecordReaderBase` for reuse,
which I think is much simpler than using a cache.
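   As a rough sketch of that "read once, pass along" idea (the
   `PrefetchedFooterRecordReader` name and its constructor are placeholders,
   not the current `SpecificParquetRecordReaderBase` API):

   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   import org.apache.parquet.hadoop.ParquetFileReader
   import org.apache.parquet.hadoop.metadata.ParquetMetadata
   import org.apache.parquet.hadoop.util.HadoopInputFile

   // Read the footer once on the reader-factory side...
   def readFooterOnce(path: String, conf: Configuration): ParquetMetadata = {
     val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))
     try reader.getFooter finally reader.close()
   }

   // ...and hand it to the record reader so it does not re-read the footer.
   // A real change would thread the footer through ParquetFileFormat /
   // ParquetPartitionReaderFactory into SpecificParquetRecordReaderBase.
   class PrefetchedFooterRecordReader(footer: ParquetMetadata) {
     def schema = footer.getFileMetaData.getSchema
   }
   ```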
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


