sunchao commented on pull request #33748: URL: https://github.com/apache/spark/pull/33748#issuecomment-902435389
> Can we add ctime or mtime of the file to the PartitionedFile and use this information for check?

Yea, file path + modification time seems like a good way to validate the cache.

> Similarly, how do we ensure that the FileStatus cache (SharedInMemoryCache) is correct when the user overwrites the file and does not send the refreshTable command to the current SparkApp? There is also the problem of the same name of the file. I feel that solving this consistency problem may become an independent topic :)

I think they are not quite the same: the FileStatus cache operates on a per-table basis, so in the worst case you only get stale data. Here, however, the cache is per-file, so one could end up with some files in a partition being cached while the rest are not. In addition, at least in Parquet, metadata (row count, stats, bloom filter, etc.) is used to filter row groups or data pages. Task failures or correctness issues could happen if we apply stale metadata to a newer file, or if the metadata is used in aggregation pushdown (ongoing work).

> Everything depends on the data lifecycle. For safety, we can control it by reducing spark.sql.fileMetaCache.ttlSinceLastAccessSec to 10 secs or less, which is still effective in avoiding the duplicate footer lookups within a single task.

I understand you want to avoid the duplicate footer lookup. In Parquet at least, we can simply pass the footer from either `ParquetFileFormat` or `ParquetPartitionReaderFactory` to `SpecificParquetRecordReaderBase` for reuse, which I think is much simpler than using a cache.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
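The path-plus-modification-time validation discussed above can be illustrated with a minimal, self-contained Python sketch (not Spark code; the `FileMetaCache` class and `loader` callback are hypothetical stand-ins). The cache key includes the file's mtime, so an overwritten file simply misses the cache rather than returning stale metadata:

```python
import os


class FileMetaCache:
    """Illustrative metadata cache keyed by (path, modification time).

    If a file is overwritten in place, its mtime changes, so the old
    entry is never matched again -- the stale-metadata hazard described
    in the comment becomes a cache miss instead of a wrong answer.
    """

    def __init__(self):
        self._cache = {}

    def get_or_load(self, path, loader):
        # Include mtime in the key so stale entries are treated as misses.
        key = (path, os.path.getmtime(path))
        if key not in self._cache:
            self._cache[key] = loader(path)
        return self._cache[key]
```

Old entries for overwritten files linger until evicted, so a real implementation would still want size/TTL-based eviction; the point here is only the validation on lookup.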
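The footer-reuse alternative mentioned above can also be sketched in plain Python (again not Spark code; `read_footer`, `RecordReader`, and `PartitionReaderFactory` are hypothetical stand-ins for the Parquet footer parse, `SpecificParquetRecordReaderBase`, and `ParquetPartitionReaderFactory`). The factory reads the footer once and hands the same object to the reader, so no cache is needed at all:

```python
def read_footer(path, counter):
    """Stand-in for parsing a Parquet footer; counts reads so reuse is visible."""
    counter["reads"] += 1
    return {"path": path, "row_groups": []}


class RecordReader:
    """Accepts a pre-read footer instead of always re-reading it itself."""

    def __init__(self, path, prebuilt_footer=None, counter=None):
        # Reuse the footer handed over by the caller; only read when absent.
        if prebuilt_footer is not None:
            self.footer = prebuilt_footer
        else:
            self.footer = read_footer(path, counter)


class PartitionReaderFactory:
    """The factory already reads the footer (e.g. for row-group filtering),
    so it passes that same object down rather than letting the reader
    perform a second lookup or consult a TTL cache."""

    def create_reader(self, path, counter):
        footer = read_footer(path, counter)  # the single footer read
        return RecordReader(path, prebuilt_footer=footer)
```

Compared with a cache, this keeps the footer's lifetime scoped to one task, so there is no consistency question to answer in the first place.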
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
