[GitHub] [spark] dbtsai commented on pull request #33748: [SPARK-36516][SQL] Support File Metadata Cache for ORC

GitBox Thu, 19 Aug 2021 23:54:42 -0700


dbtsai commented on pull request #33748:
URL: https://github.com/apache/spark/pull/33748#issuecomment-902475893



    > Everything depends on the data lifecycle. For safety, we can control 
it by reducing `spark.sql.fileMetaCache.ttlSinceLastAccessSec` to `10 secs` or 
less, which is still effective at avoiding the duplicate footer lookup within a 
single task.
   
   @dongjoon-hyun Could you elaborate on what the two footer lookups in a 
single task are? If they happen within a single task, the file is the same, so 
the lifetime of the cache can be limited to that single task, right?
   
   I thought the purpose of this PR was that, when the same ORC file is read by 
multiple tasks, the fileMetaCache avoids reading the footer multiple times, 
with the caveat that those tasks have to be scheduled on the same executor.
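
To make the executor-locality caveat concrete, here is a minimal Python sketch of a per-executor footer cache whose entries expire a fixed time after their last access. It is not Spark's implementation: `FooterCache`, `fake_read_footer`, and the injected clock are hypothetical names used only for illustration, and the 10-second TTL simply mirrors the value suggested for `spark.sql.fileMetaCache.ttlSinceLastAccessSec`.

```python
import time
from typing import Callable, Dict, Tuple


class FooterCache:
    """Hypothetical per-executor cache with expire-after-last-access semantics."""

    def __init__(self, ttl_since_last_access_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_since_last_access_sec
        self.clock = clock
        # path -> (footer bytes, time of last access)
        self.entries: Dict[str, Tuple[bytes, float]] = {}
        self.footer_reads = 0  # number of real footer reads (cache misses)

    def get(self, path: str, read_footer: Callable[[str], bytes]) -> bytes:
        now = self.clock()
        hit = self.entries.get(path)
        if hit is not None and now - hit[1] < self.ttl:
            footer = hit[0]              # still fresh: reuse the cached footer
        else:
            footer = read_footer(path)   # missing or expired: read it again
            self.footer_reads += 1
        self.entries[path] = (footer, now)  # every access resets the TTL
        return footer


def fake_read_footer(path: str) -> bytes:
    # Stand-in for an actual ORC footer read (assumption for this sketch).
    return f"footer-of-{path}".encode()


# A controllable clock so the example does not need to sleep.
now = [0.0]
executor_a = FooterCache(10.0, clock=lambda: now[0])
executor_b = FooterCache(10.0, clock=lambda: now[0])

executor_a.get("part-0.orc", fake_read_footer)  # task 1 on executor A: miss
executor_a.get("part-0.orc", fake_read_footer)  # task 2 on executor A: hit
executor_b.get("part-0.orc", fake_read_footer)  # task 3 on executor B: miss

now[0] = 25.0                                   # more than 10 s since last access
executor_a.get("part-0.orc", fake_read_footer)  # expired on executor A: miss
```

In this sketch the second task saves a footer read only because it lands on the same cache instance as the first; the task on the other executor reads the footer again, and so does any task that arrives after the TTL has elapsed. The sketch never evicts expired entries, which a real cache would need to do.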
   
   As @LuciferYang and @sunchao mentioned above, this requires adding something 
like a Presto coordinator to ensure the footer cache can be reused. I feel that 
is fairly complicated, and I don't know if it is worth it. For this use case, we 
might just use Iceberg, which stores the metadata in a separate manifest.

