[GitHub] spark pull request: [SPARK-8756][SQL] Keep cached information and ...

liancheng Sun, 19 Jul 2015 04:52:58 -0700

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/7154#issuecomment-122651410
  
    Thanks for contributing this patch! I have two high-level comments here:
    
    1. PR #7396 also tries to accelerate Parquet metadata discovery/refreshing 
by several means, and has proven quite effective. We've observed a 
~50x speedup on a large partitioned S3 dataset with schema merging enabled.
    2. How about adding a check for `FileStatus.getModificationTime`? Namely, 
we would only read footers of new files and of existing files that have been 
modified since the last refresh. This can be particularly useful for appending.
    
    In general, this PR can be a good complement to #7396.
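
The modification-time check suggested in the second comment can be sketched roughly as follows. This is only an illustration of the idea, not code from either PR: `FileStatusStub`, `FooterCache`, and the `read_footer` callback are hypothetical names, and the stub merely stands in for the path and modification time that Hadoop's `FileStatus` exposes. The sketch also assumes that comparing the recorded modification time with the current one is an acceptable way to detect "modified since the last refresh".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class FileStatusStub:
    """Hypothetical stand-in for Hadoop's FileStatus (path + modification time)."""
    path: str
    modification_time: int  # milliseconds since epoch


class FooterCache:
    """Keeps one footer per file and re-reads only new or modified files."""

    def __init__(self, read_footer: Callable[[str], str]) -> None:
        # read_footer is an assumed callback that performs the expensive read.
        self._read_footer = read_footer
        # path -> (modification time seen at last read, cached footer)
        self._entries: Dict[str, Tuple[int, str]] = {}

    def refresh(self, statuses: Iterable[FileStatusStub]) -> List[str]:
        """Update the cache from a fresh listing; return the paths re-read."""
        reread: List[str] = []
        fresh: Dict[str, Tuple[int, str]] = {}
        for status in statuses:
            cached = self._entries.get(status.path)
            if cached is not None and cached[0] == status.modification_time:
                # Unchanged since the last refresh: reuse the cached footer.
                fresh[status.path] = cached
            else:
                # New file, or its modification time changed: read the footer.
                footer = self._read_footer(status.path)
                fresh[status.path] = (status.modification_time, footer)
                reread.append(status.path)
        # Files missing from the listing are dropped from the cache.
        self._entries = fresh
        return reread

    def footers(self) -> Dict[str, str]:
        """Return the cached footers keyed by path."""
        return {path: footer for path, (_, footer) in self._entries.items()}
```

With a cache like this, appending a file to an already-scanned table costs one footer read on the next refresh rather than a re-read of every file, which is the appending case the comment highlights.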


