GitHub user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/13280#issuecomment-221982103
  
    I think we should upgrade Parquet to 1.8.1 in Spark 2.0 for the
following reasons:
    
    1. It gets PARQUET-251 fixed, so that we no longer write corrupted statistics
for string/binary columns and can re-enable filter push-down for these columns
(1.8.1 also ignores existing corrupted statistics automatically; see the sketch
after this list).
    2. The performance regression was discovered by a micro-benchmark that
simply scans the flat TPC-DS `store_sales` table (scale factor 15). However, we
are now using our own vectorized Parquet reader for flat tables, so at least
table scans over flat schemata shouldn't be affected at the moment.
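    
    As a minimal sketch of point 1 (the path and column name below are
hypothetical), once string/binary statistics are trustworthy again, a
pushed-down string predicate can be used to skip whole row groups:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    
    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
    
    // Parquet filter push-down is on by default; set explicitly here for clarity.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    
    // With PARQUET-251 fixed, the min/max statistics written for string/binary
    // columns are correct again, so this predicate can be pushed into parquet-mr
    // and used to skip row groups whose [min, max] range excludes "foo".
    spark.read.parquet("/tmp/t")      // hypothetical path
      .filter(col("s") === "foo")     // `s` is a hypothetical string column
      .count()
    ```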
    
    My only concern with this PR is that we should also add [the hack][1] done in
PR #9225. Otherwise there would be a noticeable performance regression for
queries like `SELECT COUNT(*) FROM t` (sketched below).
    
    [1]: 
https://github.com/apache/spark/pull/9225/files#diff-b4108187503e0f3ac64c1630d266b122R115
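    
    For reference, here is a minimal sketch of the idea behind that hack, written
against the parquet-mr 1.8 footer API (the helper name is hypothetical): for a
row-count-only scan, sum the per-row-group counts from the file footer instead
of materializing any records.
    
    ```scala
    import scala.collection.JavaConverters._
    
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    
    // Hypothetical helper: when no columns are requested (e.g. COUNT(*)), the
    // footer already records how many rows each row group contains, so we can
    // sum those counts without decoding a single page.
    def countRowsFromFooter(conf: Configuration, file: Path): Long = {
      val footer = ParquetFileReader.readFooter(conf, file)
      footer.getBlocks.asScala.map(_.getRowCount).sum
    }
    ```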

