Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21070
  
    @maropu, looking at the pushdown benchmark, it looks like ORC and Parquet 
either both benefit from pushdown or neither does. In some cases ORC is much 
faster, because ORC can skip reading pages, not just row groups. But when ORC 
benefits from pushdown, so does Parquet, for example in the 
`Select 1 int row (value = 7864320)` case.
    
    I think you were expecting the string comparison case to show a 
significant benefit over non-pushdown, but I would only expect that if ORC 
showed a similar benefit. Whether Parquet can eliminate row groups depends on 
how values are clustered in the file. If ORC didn't benefit either, then the 
data probably just isn't clustered in a way that helps.
    
    I'm not sure how you're generating data, but I'd recommend adding a 
sorted-column case with enough data to create multiple row groups (or stripes 
for ORC). Writing the data sorted means some row groups can be skipped, so you 
should see a speedup.
    
    Parquet also supports dictionary-based row group filtering. To test it, 
make sure you have a column that is entirely dictionary-encoded: pick a small 
set of values and randomly draw from that set. Then, if you search for a value 
that isn't in that set, you should see a speedup. Also make sure that you have 
`parquet.filter.dictionary.enabled=true` set in the Hadoop configuration so 
that Parquet uses dictionary filtering.
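    As a sketch of that setup (all names here are illustrative, not from the 
PR): draw every row from a small fixed set so the column stays fully 
dictionary-encoded, and note the skip check that dictionary filtering 
effectively performs.

```scala
import scala.util.Random

// Illustrative data for the dictionary-filtering case; names are assumptions.
object DictFilterData {
  // A small, fixed value set keeps the column entirely dictionary-encoded.
  val values: Vector[String] = Vector("apple", "banana", "cherry", "date")

  // Every row draws from the set, so no row group falls back to plain encoding.
  def dictColumn(n: Int, seed: Long = 7L): Vector[String] = {
    val rnd = new Random(seed)
    Vector.fill(n)(values(rnd.nextInt(values.length)))
  }

  // Dictionary filtering can skip a row group when the predicate value is
  // absent from that group's dictionary, e.g. a search for "zebra" here.
  def canSkipRowGroup(dictionary: Set[String], searchValue: String): Boolean =
    !dictionary.contains(searchValue)
}

// Remember to enable the feature on the Hadoop configuration, e.g.:
//   spark.sparkContext.hadoopConfiguration
//     .setBoolean("parquet.filter.dictionary.enabled", true)
```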

