Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070

@maropu, looking at the pushdown benchmark, it looks like ORC and Parquet either both benefit or both do not benefit from pushdown. In some cases ORC is much faster, which is due to the fact that ORC will skip reading pages, not just row groups. But when ORC benefits from pushdown, so does Parquet — for example the `Select 1 int row (value = 7864320)` case.

I think you were expecting a string comparison case to show a significant benefit over non-pushdown. But I would only expect that if ORC had a similar benefit, because the benefit depends on the clustering of values in the file: Parquet can only eliminate row groups whose min/max statistics exclude the predicate value. If ORC didn't benefit, then I would expect that the data just isn't clustered in a way that helps.

I'm not sure how you're generating data, but I'd recommend adding a sorted-column case with enough data to create multiple row groups (or stripes for ORC). Sorting clusters the values, so entire row groups can be skipped and you should see a speedup.

Parquet also supports dictionary-based row group filtering. To test that, make sure you have a column that is entirely dictionary-encoded: pick a small set of values and randomly draw from that set. Then if you search for a value that isn't in that set, you should see a speedup. Also make sure that you have `parquet.filter.dictionary.enabled=true` set in the Hadoop configuration so that Parquet uses dictionary filtering.
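To make the two pruning mechanisms concrete, here is a small self-contained sketch — a toy simulation, not Parquet's actual reader code — of how per-row-group min/max statistics prune a sorted column, and how a dictionary check prunes a low-cardinality column even when min/max stats are useless. The `RowGroup` class and `groups_to_read` function are hypothetical names invented for illustration:

```python
# Toy simulation of Parquet-style row-group pruning (NOT the real Parquet
# implementation): min/max statistics plus dictionary-based filtering.
from dataclasses import dataclass

@dataclass
class RowGroup:
    values: list  # the column chunk's values

    # Stats a writer would record per row group:
    @property
    def min(self): return min(self.values)
    @property
    def max(self): return max(self.values)
    @property
    def dictionary(self):
        # Stands in for a fully dictionary-encoded column chunk.
        return set(self.values)

def groups_to_read(row_groups, target):
    """Keep only row groups whose stats could contain `target`."""
    kept = []
    for rg in row_groups:
        if not (rg.min <= target <= rg.max):  # min/max pruning
            continue
        if target not in rg.dictionary:       # dictionary pruning
            continue
        kept.append(rg)
    return kept

# Sorted column split into 10 row groups of 100 values each: min/max
# stats alone prune every group except the one containing the target.
sorted_groups = [RowGroup(list(range(i, i + 100))) for i in range(0, 1000, 100)]
assert len(groups_to_read(sorted_groups, 250)) == 1

# Low-cardinality column drawn from a small value set {1, 5, 9}: min/max
# stats are useless (every group spans 1..9), but the dictionary still
# eliminates every group when the predicate value is absent from the set.
dict_groups = [RowGroup([1, 5, 9] * 33) for _ in range(10)]
assert len(groups_to_read(dict_groups, 7)) == 0
```

The second case is exactly why an unsorted, randomly generated column shows no pushdown benefit: its min/max stats cover the whole value range in every row group, so only the dictionary check can skip anything.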