Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r190378887
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
---
@@ -225,7 +226,8 @@ protected void initialize(String path, List<String> columns) throws IOException
     this.sparkSchema = new ParquetToSparkSchemaConverter(config).convert(requestedSchema);
     this.reader = new ParquetFileReader(
         config, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
-    for (BlockMetaData block : blocks) {
+    // use the blocks from the reader in case some do not match filters and will not be read
+    for (BlockMetaData block : reader.getRowGroups()) {
--- End diff --
Dictionary filtering is off by default in 1.8.x; it was enabled in 1.9.x after we built confidence in its correctness.
We should also backport this fix to 2.3.x, but the only downside of not having it is that dictionary filtering throws an exception when enabled, so the feature simply isn't available there.
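To illustrate why the diff iterates `reader.getRowGroups()` instead of the footer's block list: when a filter (such as dictionary filtering) drops row groups, the pre-filter list overcounts the rows the reader will actually deliver. The sketch below is plain Java with hypothetical types, not the real Parquet API, and simulates the mismatch under that assumption.

```java
import java.util.List;

// Minimal sketch (plain Java, not the Parquet API) of the bug this diff
// fixes: the total row count must come from the row groups the reader will
// actually read, not from the footer's pre-filter block list. The RowGroup
// record here is a stand-in for BlockMetaData.
public class RowGroupCount {
  record RowGroup(long rowCount, boolean matchesFilter) {}

  // Pre-filter view: sums over every row group in the footer.
  static long countFromFooter(List<RowGroup> blocks) {
    return blocks.stream().mapToLong(RowGroup::rowCount).sum();
  }

  // Post-filter view: sums only over the row groups the reader keeps,
  // analogous to iterating reader.getRowGroups().
  static long countFromReader(List<RowGroup> blocks) {
    return blocks.stream()
        .filter(RowGroup::matchesFilter)
        .mapToLong(RowGroup::rowCount)
        .sum();
  }

  public static void main(String[] args) {
    List<RowGroup> blocks = List.of(
        new RowGroup(100, true),
        new RowGroup(50, false),  // dropped by dictionary filtering
        new RowGroup(25, true));
    System.out.println(countFromFooter(blocks));  // 175: overcounts
    System.out.println(countFromReader(blocks));  // 125: rows actually read
  }
}
```

With the pre-filter count, the reader eventually runs out of rows before reaching the expected total, which is the failure mode the fix avoids.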
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]