ychernysh commented on PR #2937: URL: https://github.com/apache/drill/pull/2937#issuecomment-2325133637
@paul-rogers @cgivre

> I am unaware of any code in the Foreman that scans all the Parquet files. Is that something new?

Drill parallelizes scanning a Parquet table at the file, row group, column chunk and page levels. A single [ParquetRecordReader](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java) object (the key part of this issue, since it is what creates null-filled vectors for missing columns) reads one row group, so that is the level we are interested in. Each minor fragment is assigned a list of row groups (that is, a list of ParquetRecordReader objects) to read. The assignment happens in the Foreman at the parallelization phase ([Foreman#runPhysicalPlan:416](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L416), [AbstractParquetGroupScan#applyAssignments](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java#L170-L173)) and requires the file metadata to be known at that point (we need to know which row groups exist in order to assign them to the minor fragments). So the metadata is scanned earlier, at the physical plan creation phase ([Foreman#runSQL:594](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L594)), through the [ParquetTableMetadataProvider](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/metastore/store/parquet/ParquetTableMetadataProvider.java) interface (see [this code block](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L137-L157)), whose implementations can read either from the Drill Metastore or from the files themselves on the file system. Furthermore, the schema from that metadata (which is used in this PR) is also needed at the ScanBatch initialization phase (minor fragment initialization) for row group pruning (see [RowsMatch](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/RowsMatch.java#L20-L27) for the logic). See [this line](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java#L119) and [this one](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java#L206), where `rowGroupScan.getSchema()` is also used.

So the answer is: **no**, that's not something new. Rather, it is required both for basic Drill functionality (assigning the row groups to minor fragments) and for some specific functionality (row group pruning).

> Doing that in the Foreman places extreme load on the Foreman, causes remote reads in Hadoop, and will be slow.

As I said above, we cannot avoid it. But, for your information, the process of reading metadata from all the Parquet files is [parallelized](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/metadata/Metadata.java#L432-L439) across 16 threads within the Foreman JVM.
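Just to illustrate the shape of that parallelism, here is a minimal, self-contained sketch (not the actual `Metadata` code; `FileFooter` and `readFooter` are made-up stand-ins for reading one footer through parquet-mr):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFooterScan {

  /** Hypothetical per-file metadata holder; stands in for Drill's Parquet metadata classes. */
  static class FileFooter {
    final String path;
    FileFooter(String path) { this.path = path; }
  }

  /** Hypothetical single-footer read; in Drill this goes through parquet-mr against the file system. */
  static FileFooter readFooter(String path) {
    return new FileFooter(path);
  }

  /** Read all footers with a fixed-size pool, mirroring the Foreman-side parallelism. */
  static List<FileFooter> readAllFooters(List<String> paths) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(16); // 16 threads, as mentioned above
    try {
      List<Future<FileFooter>> futures = new ArrayList<>();
      for (String path : paths) {
        futures.add(pool.submit(() -> readFooter(path))); // one task per file footer
      }
      List<FileFooter> footers = new ArrayList<>();
      for (Future<FileFooter> future : futures) {
        footers.add(future.get()); // fail fast if any footer read failed
      }
      return footers;
    } finally {
      pool.shutdownNow();
    }
  }
}
```

So the footer reads are concurrent inside the Foreman JVM, but they are still Foreman-side reads, not distributed across drillbits.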
> The old, original Parquet schema cache did, in fact, read all the files in a folder, but did so in a distributed fashion, and wrote the result to a JSON file. (But without locking, so that two updates at the same time led to corruption.)

Are we talking about a new feature added in the last few years, or about the original Parquet schema cache? Based on the above, and on the fact that [ParquetTableMetadataProvider can use metadata cache files](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/metastore/store/parquet/ParquetTableMetadataProvider.java#L27-L32), I assume we are talking about the original Parquet schema cache. But I'm not sure...

> If this feature does exist, then we now have six ways that Drill handles schema (Parquet cache, provides schema, Drill Metastore, HMS, on-the-fly, and this new prescan of Parquet files).

So I guess we still have 5 ways; let me summarize my knowledge of each:

- Parquet cache: used in this PR
- [provided schema](https://drill.apache.org/docs/create-or-replace-schema/): only supports text files, not Parquet
- Drill Metastore: I have just checked, and all the errors described in the issue are still reproducible with it. It seems this schema is not passed down to ParquetRecordReader (just as the Parquet cache schema was not before this PR). In theory we could take this schema, but: 1) it requires users to run the Drill Metastore, while this PR does not; 2) this PR already does the same thing, just with the Parquet cache schema
- HMS: ???
- on-the-fly: not applicable to this solution, since we want to take advantage of the "neighbor" Parquet files, whereas this is the schema of only the single file being read

> - The Parquet reader has to properly handle the missing column to avoid a schema change. In Drill, "schema change" means "rebind the vectors and regenerate any operator-specific code."
> - If the first file does not have the column, create a dummy column of the type specified by the planner.
> - If the first file does have a given column, create a vector with the type specified by the planner, not the type specified by Parquet.
> - If a subsequent file does not have the column, reuse the prior vector, which should be of OPTIONAL mode.

Basically all of the above is met by this PR, just with slightly different wording (it avoids the terms _first/subsequent file_ because, as said before, the fix is order-agnostic):

1. If at least 1 Parquet file contains a selected column, then the null-filled vectors should have its minor type.
2. If at least 1 Parquet file does not have a selected column, or has it as OPTIONAL, then ALL of the readers are forced to return it as OPTIONAL.

(A sketch of this order-agnostic resolution is attached at the end of this comment.)

---

Apologies if I missed answering some of the questions. I am new to Drill and this is one of my first tasks on it, so I might not understand some things.

Regarding refactoring to EVF: I've never seen it before and am probably not the best person to implement it. I will try to research it, though, to better understand @paul-rogers's reviews. But I think we should at least agree on how to treat the issue: refactor everything to EVF (which, as far as I understand, would erase everything done in this PR), make the current Parquet reader simulate EVF as much as possible, or just solve the concrete problems described in the tickets. Note that this PR went the 3rd way.

@rymarm, maybe you could fill in the holes in my answer with your knowledge/experience?
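P.S. To make rules 1 and 2 above concrete, here is a rough, order-agnostic sketch of the resolution logic (self-contained stand-in types, not the actual PR code; it also assumes, for simplicity, that the files which do contain the column agree on its minor type):

```java
import java.util.List;

public class ColumnTypeResolution {

  // Minimal stand-ins for Drill's TypeProtos enums; not the real classes.
  enum MinorType { INT, BIGINT, VARCHAR }
  enum DataMode { REQUIRED, OPTIONAL }

  /** What one file's footer says about a selected column; type == null means the column is absent. */
  static class ColumnInFile {
    final MinorType type; // null if the file does not contain the column
    final DataMode mode;  // ignored when type == null
    ColumnInFile(MinorType type, DataMode mode) { this.type = type; this.mode = mode; }
  }

  /** The (type, mode) every reader of the scan must use for this column. */
  static class Resolved {
    final MinorType type; // null means no file has the column: null-filled vectors fall back to a default type
    final DataMode mode;
    Resolved(MinorType type, DataMode mode) { this.type = type; this.mode = mode; }
  }

  static Resolved resolve(List<ColumnInFile> perFile) {
    MinorType type = null;
    DataMode mode = DataMode.REQUIRED;
    for (ColumnInFile c : perFile) {
      if (c.type == null || c.mode == DataMode.OPTIONAL) {
        mode = DataMode.OPTIONAL; // rule 2: any absence (or OPTIONAL declaration) forces OPTIONAL for all readers
      }
      if (type == null && c.type != null) {
        type = c.type;            // rule 1: null-filled vectors take the minor type seen in some file
      }
    }
    return new Resolved(type, mode);
  }
}
```

The point is only that the result does not depend on the order in which the files are visited, which is why the _first/subsequent file_ wording is not needed.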