[ https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878635#comment-17878635 ]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------

ychernysh commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2325133637

@paul-rogers @cgivre

> I am unaware of any code in the Foreman that scans all the Parquet files. Is that something new?

Drill parallelizes scanning a parquet table at the file, row group, column chunk and page levels. A single [ParquetRecordReader](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java) object (which is the key part of the issue, since it creates null-filled vectors for missing columns) reads one row group, so this is the level we're interested in. Each Minor Fragment is assigned a list of row groups (that is, a list of ParquetRecordReader objects) to read. The assignment happens in the Foreman at the parallelization phase ([Foreman#runPhysicalPlan:416](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L416), [AbstractParquetGroupScan#applyAssignments](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetGroupScan.java#L170-L173)) and requires the file metadata to be known at that phase: we need to know which row groups exist in order to assign them to the minor fragments (see the assignment sketch below). So the metadata is scanned back at the physical plan creation phase ([Foreman#runSQL:594](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L594)) using the [ParquetTableMetadataProvider](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/metastore/store/parquet/ParquetTableMetadataProvider.java) interface (see [this code block](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L137-L157)), whose implementations can read either from the Drill Metastore or from the files in the filesystem themselves.

Furthermore, the schema from the metadata (which is used in this PR) is also needed at the ScanBatch initialization phase (minor fragment initialization) for row group pruning (see [RowsMatch](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/RowsMatch.java#L20-L27) for the logic). See [this line](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java#L119) and [this one](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java#L206), where `rowGroupScan.getSchema()` is also used.

So, the answer is: **no**, that's not something new. Rather, it is required for some basic Drill functionality (assigning the row groups to minor fragments), and also for some specific functionality (row group pruning).

> Doing that in the Foreman places extreme load on the Foreman, causes remote reads in Hadoop, and will be slow.

As I said above, we cannot avoid it. But, for your information, the process of reading metadata from all the parquet files is [parallelized](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/metadata/Metadata.java#L432-L439) across 16 threads within the Foreman JVM.
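To illustrate that last point, here is a minimal, self-contained sketch of fanning footer reads out over a fixed pool of 16 threads. This is not Drill's actual code: `readFooter` here is a hypothetical stand-in for the real footer read performed in `Metadata.java`.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFooterScan {

  // Hypothetical stand-in: pretend this reads one file's parquet footer.
  static Object readFooter(Path file) {
    return new Object();
  }

  static List<Object> scan(List<Path> files) throws Exception {
    // 16 threads, matching the pool size mentioned above.
    ExecutorService pool = Executors.newFixedThreadPool(16);
    try {
      List<Future<Object>> futures = new ArrayList<>();
      for (Path file : files) {
        futures.add(pool.submit(() -> readFooter(file)));
      }
      List<Object> footers = new ArrayList<>();
      for (Future<Object> future : futures) {
        footers.add(future.get()); // propagates any per-file read failure
      }
      return footers;
    } finally {
      pool.shutdown();
    }
  }
}
```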
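And here is the assignment sketch referenced above: a hypothetical round-robin illustration of why the Foreman must already know every row group before it can hand them out to minor fragments. The real `AbstractParquetGroupScan#applyAssignments` is more involved (it considers, e.g., which drillbit holds the data locally); this only shows the basic idea.

```java
import java.util.ArrayList;
import java.util.List;

class RowGroupAssigner {

  /** rowGroups: one entry per row group discovered in the table's metadata. */
  static List<List<String>> assign(List<String> rowGroups, int minorFragments) {
    List<List<String>> perFragment = new ArrayList<>();
    for (int i = 0; i < minorFragments; i++) {
      perFragment.add(new ArrayList<>());
    }
    // Without the full row group list (i.e. the metadata), there is
    // nothing to distribute -- hence the prescan in the Foreman.
    for (int i = 0; i < rowGroups.size(); i++) {
      perFragment.get(i % minorFragments).add(rowGroups.get(i));
    }
    return perFragment;
  }
}
```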
> The old, original Parquet schema cache did, in fact, read all the files in a folder, but did so in a distributed fashion, and wrote the result to a JSON file. (But without locking, so that two updates at the same time led to corruption.) Are we talking about a new feature recently added in the last few years? Or, about the original Parquet schema cache?

Based on the above, and on the fact that [ParquetTableMetadataProvider can use metadata cache files](https://github.com/apache/drill/blob/drill-1.21.2/exec/java-exec/src/main/java/org/apache/drill/exec/metastore/store/parquet/ParquetTableMetadataProvider.java#L27-L32), I assume we're talking about the original Parquet schema cache. But I'm not sure...

> If this feature does exist, then we now have six ways that Drill handles schema (Parquet cache, provides schema, Drill Metastore, HMS, on-the-fly, and this new prescan of Parquet files).

So, I guess we still have 5 ways (since this PR's "prescan" is just the parquet cache), and let me summarize my knowledge of each:
- parquet cache: used in this PR
- [provided schema](https://drill.apache.org/docs/create-or-replace-schema/): only supports text files, not parquet
- Drill Metastore: I have just checked, and all the errors described in the issue are still reproducible this way. It seems this schema is not passed down to ParquetRecordReader (nor was the parquet cache schema before this PR). In theory, we could take this schema, but: 1) it requires users to use the Drill Metastore, while this PR does not; 2) this PR has already done the same, but with the parquet cache schema
- HMS: ???
- on-the-fly: not applicable to this solution, since we want to take advantage of the "neighbor" parquet files, while this is the schema for a single file being read

> - The Parquet reader has to properly handle the missing column to avoid a schema change. In Drill, "schema change" means "rebind the vectors and regenerate any operator-specific code."
> - If the first file does not have the column, create a dummy column of the type specified by the planner.
> - If the first file does have a given column, create a vector with the type specified by the planner, not the type specified by Parquet.
> - If a subsequent file does not have the column, reuse the prior vector, which should be of OPTIONAL mode.

Basically all of the above is met by this PR, but with slightly different wording (not using the terms _first/subsequent file_, since, as said before, the fix is order-agnostic); a code sketch of both rules follows at the end of this comment:
1. If at least one parquet file contains a selected column, then the null-filled vectors should have its minor type
2. If at least one parquet file does not have a selected column, or has it as OPTIONAL, then ALL of the readers are forced to return it as OPTIONAL

---

My apologies if I missed answering some of the questions. I am new to Drill and this is one of my first tasks on it, so I might not understand some things.

Regarding refactoring to EVF: I've never seen it before, and I'm probably not the best person to implement it. I will try to research it, though, to better understand @paul-rogers's reviews. But I think we should at least agree on how we would treat the issue: refactoring everything to EVF (which, as far as I understand, would erase everything made in this PR), making the current parquet reader simulate EVF as much as possible, or just solving the concrete problems described in the tickets. Note that this PR went the third way.

@rymarm, maybe you could fill in the holes in my answer with your knowledge/experience?
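P.S. Here is the code sketch of rules 1 and 2 promised above. It is a minimal illustration of the merge logic, not the PR's actual code; it assumes only Drill's protobuf-generated `TypeProtos` types.

```java
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.drill.common.types.TypeProtos.MajorType;
import org.apache.drill.common.types.TypeProtos.MinorType;
import java.util.List;

class ColumnTypeResolver {

  /**
   * Resolves the type every reader must produce for one selected column.
   * fileTypes holds the column's type per parquet file, or null where the
   * file does not contain the column at all.
   */
  static MajorType resolve(List<MajorType> fileTypes) {
    MinorType minor = MinorType.INT;  // fallback: the legacy null-filled INT
    DataMode mode = DataMode.REQUIRED;
    for (MajorType t : fileTypes) {
      if (t == null) {
        mode = DataMode.OPTIONAL;     // rule 2: column missing in some file
        continue;
      }
      minor = t.getMinorType();       // rule 1: take a real file's minor type
      if (t.getMode() == DataMode.OPTIONAL) {
        mode = DataMode.OPTIONAL;     // rule 2: OPTIONAL anywhere wins
      }
    }
    return MajorType.newBuilder().setMinorType(minor).setMode(mode).build();
  }
}
```

Note that the result does not depend on the order of `fileTypes`, which is what "order-agnostic" means above.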
> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
>              Key: DRILL-8507
>              URL: https://issues.apache.org/jira/browse/DRILL-8507
>          Project: Apache Drill
>       Issue Type: Bug
> Affects Versions: 1.21.2
>         Reporter: Yaroslav
>         Priority: Major
>      Attachments: people.tar.gz
>
> *NOTE*: I worked on this issue along with DRILL-8508. It turned out that it required this bug to get fixed first. And since these two are about slightly different things, it was decided to report them as separate issues. So, I'm going to link this issue as a requirement for that issue and open one PR for both (if it's allowed...). I think a single PR would make it easier to review the code, since the issues are quite related anyway.
>
> h3. Prerequisites
>
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a null-filled {{NullableIntVector}} with the column's name and the correct value count set. The field name for the vector is derived from the {{SchemaPath#toExpr}} method, which always enquotes the outcome string with backticks.
>
> h3. Problems
>
> This causes some wrong field name equality checks (comparing two strings of field names, non-quoted and quoted, returns false, but essentially is supposed to return true) that lead to errors.
>
> For example, the errors occur when you select a column from a table where some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} table with such parquet files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change. For example, {{ORDER BY}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] (state=,code=0)
> {code}
> The ORDER BY error clearly shows us that ``age`` is an extra column here, and the incoming schema should only have the unquoted field to match the previous schema.
>
> Another example is with the {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] (state=,code=0)
> {code}
> Again, the "different field counts" issue is caused by an extra quoted column that counts as a different field.
>
> h3. Solution
>
> The solution for these errors is to replace the {{SchemaPath#toExpr}} call with {{SchemaPath#getAsUnescapedPath}}, which doesn't enquote the outcome string. Simple enough, but note that we used to use {{SchemaPath#getAsUnescapedPath}} before DRILL-4264, where we switched to {{SchemaPath#toExpr}}.
> The author even put a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. But, honestly, I don't really understand the motivation for this, even with the comment. I don't really understand what "existing maps" means here. Maybe someone from the community can help, but I will further consider it a mistake, simply because it causes the above problems.
>
> h3. Regressions
>
> The change brings some regressions detected by unit tests. The ones I found fail because their execution flow changed after the fix in these places:
> * [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> * [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
>
> Before the fix, the field name equality checks returned false when comparing quoted and unquoted field names. The failing tests relied on that behavior and worked only under this condition. But after the fix, when we compare two unquoted names and return true, we fall into a different branch, and the tests aren't ready for that.
>
> But obviously the change _is fixing the problem_, so the tests that relied on that problem should now be adjusted. Please see more technical details on each failed unit test in the linked PR.
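To see the quoting mismatch described in this issue in isolation, here is a minimal sketch (my own illustration, not code from the PR) using the two {{SchemaPath}} methods discussed above:

{code:java}
import org.apache.drill.common.expression.SchemaPath;

// Minimal demo of the mismatch: toExpr() adds backticks, so the name of a
// null-filled vector never equals the plain column name.
public class BacktickMismatch {
  public static void main(String[] args) {
    SchemaPath col = SchemaPath.getSimplePath("age");

    String quoted = col.toExpr();               // `age` (backticks added)
    String unquoted = col.getAsUnescapedPath(); // age   (no quoting)

    System.out.println(quoted.equals("age"));   // false -> extra ``age`` field
    System.out.println(unquoted.equals("age")); // true  -> names unify
  }
}
{code}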