[ https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878255#comment-17878255 ]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------

paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2322376645

> @paul-rogers @ychernysh I'm wondering if it wouldn't be worth it to refactor the Parquet reader to use EVF2 rather than debug all this. I don't know what was involved, but I do know that refactoring all the other plugins to use EVF2 wasn't all that difficult, but I do know that Parquet is another ball game.

Doing so would be the ideal solution. The challenge has always been that the Parquet reader is horribly complex. I took a few cracks at refactoring it back in the day, but it is still pretty complex. The most challenging issue is that Parquet is parallel: it reads each column in a separate thread. MapR did a large number of hacks to maximize parallelism, so that code grew quite complex, with many nuances to saturate threads while not using too many resources overall: that is, to maximize parallelism within a single query, but also across the "thousands of concurrent queries" that were all the rage back in the day.

All other readers are row-based, since that is how most other data formats work. EVF is a row-based implementation. As a result, EVF would be difficult to reuse.

This brings up the reason that EVF was created in the first place: limiting batch size to prevent memory fragmentation. Back in the day, all readers read 64K records per batch, even if that resulted in huge vectors. EVF imposes a batch size limit and gracefully wraps up each batch, rolling over any "excess" data to the next one. In Parquet, that logic does not exist. That is, if we have, say, 20 column writers all busy building their own vectors, there is nothing to say, "hold on, we're over our 16 MB batch size limit". Instead, the readers just read _n_ rows, creating whatever size vectors are required. Read 1,000 values of 1 MB each and you need a 1 GB value vector.

The memory fragmentation issue arises because Drill's Netty-based memory manager handles allocations up to 32 MB (IIRC) from its binary-buddy free list. Beyond that, every request comes from the OS. Netty does not release memory back to the OS if even a single byte of a 32 MB block is in use. Eventually, all memory resides in the Netty free list, and we reach the OS allocation limit. As a result, we can have 100% of the Netty pool free, but no OS capacity to allocate another 64 MB vector, and we get an OOM. The only recourse is to restart Drill to return memory to the OS.

While we long ago fixed the fragmentation issues in the other readers (via EVF) and in other operators (using the "temporary" "batch sizer" hack), it may be that Parquet still suffers from memory fragmentation because of its unique, parallel structure.

Still, perhaps there is some way to have EVF handle the schema and vector management stuff, but disable the row-oriented batch size checks and let the Parquet readers write as much data as they want to each vector (fragmenting memory if they choose to do so). Or, maybe work out some way to give each column reader a "lease" to read up to x MB. EVF can handle the work needed to copy "extra" data over to a new vector for the next batch. I'd have to try to swap EVF knowledge back into the ole' brain to sort this out.

All this said, I can certainly see the argument for hacking the existing code just to get done. I guess I'd just suggest that the hacks at least reuse the rules we already worked out for EVF, even if they can't reuse the code.
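To make the batch-size point above concrete, here is a rough, hypothetical sketch (an editor's illustration, not actual EVF or Drill code; all class and method names are invented) of the row-based budget check that EVF performs and that the parallel Parquet column readers lack: wrap up the in-flight batch once a byte limit would be exceeded, and roll the overflowing row into the next batch.

{code:java}
// Hypothetical sketch only: not Drill or EVF code. Illustrates the
// "stop at a byte budget and roll over the excess row" idea described above.
import java.util.ArrayList;
import java.util.List;

public class BatchBudgetSketch {

  // Hypothetical limit; the real EVF limit is configurable.
  static final long BATCH_LIMIT_BYTES = 16L * 1024 * 1024;

  // Stand-in for one row of variable-width column values.
  static long sizeOf(byte[][] row) {
    long size = 0;
    for (byte[] col : row) {
      size += col.length;
    }
    return size;
  }

  // Reads rows into batches, never letting a batch exceed BATCH_LIMIT_BYTES.
  // The row that would cross the limit starts the next batch instead.
  static List<List<byte[][]>> readInBatches(Iterable<byte[][]> source) {
    List<List<byte[][]>> batches = new ArrayList<>();
    List<byte[][]> current = new ArrayList<>();
    long used = 0;
    for (byte[][] row : source) {
      long rowSize = sizeOf(row);
      if (!current.isEmpty() && used + rowSize > BATCH_LIMIT_BYTES) {
        batches.add(current);          // wrap up the full batch
        current = new ArrayList<>();   // the overflow row opens a new batch
        used = 0;
      }
      current.add(row);
      used += rowSize;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<byte[][]> rows = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      rows.add(new byte[][] { new byte[1024 * 1024] }); // ~1 MB per row
    }
    // With a 16 MB limit, 100 rows of ~1 MB each split into 7 batches.
    System.out.println(readInBatches(rows).size());
  }
}
{code}

A column-per-thread reader has no natural place for a check like this, since each column writer fills its own vector independently, which is why the suggestion above is to either disable the row-oriented check under EVF or hand each column reader a byte "lease" instead.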
All of this is premised on the notion that someone did, recently, add a "Parquet prescan" to the planner, and that someone added column type propagation to the Calcite planner. Was this actually done? Or are we somehow assuming it was done? Are we confusing this with the old Parquet schema cache? Again, I've been out of the loop, so I'm just verifying that I understand the situation.

> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
>
> {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that this bug needed to be fixed first. Since the two issues cover slightly different things, it was decided to report them separately. So I'm going to link this issue as a requirement for that one and open one PR for both (if that's allowed). I think a single PR makes the code easier to review, since the issues are closely related anyway.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a null-filled {{NullableIntVector}} with the column's name and the correct value count set. The field name for the vector is derived from the {{SchemaPath#toExpr}} method, which always wraps the resulting string in backticks.
> h3. Problems
> This breaks field-name equality checks: comparing the unquoted and the backtick-quoted form of the same name returns false where it should effectively return true, and that leads to errors.
> For example, the errors occur when you select a column from a table where some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} table with such parquet files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change, for example {{ORDER BY}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] (state=,code=0)
> {code}
> The ORDER BY error clearly shows that ``age`` is an extra column here; the incoming schema should contain only the unquoted field so that it matches the previous schema.
> Another example is the {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] (state=,code=0)
> {code}
> Again, the "different field count" issue is caused by the extra quoted column being counted as a separate field.
> h3. Solution
> The solution for these errors would be to replace the {{SchemaPath#toExpr}} call with {{SchemaPath#getAsUnescapedPath}}, which doesn't quote the resulting string. Simple enough, but note that we used {{SchemaPath#getAsUnescapedPath}} before DRILL-4264, where we switched to {{SchemaPath#toExpr}}. The author even left a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. But, honestly, I don't really understand the motivation for it, even with the comment; in particular, I don't understand what "existing maps" refers to here. Maybe someone from the community can help, but I will treat it as a mistake from here on, simply because it causes the problems above.
> h3. Regressions
> The change brings some regressions detected by unit tests. The tests I found fail because the fix changes their execution flow in these places:
> * [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> * [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
> Before the fix, the field-name equality checks returned false when comparing quoted and unquoted field names. The failing tests relied on that behavior and worked only under that condition. After the fix, when the two names compare equal and the check returns true, execution falls into a different branch, and the tests aren't prepared for that.
> But the change obviously _is fixing the problem_, so the tests that relied on that problem should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
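As an editor's illustration of the mismatch the quoted report describes (not code from the issue or the PR): {{SchemaPath#toExpr}} names the null-filled vector with backticks while the column that actually exists in other files carries the plain name, so a straightforward string comparison of the two names fails. Plain strings stand in for the field names below.

{code:java}
// Illustration only: plain strings stand in for the field names involved.
public class BacktickMismatch {
  public static void main(String[] args) {
    String existingField = "age";   // column materialized from 0.parquet
    String missingField = "`age`";  // null-filled vector named via SchemaPath#toExpr
    String fixedField = "age";      // name produced by SchemaPath#getAsUnescapedPath

    // The quoted name never matches the plain one, so the batch appears to hold
    // two distinct "age" columns. That is what trips External Sort ("schema
    // change") and UNION ALL ("different field count") in the examples above.
    System.out.println(existingField.equals(missingField)); // false
    System.out.println(existingField.equals(fixedField));   // true
  }
}
{code}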