[ https://issues.apache.org/jira/browse/DRILL-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878255#comment-17878255 ]
ASF GitHub Bot commented on DRILL-8507:
---------------------------------------

paul-rogers commented on PR #2937:
URL: https://github.com/apache/drill/pull/2937#issuecomment-2322376645

> @paul-rogers @ychernysh I'm wondering if it wouldn't be worth it to refactor the Parquet reader to use EVF2 rather than debug all this. I don't know what was involved, but I do know that refactoring all the other plugins to use EVF2 wasn't all that difficult, but I do know that Parquet is another ball game.

Doing so would be the ideal solution. The challenge has always been that the Parquet reader is horribly complex. I took a few cracks at refactoring it back in the day, but it is still pretty complex. The most challenging issue is that Parquet is parallel: it reads each column in a separate thread. MapR did a large number of hacks to maximize parallelism, so that code grew quite complex, with many nuances to saturate threads while not using too many resources overall: that is, to maximize parallelism within a single query, but also across the "thousands of concurrent queries" that were all the rage back in the day.

All other readers are row-based, since that is how most other data formats work. EVF is a row-based implementation. As a result, EVF would be difficult to reuse.

This brings up the reason that EVF was created in the first place: limiting batch size to prevent memory fragmentation. Back in the day, all readers read 64K records per batch, even if that resulted in huge vectors. EVF imposes a batch size limit and gracefully wraps up each batch, rolling over any "excess" data to the next one. In Parquet, that logic does not exist. That is, if we have, say, 20 column writers all busy building their own vectors, there is nothing to say, "hold on, we're over our 16 MB batch size limit". Instead, the readers just read _n_ rows, creating whatever size vectors are required. Read 1,000 values of 1 MB each and you need a 1 GB value vector.

The memory fragmentation issue arises because Drill's Netty-based memory manager handles allocations up to 32 MB (IIRC) from its binary-buddy free list. Beyond that, every request comes from the OS. Netty does not release memory back to the OS if even a single byte of a 32 MB block is in use. Eventually, all memory resides in the Netty free list, and we reach the OS allocation limit. As a result, we can have 100% of the Netty pool free, but no OS capacity to allocate another 64 MB vector, and we get an OOM. The only recourse is to restart Drill to return memory to the OS.

While we long ago fixed the fragmentation issues in the other readers (via EVF) and in other operators (using the "temporary" "batch sizer" hack), it may be that Parquet still suffers from memory fragmentation because of its unique, parallel structure.

Still, perhaps there is some way to have EVF handle the schema and vector management stuff, but disable the row-oriented batch size checks and let the Parquet readers write as much data as they want to each vector (fragmenting memory if they choose to do so). Or, maybe work out some way to give each column reader a "lease" to read up to x MB. EVF can handle the work needed to copy "extra" data over to a new vector for the next batch. I'd have to try to swap EVF knowledge back into the ole' brain to sort this out.

All this said, I can certainly see the argument for hacking the existing code just to get done. I guess I'd just suggest that the hacks at least reuse the rules we already worked out for EVF, even if they can't reuse the code.
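To make the batch-size point above concrete, here is a rough, hypothetical sketch (an editor's illustration, not actual EVF or Drill code; all class and method names are invented) of the row-based budget check that EVF performs and that the parallel Parquet column readers lack: wrap up the in-flight batch once a byte limit would be exceeded, and roll the overflowing row into the next batch.

{code:java}
// Hypothetical sketch only: not Drill or EVF code. Illustrates the
// "stop at a byte budget and roll over the excess row" idea described above.
import java.util.ArrayList;
import java.util.List;

public class BatchBudgetSketch {

  // Hypothetical limit; the real EVF limit is configurable.
  static final long BATCH_LIMIT_BYTES = 16L * 1024 * 1024;

  // Stand-in for one row of variable-width column values.
  static long sizeOf(byte[][] row) {
    long size = 0;
    for (byte[] col : row) {
      size += col.length;
    }
    return size;
  }

  // Reads rows into batches, never letting a batch exceed BATCH_LIMIT_BYTES.
  // The row that would cross the limit starts the next batch instead.
  static List<List<byte[][]>> readInBatches(Iterable<byte[][]> source) {
    List<List<byte[][]>> batches = new ArrayList<>();
    List<byte[][]> current = new ArrayList<>();
    long used = 0;
    for (byte[][] row : source) {
      long rowSize = sizeOf(row);
      if (!current.isEmpty() && used + rowSize > BATCH_LIMIT_BYTES) {
        batches.add(current);          // wrap up the full batch
        current = new ArrayList<>();   // the overflow row opens a new batch
        used = 0;
      }
      current.add(row);
      used += rowSize;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<byte[][]> rows = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      rows.add(new byte[][] { new byte[1024 * 1024] }); // ~1 MB per row
    }
    // With a 16 MB limit, 100 rows of ~1 MB each split into 7 batches.
    System.out.println(readInBatches(rows).size());
  }
}
{code}

A column-per-thread reader has no natural place for a check like this, since each column writer fills its own vector independently, which is why the suggestion above is to either disable the row-oriented check under EVF or hand each column reader a byte "lease" instead.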
All of this is premised on the notion that someone did, recently, add a "Parquet prescan" to the planner, and that someone added column type propagation to the Calcite planner. Was this actually done? Or are we somehow assuming it was done? Are we confusing this with the old Parquet schema cache? Again, I've been out of the loop, so I'm just verifying that I understand the situation.

> Missing parquet columns quoted with backticks conflict with existing ones
> -------------------------------------------------------------------------
>
> Key: DRILL-8507
> URL: https://issues.apache.org/jira/browse/DRILL-8507
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.21.2
> Reporter: Yaroslav
> Priority: Major
> Attachments: people.tar.gz
>
>
> {*}NOTE{*}: I worked on this issue along with DRILL-8508. It turned out that this bug needed to be fixed first. Since the two issues cover slightly different things, it was decided to report them separately. So I'm going to link this issue as a requirement for that one and open one PR for both (if that's allowed). I think a single PR makes the code easier to review, since the issues are closely related anyway.
> h3. Prerequisites
> If a {{ParquetRecordReader}} doesn't find a selected column, it creates a null-filled {{NullableIntVector}} with the column's name and the correct value count set. The field name for the vector is derived from the {{SchemaPath#toExpr}} method, which always wraps the resulting string in backticks.
> h3. Problems
> This breaks field-name equality checks: comparing the unquoted and the backtick-quoted form of the same name returns false where it should effectively return true, and that leads to errors.
> For example, the errors occur when you select a column from a table where some parquet files contain it and some do not. Consider a {{dfs.tmp.people}} table with such parquet files and their schemas:
> {code:java}
> /tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | age<INT(OPTIONAL)>
> /tmp/people/1.parquet: id<INT(REQUIRED)>{code}
> Now let's try to use an operator that doesn't support schema change, for example {{ORDER BY}}:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.
> Previous schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)], [``age`` (INT:OPTIONAL)]], selectionVector=NONE]
> Fragment: 0:0
> [Error Id: d3efffd4-cf31-46d5-9f6a-141a61e71d12 on node2.vmcluster.com:31010] (state=,code=0)
> {code}
> The ORDER BY error clearly shows that ``age`` is an extra column here; the incoming schema should contain only the unquoted field so that it matches the previous schema.
> Another example is the {{UNION ALL}} operator:
> {code:java}
> apache drill> SELECT age FROM dfs.tmp.people UNION ALL (VALUES (1));
> Error: SYSTEM ERROR: IllegalArgumentException: Input batch and output batch have different field counthas!
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 81680275-92ee-4d1b-93b5-14f4068990eb on node2.vmcluster.com:31010] (state=,code=0)
> {code}
> Again, the "different field count" issue is caused by the extra quoted column being counted as a separate field.
> h3. Solution
> The solution for these errors would be to replace the {{SchemaPath#toExpr}} call with {{SchemaPath#getAsUnescapedPath}}, which doesn't quote the resulting string. Simple enough, but note that we used {{SchemaPath#getAsUnescapedPath}} before DRILL-4264, where we switched to {{SchemaPath#toExpr}}. The author even left a comment:
> {code:java}
> // col.toExpr() is used here as field name since we don't want to see these fields in the existing maps{code}
> So it looks like moving to {{SchemaPath#toExpr}} was a conscious decision. But, honestly, I don't really understand the motivation for it, even with the comment; in particular, I don't understand what "existing maps" refers to here. Maybe someone from the community can help, but I will treat it as a mistake from here on, simply because it causes the problems above.
> h3. Regressions
> The change brings some regressions detected by unit tests. The tests I found fail because the fix changes their execution flow in these places:
> * [FieldIdUtil#getFieldId|https://github.com/apache/drill/blob/7c813ff6440a118de15f552145b40eb07bb8e7a2/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/FieldIdUtil.java#L206]
> * [ScanBatch$Mutator#addField|https://github.com/apache/drill/blob/097a4717ac998ec6bf3c70a99575c7ff53f47430/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L523-L524]
> Before the fix, the field-name equality checks returned false when comparing quoted and unquoted field names. The failing tests relied on that behavior and worked only under that condition. After the fix, when the two names compare equal and the check returns true, execution falls into a different branch, and the tests aren't prepared for that.
> But the change obviously _is fixing the problem_, so the tests that relied on that problem should now be adjusted.
> Please see more technical details on each failed unit test in the linked PR.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
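As an editor's illustration of the mismatch the quoted report describes (not code from the issue or the PR): {{SchemaPath#toExpr}} names the null-filled vector with backticks while the column that actually exists in other files carries the plain name, so a straightforward string comparison of the two names fails. Plain strings stand in for the field names below.

{code:java}
// Illustration only: plain strings stand in for the field names involved.
public class BacktickMismatch {
  public static void main(String[] args) {
    String existingField = "age";   // column materialized from 0.parquet
    String missingField = "`age`";  // null-filled vector named via SchemaPath#toExpr
    String fixedField = "age";      // name produced by SchemaPath#getAsUnescapedPath

    // The quoted name never matches the plain one, so the batch appears to hold
    // two distinct "age" columns. That is what trips External Sort ("schema
    // change") and UNION ALL ("different field count") in the examples above.
    System.out.println(existingField.equals(missingField)); // false
    System.out.println(existingField.equals(fixedField));   // true
  }
}
{code}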