[jira] [Updated] (DRILL-8508) Choosing the best suitable major type for a partially missing parquet column

Yaroslav (Jira) Wed, 28 Aug 2024 08:54:06 -0700


     [ 
https://issues.apache.org/jira/browse/DRILL-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yaroslav updated DRILL-8508:
----------------------------
    Description: 
{*}NOTE{*}: This issue assumes that the DRILL-8507 bug is fixed first. 
Please do not proceed with this one until that issue is resolved.
h3. Prerequisites

If a {{ParquetRecordReader}} doesn't find a selected column, it creates a 
null-filled {{NullableIntVector}} with the column's name and the correct value 
count set.
h3. Problems

Hardcoding the minor type (INT) leads to SchemaChangeExceptions and type cast 
exceptions. The former also occur because of the data mode change (REQUIRED -> 
OPTIONAL). Consider a {{dfs.tmp.people}} table with the following parquet 
files and their schemas:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
age<INT(REQUIRED)>
/tmp/people/1.parquet: id<INT(REQUIRED)>{code}
The following query against that table fails because of the minor type change 
(VARCHAR -> INT):
{code:java}
apache drill> SELECT name FROM dfs.tmp.people ORDER BY name;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`name` (VARCHAR:OPTIONAL)]], 
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`name` (INT:OPTIONAL)]], 
selectionVector=NONE]
Fragment: 0:0
[Error Id: 97625816-0a07-410e-87b1-1d461fb8f00d on node2.vmcluster.com:31010] 
(state=,code=0)
{code}
And the following query fails because of the data mode change (REQUIRED -> 
OPTIONAL):
{code:java}
apache drill> SELECT age FROM dfs.tmp.people ORDER BY age;
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.
Previous schema: BatchSchema [fields=[[`age` (INT:REQUIRED)]], 
selectionVector=NONE]
Incoming schema: BatchSchema [fields=[[`age` (INT:OPTIONAL)]], 
selectionVector=NONE]
Fragment: 0:0
[Error Id: adce5b82-331c-410d-87f4-c8fc1ba943e6 on node2.vmcluster.com:31010] 
(state=,code=0)
{code}
Note that the last query would also fail if both parquet files contained the 
column, but one had it as REQUIRED and the other as OPTIONAL, as here:
{code:java}
/tmp/people/0.parquet: id<INT(REQUIRED)> | name<VARCHAR(OPTIONAL)> | 
age<INT(REQUIRED)>
/tmp/people/1.parquet: id<INT(REQUIRED)> | age<INT(OPTIONAL)>
{code}
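The failures above can be reproduced on paper with a toy model. The sketch below is not Drill code: the class, its constants and the "MINOR:MODE" strings are hypothetical stand-ins. It assumes that a reader reports the file's own type when the column exists and the hardcoded INT:OPTIONAL otherwise, and it flags a schema change whenever two readers disagree.

```java
import java.util.Map;

// Toy model only: hypothetical stand-ins, not Drill classes.
// A file schema maps a column name to a "MINOR:MODE" string.
class CurrentMissingColumnBehavior {

    // Schemas of the two example files from the description.
    static final Map<String, String> FILE_0 = Map.of(
        "id", "INT:REQUIRED", "name", "VARCHAR:OPTIONAL", "age", "INT:REQUIRED");
    static final Map<String, String> FILE_1 = Map.of("id", "INT:REQUIRED");

    // Variant of 1.parquet that has the age column, but as OPTIONAL.
    static final Map<String, String> FILE_1_OPTIONAL_AGE = Map.of(
        "id", "INT:REQUIRED", "age", "INT:OPTIONAL");

    // Type a reader reports today: the file's own type if the column exists,
    // otherwise the hardcoded type of a null-filled NullableIntVector.
    static String reportedType(Map<String, String> fileSchema, String column) {
        return fileSchema.getOrDefault(column, "INT:OPTIONAL");
    }

    // True if the two readers would report different types for the column.
    static boolean schemaChanges(Map<String, String> first,
                                 Map<String, String> second,
                                 String column) {
        return !reportedType(first, column).equals(reportedType(second, column));
    }

    public static void main(String[] args) {
        for (String column : new String[] {"id", "name", "age"}) {
            System.out.println(column + ": " + reportedType(FILE_0, column)
                + " -> " + reportedType(FILE_1, column)
                + (schemaChanges(FILE_0, FILE_1, column) ? " (schema change)" : ""));
        }
    }
}
```

In this model only {{id}} keeps the same type across both files; {{name}} changes its minor type and {{age}} changes its data mode, matching the two errors shown above.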
h3. Solution idea

Note that all of the cases above involve a {_}partially missing column{_}, 
meaning that some of the parquet files in a queried table have the column and 
others do not (or have it as OPTIONAL). If none of the files contained the 
column ({_}completely missing{_}), we would have no way to guess the major 
type and could only default to INT:OPTIONAL.

But the partially missing case is different: the correct minor type exists in 
the parquet files that have the column (and the data mode is necessarily 
OPTIONAL, since we create a null-filled vector). So, in theory, we could take 
the minor type from there and create a null-filled vector of that type for the 
missing column.

The solution idea suggested here is based on the fact that {_}the schemas of 
all the parquet files to read are available at the planning phase in the 
foreman{_}. Furthermore, they are already passed to each minor fragment (and 
its parquet readers), so the only thing left to do is to take the minor type 
from there and use it for missing columns. For the data mode issue, however, we 
also need to detect the partially missing column case and force all the readers 
to return the column as OPTIONAL (even if a particular reader has it as 
REQUIRED).

*Expected behavior*

So this ticket aims to introduce 2 new rules into Drill's behavior for missing 
parquet columns:
 # If at least 1 parquet file contains a selected column, then the null-filled 
vectors should have its minor type
 # If at least 1 parquet file does not have a selected column, or has it as 
OPTIONAL, then ALL of the readers should return this column as OPTIONAL
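
The two rules can be sketched as a single pass over the schemas of all the files. This is only an illustration under stated assumptions, not the proposed patch: {{MissingColumnTypeResolver}}, its nested {{MinorType}}, {{DataMode}} and {{MajorType}} types, and the {{peopleTable()}} helper are hypothetical stand-ins rather than Drill's own classes. The sketch also assumes that all files that have the column agree on its minor type, so it simply takes the first one it sees.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not Drill's classes.
class MissingColumnTypeResolver {
    enum MinorType { INT, VARCHAR }
    enum DataMode { REQUIRED, OPTIONAL }
    record MajorType(MinorType minorType, DataMode mode) {}

    // Picks the major type that every reader should return for the column.
    static MajorType resolve(List<Map<String, MajorType>> fileSchemas, String column) {
        MinorType minorType = null;
        boolean optional = false;
        for (Map<String, MajorType> schema : fileSchemas) {
            MajorType type = schema.get(column);
            if (type == null) {
                optional = true;                  // rule 2: missing in this file
            } else {
                if (minorType == null) {
                    minorType = type.minorType(); // rule 1: reuse the real minor type
                }
                if (type.mode() == DataMode.OPTIONAL) {
                    optional = true;              // rule 2: OPTIONAL in this file
                }
            }
        }
        if (minorType == null) {
            // Completely missing column: nothing to infer, keep the default.
            return new MajorType(MinorType.INT, DataMode.OPTIONAL);
        }
        return new MajorType(minorType, optional ? DataMode.OPTIONAL : DataMode.REQUIRED);
    }

    // Schemas of 0.parquet and 1.parquet from the example table.
    static List<Map<String, MajorType>> peopleTable() {
        MajorType requiredInt = new MajorType(MinorType.INT, DataMode.REQUIRED);
        MajorType optionalVarchar = new MajorType(MinorType.VARCHAR, DataMode.OPTIONAL);
        return List.of(
            Map.of("id", requiredInt, "name", optionalVarchar, "age", requiredInt),
            Map.of("id", requiredInt));
    }

    public static void main(String[] args) {
        for (String column : List.of("id", "name", "age", "salary")) {
            System.out.println(column + " -> " + resolve(peopleTable(), column));
        }
    }
}
```

For the example table this resolves {{name}} to VARCHAR:OPTIONAL and {{age}} to INT:OPTIONAL for every reader, leaves {{id}} as INT:REQUIRED, and falls back to INT:OPTIONAL for a column such as {{salary}} that no file contains.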

 

 

> Choosing the best suitable major type for a partially missing parquet column
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-8508
>                 URL: https://issues.apache.org/jira/browse/DRILL-8508
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Yaroslav
>            Priority: Major
>         Attachments: people.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
