[
https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303863#comment-16303863
]
Vitalii Diravka commented on DRILL-5970:
----------------------------------------
[~Paul.Rogers]
The schema of second file is the following:
{code}
message root {
  optional int32 MYCOL;
  required binary Bucket (UTF8);
}
{code}
So you are right regarding the nullable list and the schema change, but that is
actually a different issue.
My intention is to make Drill stop using the OPTIONAL data mode for Parquet
REQUIRED columns. Your question was "is this a feature or a bug?".
Let's take a query without nullable list:
{code}
CREATE TABLE dfs.tmp.bof_repro_3 as select * from (select 'text' AS MYCOL,
'Bucket1' AS Bucket FROM (VALUES(1)));
{code}
Schema:
{code}
message root {
  required binary MYCOL (UTF8);
  required binary Bucket (UTF8);
}
{code}
After copying this file into the "bof_repro_1" directory and querying that
directory, we get the following result:
{code}
1 row(s):
-------------------------------------------------------
| MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
-------------------------------------------------------
| ["hello","hai"] | Bucket1 |
-------------------------------------------------------
1 row(s):
-------------------------------------------------------
| MYCOL<VARCHAR(REQUIRED)> | Bucket<VARCHAR(REQUIRED)>|
-------------------------------------------------------
| text | Bucket1 |
-------------------------------------------------------
{code}
The first row was read with the new Parquet reader, while the second one was
read with the old one. It shows that the new Drill Parquet reader treats the
"Bucket" column as OPTIONAL, while the old reader treats it as REQUIRED. If we
want to explain this as a feature, Drill should keep the same behavior for both
readers.
But I think we should keep the conformity between Parquet and Drill data modes,
and my fix actually does that (not fully - similar improvements should still be
done for complex types); a rough sketch of the mapping I mean is shown below.
But I agree that Drill's internal complex types should be documented better,
since there are a lot of issues connected to them. And when the documentation
is ready, we can determine the direction for improving Drill's complex types.
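A minimal sketch of that conformity, assuming the mapping is done where the
reader builds the schema (the class and method names are illustrative, not the
exact code from my patch; only the enums are the real Drill/Parquet types):
{code}
// Illustrative sketch: derive the Drill data mode directly from the Parquet
// repetition instead of defaulting every non-repeated column to OPTIONAL.
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.parquet.schema.Type.Repetition;

public class RepetitionToDataModeSketch {
  public static DataMode toDataMode(Repetition repetition) {
    switch (repetition) {
      case REQUIRED:
        return DataMode.REQUIRED;   // non-nullable column stays non-nullable
      case REPEATED:
        return DataMode.REPEATED;   // repeated column (list)
      case OPTIONAL:
      default:
        return DataMode.OPTIONAL;   // nullable column
    }
  }
}
{code}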
Since my improvement is directed at regular (non-complex) data types, I assume
that this issue can be fixed independently of improving Drill's complex data
types.
If you don't agree, this JIRA should be added to the scope of improving Drill's
internal data types, together with providing the documentation.
> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns
> instead of "REQUIRED" ones
> -------------------------------------------------------------------------------------------------------
>
> Key: DRILL-5970
> URL: https://issues.apache.org/jira/browse/DRILL-5970
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Codegen, Execution - Data Types, Storage -
> Parquet
> Affects Versions: 1.11.0
> Reporter: Vitalii Diravka
> Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (non-nullable) data types
> to the container is not implemented in any of the MapWriters.
> This can lead to an invalid schema; a simplified sketch of the problem is
> shown below, followed by a reproduction.
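> A hypothetical, simplified illustration of that root cause (the class and
> method names below are illustrative, not the actual generated writer code;
> Types.optional/Types.required are the standard Drill type helpers):
> {code}
> // Hypothetical simplification: scalar fields effectively end up in the
> // container with an OPTIONAL (nullable) major type, so a Parquet "required"
> // column loses its non-nullable mode. The expected mapping is shown next to
> // it for comparison.
> import org.apache.drill.common.types.TypeProtos.MajorType;
> import org.apache.drill.common.types.TypeProtos.MinorType;
> import org.apache.drill.common.types.Types;
>
> public class RequiredModeSketch {
>   // What effectively happens today for a VARCHAR field (illustration only):
>   static MajorType modeBuiltToday() {
>     return Types.optional(MinorType.VARCHAR);   // always nullable
>   }
>
>   // What a Parquet "required binary (UTF8)" column should map to:
>   static MajorType modeThatShouldBeBuilt() {
>     return Types.required(MinorType.VARCHAR);   // non-nullable
>   }
> }
> {code}
> Reproduction: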
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from
> (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket
> FROM (VALUES(1)));
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run from Drill unit test framework (to see "data mode"):
> {code}
> @Test
> public void test() throws Exception {
>   setColumnWidths(new int[] {25, 25});
>   List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>   printResult(queryDataBatches);
> }
> 1 row(s):
> -------------------------------------------------------
> | MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
> -------------------------------------------------------
> | ["hello","hai"] | Bucket1 |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar
> parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
> message root {
>   repeated binary MYCOL (UTF8);
>   required binary Bucket (UTF8);
> }
> {code}
> To reproduce the wrong result, run a query with aggregation so that both the
> new parquet reader (used by default for complex data types) and the old
> parquet reader are involved. A false "Hash aggregate does not support schema
> changes" error will happen.
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from
> (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket
> FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from
> (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM
> (VALUES(1)));
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> {code}
> 2) Copy the parquet file from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r-- 3 mapr mapr 415 2017-07-25 11:46
> /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r-- 3 mapr mapr 368 2017-07-25 11:46
> /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet
> /tmp/bof_repro_2/0_0_1.parquet
> [root@naravm1 ~]#
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> | ok | summary |
> +-------+------------------------------------+
> | true | planner.enable_streamagg updated. |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> | MYCOL | Bucket |
> +------------------+----------+
> | ["hello","hai"] | Bucket1 |
> | null | Bucket1 |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2
> group by bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema
> changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010]
> (state=,code=0)
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)