[
https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295955#comment-16295955
]
Paul Rogers commented on DRILL-5970:
------------------------------------
In the reproduction steps, there is a bug, but it is in the JSON conversion:
{code}
CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select
CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
{code}
Note that we are trying to create an empty list. But it is a fact of JSON (and
one of Drill's biggest bug/feature confusions) that we cannot know the element
type of an empty array. The JSON {{"[]"}} is an array of what, exactly? Drill's
JSON conversion uses the {{LIST}} type for arrays, and the element type of a
{{LIST}} defaults to {{LATE}}. So, the list can be a list of nothing, which
allows expressing empty lists and null lists without knowing the type. (Not
sure the rest of Drill handles these cases.) But it is most decidedly not a
{{LIST}} of {{VARCHAR}}.
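To make the point concrete, here is a minimal sketch of element-type inference for an already-parsed JSON array. The class, the enum, and the method name are hypothetical stand-ins for illustration and are not Drill's actual reader code; the only assumption carried over from above is that an array with no elements gets {{LATE}}.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class JsonArrayTypeSketch {
    /** Hypothetical element types; the names only mirror the ones discussed above. */
    enum ElemType { LATE, VARCHAR, BIGINT }

    /**
     * Infers the element type from the first value we recognize.
     * With no elements there is nothing to infer from, so the type stays LATE.
     */
    static ElemType inferElementType(List<?> elements) {
        for (Object e : elements) {
            if (e instanceof String) {
                return ElemType.VARCHAR;
            }
            if (e instanceof Long || e instanceof Integer) {
                return ElemType.BIGINT;
            }
        }
        return ElemType.LATE;
    }

    public static void main(String[] args) {
        // Corresponds to CONVERT_FROM('["hello","hai"]','JSON')
        System.out.println(inferElementType(Arrays.asList("hello", "hai"))); // VARCHAR
        // Corresponds to CONVERT_FROM('[]','JSON')
        System.out.println(inferElementType(Collections.emptyList()));       // LATE
    }
}
```

The empty array never enters the loop, so nothing in the data can turn it into a list of {{VARCHAR}}.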
The next query illustrates this point:
{noformat}
select * from dfs.tmp.bof_repro_2;
+------------------+----------+
| MYCOL | Bucket |
+------------------+----------+
| ["hello","hai"] | Bucket1 |
| null | Bucket1 |
+------------------+----------+
{noformat}
The amazing bit is that something in Drill did know that a {{LIST}} of {{LATE}}
is compatible with a {{LIST}} of {{VARCHAR}}.
So, the above result is absolutely correct, if confusing (because we don't all
know the details of how JSON or {{LIST}} work).
And, of course, this really is a schema change, so the following is expected:
{noformat}
select bucket, count(*) from dfs.tmp.bof_repro_2 group by bucket;
Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema
changes
{noformat}
You'll see why if you inspect the schema of the second Parquet file. What is
the type of {{MYCOL}}?
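As a rough illustration of why the aggregate fails, here is a minimal sketch of a schema comparison between two batches. The class and method are hypothetical and do not reflect how Drill's hash aggregate actually detects the change. The first schema is taken from the unit test output in the issue description; the second {{MYCOL}} type is a deliberate placeholder, since the real value has to come from inspecting the second Parquet file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaChangeSketch {
    /**
     * Hypothetical check: two batches share a schema only if every
     * column name maps to the same type string in both.
     */
    static boolean isSchemaChange(Map<String, String> previous, Map<String, String> next) {
        return !previous.equals(next);
    }

    public static void main(String[] args) {
        // Column types as printed by the unit test in the issue description.
        Map<String, String> firstFile = new LinkedHashMap<>();
        firstFile.put("MYCOL", "VARCHAR(REPEATED)");
        firstFile.put("Bucket", "VARCHAR(OPTIONAL)");

        // Placeholder only: the actual MYCOL type of the second file is not stated here.
        Map<String, String> secondFile = new LinkedHashMap<>(firstFile);
        secondFile.put("MYCOL", "<type recorded for the empty list>");

        // Any difference in a column's type counts as a schema change.
        System.out.println(isSchemaChange(firstFile, secondFile)); // true
    }
}
```

An operator that requires a single schema for all of its input would have to reject the second batch whenever this check returns true.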
> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns
> instead of "REQUIRED" ones
> -------------------------------------------------------------------------------------------------------
>
> Key: DRILL-5970
> URL: https://issues.apache.org/jira/browse/DRILL-5970
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Codegen, Execution - Data Types, Storage -
> Parquet
> Affects Versions: 1.11.0
> Reporter: Vitalii Diravka
> Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (non-nullable) data types
> to the container is not implemented in all the MapWriters.
> This can lead to an invalid schema.
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from
> (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket
> FROM (VALUES(1)));
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run from the Drill unit test framework (to see the "data mode"):
> {code}
> @Test
> public void test() throws Exception {
>   setColumnWidths(new int[] {25, 25});
>   List<QueryDataBatch> queryDataBatches =
>       testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>   printResult(queryDataBatches);
> }
> 1 row(s):
> -------------------------------------------------------
> | MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
> -------------------------------------------------------
> | ["hello","hai"] | Bucket1 |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar
> parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
> message root {
> repeated binary MYCOL (UTF8);
> required binary Bucket (UTF8);
> }
> {code}
> To reproduce the wrong result, run an aggregation query over files read by
> both the new Parquet reader (used by default for complex data types) and the
> old Parquet reader. A false "Hash aggregate does not support schema changes"
> error occurs.
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from
> (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket
> FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from
> (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM
> (VALUES(1)));
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> {code}
> 2) Copy the parquet files from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r-- 3 mapr mapr 415 2017-07-25 11:46
> /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r-- 3 mapr mapr 368 2017-07-25 11:46
> /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet
> /tmp/bof_repro_2/0_0_1.parquet
> [root@naravm1 ~]#
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> | ok | summary |
> +-------+------------------------------------+
> | true | planner.enable_streamagg updated. |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> | MYCOL | Bucket |
> +------------------+----------+
> | ["hello","hai"] | Bucket1 |
> | null | Bucket1 |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2
> group by bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema
> changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010]
> (state=,code=0)
> {code}