[ https://issues.apache.org/jira/browse/DRILL-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304001#comment-16304001 ]

Paul Rogers commented on DRILL-5970:
------------------------------------

[~vitalii], it seems we have multiple issues and it is hard to keep them 
straight.

*Missing Type Handling*

In your first example, you had a reference to an empty array. It is not clear 
how Drill defines the type of that array. (My original comment was wrong: I 
thought the JSON reader used the {{ListVector}}. It turns out it doesn't, so 
the empty list had some concrete type. But which one? {{VARCHAR}}? {{INT}}?)
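
One way to probe this, assuming {{typeof()}} resolves the concrete minor type 
(note that it reports only the type, not the data mode, so it cannot 
distinguish {{OPTIONAL}} from {{REQUIRED}}):

{code}
-- Sketch: what concrete type does Drill assign to an empty JSON list?
SELECT typeof(CONVERT_FROM('[]','JSON')) AS empty_list_type
FROM (VALUES(1));
{code}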

In your second example, you say {{select 'text' AS MYCOL, 'Bucket1' AS Bucket 
FROM (VALUES(1))}}. Again, what are the types? If this followed normal 
practice, it should be nullable {{INT}}. But your example suggests it is 
nullable {{VARCHAR}}.
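
The same probe would answer the question for the constants (again a sketch, 
assuming {{typeof()}} works as described above):

{code}
-- Sketch: what types do the constant expressions get?
SELECT typeof('text') AS mycol_type, typeof('Bucket1') AS bucket_type
FROM (VALUES(1));
{code}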

A long-term suggestion is that we document our rules. Without such 
documentation, all we can say is that the answer is whatever the code happens 
to do, and without knowing the answers, it is hard to follow the remainder of 
the discussion above.

This topic, in turn, is tied to the idea that Drill is schema-free: we 
discover types on the fly, types may change, and types may be missing. As 
Drill reaches maturity, the current behavior becomes the de facto design.

*Parquet Optional/Required Type Handling*

We cited two equally valid arguments for Parquet types: 1) do what Parquet 
says, or 2) always use {{OPTIONAL}} to allow for schema migration over time 
(which means reading a set of files in which some contain the column and some 
do not).
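
Concretely, for the {{Bucket}} column in the repro below, the file declares:

{code}
message root {
  repeated binary MYCOL (UTF8);
  required binary Bucket (UTF8);
}
{code}

Under option 1, {{Bucket}} surfaces as {{VARCHAR(REQUIRED)}}; under option 2, 
it surfaces as {{VARCHAR(OPTIONAL)}}, so that a second file lacking the column 
can still be scanned in the same query, with nulls filled in.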

Drill cannot decide: there is no one right answer. The rule is 
application-specific. If the data is a dump from a mature Oracle DB, say, then 
the set of columns won't change. If the data is from a new ETL project, the 
columns may change as the user tweaks the ETL rules.

The point is, Drill does not have the information to make the decision; the 
information is known only to the application. Since Drill has no schema, the 
application cannot communicate that information to Drill. So, Drill must make 
an arbitrary decision.

Your point is that we have two readers: one does it one way, one the other. 
But which is right? I think we should ask Pritesh and the PM what customers 
want.

*Suggestion*

If project leadership does not have a firm opinion, perhaps we should just do 
what our most important reader does. This would be the reader that Salim is 
enhancing. (The "new" reader?) You mentioned that it uses {{OPTIONAL}}.

Do we "own" the code for the "old" reader, or is that one we inherited from the 
Parquet project? If it is from Parquet, then perhaps we should stick with what 
it does to be consistent with the Parquet project.

In this case, the "new" reader offers additional, non-standard, Drill-specific 
extensions (such as Drill's date handling, mapping to vectors, filling in 
"missing" columns, and so on), one of which is the use of {{OPTIONAL}} for 
Parquet required types.
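
To confirm which reader produces which data mode, something like the following 
might work, assuming the {{store.parquet.use_new_reader}} session option still 
toggles between the two implementations:

{code}
-- Sketch: does the schema-change failure below track the reader choice?
ALTER SESSION SET `store.parquet.use_new_reader` = true;
SELECT bucket, COUNT(*) FROM dfs.tmp.bof_repro_2 GROUP BY bucket;

ALTER SESSION SET `store.parquet.use_new_reader` = false;
SELECT bucket, COUNT(*) FROM dfs.tmp.bof_repro_2 GROUP BY bucket;
{code}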

In short, without a spec, there really is no right answer. [~sachouche], what 
do you suggest?

> DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns 
> instead of "REQUIRED" ones
> -------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5970
>                 URL: https://issues.apache.org/jira/browse/DRILL-5970
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen, Execution - Data Types, Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>
> The root cause of the issue is that adding REQUIRED (non-nullable) data 
> types to the container is not implemented in any of the MapWriters.
> This can lead to an invalid schema.
> {code}
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from 
> (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket 
> FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (2.376 seconds)
> {code}
> Run from the Drill unit test framework (to see the "data mode"):
> {code}
> @Test
> public void test() throws Exception {
>   setColumnWidths(new int[] {25, 25});
>   List<QueryDataBatch> queryDataBatches =
>       testSqlWithResults("select * from dfs.tmp.bof_repro_1");
>   printResult(queryDataBatches);
> }
> 1 row(s):
> -------------------------------------------------------
> | MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
> -------------------------------------------------------
> | ["hello","hai"]          | Bucket1                  |
> -------------------------------------------------------
> Total record count: 1
> {code}
> {code}
> vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet
> message root {
>   repeated binary MYCOL (UTF8);
>   required binary Bucket (UTF8);
> }
> {code}
> To reproduce the wrong result, run a query with aggregation using the new 
> Parquet reader (used by default for complex data types) and the old Parquet 
> reader. A spurious "Hash aggregate does not support schema changes" error 
> will occur.
> 1) Create two parquet files.
> {code}
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from 
> (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket 
> FROM (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (1.122 seconds)
> 0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from 
> (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM 
> (VALUES(1)));
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> 1 row selected (0.552 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> {code}
> 2) Copy the parquet files from bof_repro_1 to bof_repro_2.
> {code}
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
> Found 1 items
> -rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
> Found 1 items
> -rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
> [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
> [root@naravm1 ~]#
> {code}
> 3) Query the table.
> {code}
> 0: jdbc:drill:schema=dfs> ALTER SESSION SET  `planner.enable_streamagg`=false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_streamagg updated.  |
> +-------+------------------------------------+
> 1 row selected (0.124 seconds)
> 0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
> +------------------+----------+
> |      MYCOL       |  Bucket  |
> +------------------+----------+
> | ["hello","hai"]  | Bucket1  |
> | null             | Bucket1  |
> +------------------+----------+
> 2 rows selected (0.247 seconds)
> 0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 
> group by bucket;
> Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema 
> changes
> Fragment 0:0
> [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] 
> (state=,code=0)
> {code}


