[
https://issues.apache.org/jira/browse/DRILL-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792820#comment-16792820
]
benj commented on DRILL-7104:
-----------------------------
For more precision :
With the correct file "nobug"
{code:java}
hadoop jar parquet-tools-1.10.0.jar schema nobug
=>
message root {
optional binary r (UTF8);
optional binary col1 (UTF8);
{code}
With the problematic file "bug"
{code:java}
hadoop jar parquet-tools-1.10.0.jar schema bug
=>
message root {
optional binary r (UTF8);
optional int32 col1;
}
{code}
The problem seem to be on the creation of the parquet (not on the read)
> Change of data type when parquet with multiple fragment
> -------------------------------------------------------
>
> Key: DRILL-7104
> URL: https://issues.apache.org/jira/browse/DRILL-7104
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.15.0
> Reporter: benj
> Priority: Major
>
> When creating a Parquet with a column filled only with "CAST(NULL AS
> VARCHAR)", if the parquet has several fragment, the type is read like INT
> instead of VARCHAR.
> First, create +Parquet with only one fragment+ - all is fine (the type of
> "demo" is correct).
> {code:java}
> CREATE TABLE ....`bug` AS
> (SELECT CAST(NULL AS VARCHAR) AS demo
> , md5(cast(rand() AS VARCHAR) AS jam
> FROM ....`onebigfile` LIMIT 1000000));
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 10000000 |
> SELECT drilltypeof(demo) AS goodtype FROM ....`bug` LIMIT 1;
> +--------------------+
> | goodtype |
> +--------------------+
> | VARCHAR |
> {code}
> Second, create +Parquet with at least 2 fragments+ - the type of "demo"
> change to INT
> {code:java}
> CREATE TABLE ....`bug` AS
> ((SELECT CAST(NULL AS VARCHAR) AS demo
> ,md5(CAST(rand() AS VARCHAR)) AS jam
> FROM ....`onebigfile` LIMIT 1000000)
> UNION
> (SELECT CAST(NULL AS VARCHAR) AS demo
> ,md5(CAST(rand() AS VARCHAR)) AS jam
> FROM ....`onebigfile` LIMIT 1000000));
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 1_1 | 1000276 |
> | 1_0 | 999724 |
> SELECT drilltypeof(demo) AS badtype FROM ....`bug` LIMIT 1;
> +--------------------+
> | badtype |
> +--------------------+
> | INT |{code}
> The change of type is really terrible...
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)