[jira] [Created] (DRILL-6948) Defaults values not homogeneous

benj (JIRA) Mon, 07 Jan 2019 03:27:12 -0800

benj created DRILL-6948:
---------------------------

             Summary: Defaults values not homogeneous
                 Key: DRILL-6948
                 URL: https://issues.apache.org/jira/browse/DRILL-6948
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.15.0
            Reporter: benj



Several problems can occur when using Parquet file with differents columns.
{code:java}
CREATE TABLE tmp2.`mytable1` AS SELECT            1 AS myc1,
      'col3_1' AS myc3;
CREATE TABLE tmp2.`mytable2` AS SELECT            2 AS myc1, 'col2_2' AS
myc2, 'col3_2' AS myc3, 'col4_2' AS myc4;
CREATE TABLE tmp2.`mytable3` AS SELECT 0 AS myc0, 3 AS myc1, 'col2_3' AS
myc2;
{code}
1 - SELECT ... FROM tmp2.`mytable*`; can pass or fail depending of the random 
order of treatment of the files.

2 - (When passing) the default value is not stable between the files, example :

 
{code:java}
SELECT myc0, myc1, myc2, myc3, myc4 FROM tmp2.`mytable*`;
+-------+-------+---------+---------+---------+
| myc0  | myc1  |  myc2   |  myc3   |  myc4   |
+-------+-------+---------+---------+---------+
| 0     | 3     | col2_3  | null    | null    |
| 0     | 2     | col2_2  | col3_2  | col4_2  |
| 0     | 1     |         | col3_1  |         |
+-------+-------+---------+---------+---------+
{code}
Here, column myc4 only exists on mytable2, but appears in

 
 * empty string for mytable1
 * null for mytable3

The repartition of NULL and empty may change (function of the random order of 
the treatment of the file) and lead to extremely different results (function of 
WHERE part for example).

The problem is sometimes "worse"

 
{code:java}
SELECT * FROM tmp2.`mytable*`;
+-----------+-------------+-------+---------+
|   dir0    |    myc0     | myc1  |  myc2   |
+-----------+-------------+-------+---------+
| mytable3  | 0           | 3     | col2_3  |
| mytable2  | 0           | 2     | col2_2  |
| mytable1  | 1635023213  | 1     |         |
+-----------+-------------+-------+---------+{code}
In reality, only the file mytable3 has the column myc0 defined (with the value 
0), but these queries give us different non NULL values for 

 
 * mytable2 : 0 - disturbing because of the 0 of mytable3
 * mytable1 : 1635023213 - disturbing because these value can change with each 
call

 

It's very difficult to work with queries that produce random value and/or 
random NULL value and can fail randomly.

In my opinion, +with Parquet files+, if a column is unknown, this column could 
be treated but the default value should always be NULL.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (DRILL-6948) Defaults values not homogeneous

Reply via email to