benj created DRILL-6948:
---------------------------
Summary: Defaults values not homogeneous
Key: DRILL-6948
URL: https://issues.apache.org/jira/browse/DRILL-6948
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.15.0
Reporter: benj
Several problems can occur when using Parquet file with differents columns.
{code:java}
CREATE TABLE tmp2.`mytable1` AS SELECT 1 AS myc1,
'col3_1' AS myc3;
CREATE TABLE tmp2.`mytable2` AS SELECT 2 AS myc1, 'col2_2' AS
myc2, 'col3_2' AS myc3, 'col4_2' AS myc4;
CREATE TABLE tmp2.`mytable3` AS SELECT 0 AS myc0, 3 AS myc1, 'col2_3' AS
myc2;
{code}
1 - SELECT ... FROM tmp2.`mytable*`; can pass or fail depending of the random
order of treatment of the files.
2 - (When passing) the default value is not stable between the files, example :
{code:java}
SELECT myc0, myc1, myc2, myc3, myc4 FROM tmp2.`mytable*`;
+-------+-------+---------+---------+---------+
| myc0 | myc1 | myc2 | myc3 | myc4 |
+-------+-------+---------+---------+---------+
| 0 | 3 | col2_3 | null | null |
| 0 | 2 | col2_2 | col3_2 | col4_2 |
| 0 | 1 | | col3_1 | |
+-------+-------+---------+---------+---------+
{code}
Here, column myc4 only exists on mytable2, but appears in
* empty string for mytable1
* null for mytable3
The repartition of NULL and empty may change (function of the random order of
the treatment of the file) and lead to extremely different results (function of
WHERE part for example).
The problem is sometimes "worse"
{code:java}
SELECT * FROM tmp2.`mytable*`;
+-----------+-------------+-------+---------+
| dir0 | myc0 | myc1 | myc2 |
+-----------+-------------+-------+---------+
| mytable3 | 0 | 3 | col2_3 |
| mytable2 | 0 | 2 | col2_2 |
| mytable1 | 1635023213 | 1 | |
+-----------+-------------+-------+---------+{code}
In reality, only the file mytable3 has the column myc0 defined (with the value
0), but these queries give us different non NULL values for
* mytable2 : 0 - disturbing because of the 0 of mytable3
* mytable1 : 1635023213 - disturbing because these value can change with each
call
It's very difficult to work with queries that produce random value and/or
random NULL value and can fail randomly.
In my opinion, +with Parquet files+, if a column is unknown, this column could
be treated but the default value should always be NULL.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)