[
https://issues.apache.org/jira/browse/DRILL-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581151#comment-17581151
]
James Turton commented on DRILL-8231:
-------------------------------------
Still broken in Drill master. The COL6408 expression SUM(CAST(val11 as
BIGINT)+CAST(val12 as BIGINT)) can be replaced with a simplification such as
MAX(val11) or MAX(val12) and the bug still reproduces, so it looks like the
problem arises when either of these two varchar columns participates in the key
used for the hash exchange.
{code:java}
text
00-00    Screen
00-01      Project(COL6408=[$0], COL4452=[$1])
00-02        StreamAgg(group=[{}], COL6408=[MAX($1)], COL4452=[COUNT($0)])
00-03          UnionExchange
01-01            HashAgg(group=[{1}], COL6408=[MAX($0)])
01-02              Project(val11=[$0], val2=[$1])
01-03                HashToRandomExchange(dist0=[[$0]])
02-01                  UnorderedMuxExchange
03-01                    Project(val11=[$0], val2=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011:BIGINT)])
03-02                      Scan(table=[[dfs, tmp, /8231/data/*/log_15872_R_79_*.parquet]], groupscan=[ParquetGroupScan [entries=[
    ReadEntryWithPath [path=/tmp/8231/data/02/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/10/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/05/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/07/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/09/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/03/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/04/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/08/log_15872_R_79_2022051819502000.parquet],
    ReadEntryWithPath [path=/tmp/8231/data/06/log_15872_R_79_2022051819502000.parquet]],
    selectionRoot=file:/tmp/8231/data, numFiles=9, numRowGroups=9, usedMetadataFile=false, usedMetastore=false, columns=[`val11`, `val2`]]]){code}
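One possible reading of the plan above, offered as an assumption rather than a confirmed diagnosis: the HashToRandomExchange distributes rows on $0 (val11) while the HashAgg above it groups on field 1 (val2), so the same val2 value could reach several fragments and be counted once per fragment by the final COUNT. The Python sketch below only simulates that effect. The function name, the sample rows, and the use of zlib.crc32 in place of Drill's hash function are all hypothetical.

```python
import zlib

def count_distinct_two_phase(rows, dist_col, group_col, fragments=4):
    # Phase 1 (hash exchange): route each row to a fragment by hashing dist_col.
    # zlib.crc32 is only a deterministic stand-in for Drill's hash function.
    parts = [[] for _ in range(fragments)]
    for row in rows:
        parts[zlib.crc32(row[dist_col].encode()) % fragments].append(row)
    # Phase 2 (per-fragment aggregate): one group per distinct group_col value.
    # Phase 3 (final aggregate): count the groups received from all fragments.
    return sum(len({row[group_col] for row in part}) for part in parts)

# Hypothetical data: 100 rows sharing one val2 value, each with its own val11.
rows = [{"val11": str(i), "val2": "a"} for i in range(100)]

# Distributing on the grouping key keeps each val2 value in a single fragment.
by_group_key = count_distinct_two_phase(rows, dist_col="val2", group_col="val2")
# Distributing on another column spreads one val2 value over several fragments.
by_other_col = count_distinct_two_phase(rows, dist_col="val11", group_col="val2")
print(by_group_key, by_other_col)
```

In this simulation, distributing on val2 gives the true distinct count of 1, while distributing on val11 gives a larger number. Whether Drill's planner actually behaves this way for DRILL-8231 would still need to be confirmed against the planner code.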
> Wrong result in the COUNT function position.
> --------------------------------------------
>
> Key: DRILL-8231
> URL: https://issues.apache.org/jira/browse/DRILL-8231
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.18.0, 1.19.0
> Reporter: manabu nagamine
> Priority: Major
> Attachments: drill.zip
>
>
> Hi Team.
> We are using Drill 1.18.
> The COUNT value of COL4452 differs between the results of the following two
> queries.
> The only difference between them is that the positions of COL4452 and COL6408
> are swapped.
> {code:java}
> 1.
> select COUNT(DISTINCT val2) COL4452,
>        SUM(CAST(val11 as BIGINT)+CAST(val12 as BIGINT)) COL6408
> from dfs.root.`/drill/data/*/log_15872_R_79_*.parquet`
> WHERE 1 = 1 and ( ( dir0 between '01' and '10' ) )
>   and ( LOG_DATE >= '2022-04-01 00:00:00.000000'
>         and LOG_DATE <= '2022-04-30 23:59:59.000000');
> 2.
> select SUM(CAST(val11 as BIGINT)+CAST(val12 as BIGINT)) COL6408,
>        COUNT(DISTINCT val2) COL4452
> from dfs.root.`/drill/data/*/log_15872_R_79_*.parquet`
> WHERE 1 = 1 and ( ( dir0 between '01' and '10' ) )
>   and ( LOG_DATE >= '2022-04-01 00:00:00.000000'
>         and LOG_DATE <= '2022-04-30 23:59:59.000000');{code}
> Judging from the actual data, the count returned by query 1, where COL4452
> comes first, is correct.
> I am having trouble understanding the cause of this behavior.
> Can anybody help me? Thanks in advance.
> The Parquet log file is attached.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)