zeropc commented on issue #6009:
URL: https://github.com/apache/incubator-doris/issues/6009#issuecomment-864511062
> There are two problems here:
> 1. DppUtils.getHashValue() didn't handle the date type, so no bytes are added for it and the hash result is wrong;
> 2. I created a table with integer-type columns and used `Spark Load` to load data, but I couldn't reproduce it; the case is as below:
>
> ```
> table:
> CREATE TABLE `test_int_bucket` (
> `tinyint_col` tinyint(4) NULL COMMENT "",
> `smallint_col` smallint(6) NULL COMMENT "",
> `int_col` int(11) NULL COMMENT "",
> `bigint_col` bigint(20) NULL COMMENT "",
> `pv_sum` int(11) SUM NULL COMMENT ""
> ) ENGINE=OLAP
> AGGREGATE KEY(`tinyint_col`, `smallint_col`, `int_col`, `bigint_col`)
> COMMENT "OLAP"
> DISTRIBUTED BY HASH(`tinyint_col`,`smallint_col`,`int_col`,`bigint_col`) BUCKETS 3
> PROPERTIES (
> "replication_num" = "1",
> "in_memory" = "false",
> "storage_format" = "DEFAULT"
> );
>
> data:
> mysql> select * from test_int_bucket;
> +-------------+--------------+---------+------------+--------+
> | tinyint_col | smallint_col | int_col | bigint_col | pv_sum |
> +-------------+--------------+---------+------------+--------+
> | 1 | 1 | 1 | 1 | 1 |
> | 4 | 4 | 4 | 4 | 4 |
> | 2 | 2 | 2 | 2 | 2 |
> | 3 | 3 | 3 | 3 | 3 |
> +-------------+--------------+---------+------------+--------+
> 4 rows in set (0.01 sec)
>
>
> query:
> mysql> select count(1) from test_int_bucket where bigint_col=1;
> +----------+
> | count(1) |
> +----------+
> | 1 |
> +----------+
> 1 row in set (0.02 sec)
> ```
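For context, the first problem quoted above (DppUtils.getHashValue() skipping the date type) can be sketched as follows. This is a hypothetical reconstruction in plain Java, not the actual Doris code: a helper serializes each distribution-column value to bytes and feeds them into a hash, but any unhandled type (here `java.sql.Date`) falls through and contributes no bytes, so rows that differ only in the date column hash identically.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashSketch {
    // Simplified stand-in for a getHashValue()-style helper: serialize a
    // column value to bytes, then feed it into a CRC. If a type (e.g. DATE)
    // falls through, NO bytes are appended and the hash ignores that column.
    static void appendValue(CRC32 crc, Object val) {
        if (val instanceof Integer) {
            crc.update(ByteBuffer.allocate(4).putInt((Integer) val).array());
        } else if (val instanceof String) {
            crc.update(((String) val).getBytes(StandardCharsets.UTF_8));
        }
        // BUG pattern: java.sql.Date (or any other unhandled type) is
        // silently skipped, so it never influences the hash.
    }

    static long hashRow(Object... cols) {
        CRC32 crc = new CRC32();
        for (Object c : cols) {
            appendValue(crc, c);
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        long h1 = hashRow(1, java.sql.Date.valueOf("2021-06-01"));
        long h2 = hashRow(1, java.sql.Date.valueOf("2021-06-02"));
        // Different dates, same hash: both rows land in the same bucket.
        System.out.println(h1 == h2); // prints "true"
    }
}
```

Because the backend does include the date column when it hashes, the loader and the backend can disagree on which bucket such a row belongs to.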
Sorry for the wrong case. In the procedure you provided, the distribution key should be only one column, like below:
```
CREATE TABLE `test_int_bucket` (
  `int_col` int(11) NULL COMMENT "",
  `pv_sum` int(11) SUM NULL COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(`int_col`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`int_col`) BUCKETS 10
PROPERTIES (
  "replication_num" = "1",
  "in_memory" = "false",
  "storage_format" = "DEFAULT"
);
```
It is also recommended to increase the bucket number, which makes the problem easier to reproduce.
Another option is to change the query to the one below:
```
select count(1) from test_int_bucket where tinyint_col=1 and smallint_col=1 and int_col=1 and bigint_col=1;
```
The key is to make the query plan scan ONLY ONE BUCKET, the one that should contain the target data. The problem here is that the query goes to the wrong tablet, not that the data itself is wrong (i.e., if all tablets are scanned, the result is still correct).
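Why a single-bucket plan exposes the bug can be illustrated with a minimal sketch of hash-bucket routing. The modulo scheme below is the general idea behind hash distribution; the two hash values are made up for illustration and are not real Doris hashes. A row is written to the bucket `hash % bucket_num` computed by the loader, while the planner prunes to the bucket computed by the backend; if the two hashes differ, the row is invisible to the pruned plan even though a full scan still finds it.

```java
public class BucketRouting {
    // Hypothetical bucket router: hash the distribution column(s) and take
    // the result modulo the bucket count, as hash distribution generally does.
    static int bucketFor(long hash, int bucketNum) {
        return (int) Long.remainderUnsigned(hash, bucketNum);
    }

    public static void main(String[] args) {
        int bucketNum = 10;
        // Illustrative values only: suppose the backend hashes the key to
        // 0x9BC33E77 while a buggy loader computes 0x1A2B3C4E instead.
        long backendHash = 0x9BC33E77L;
        long loaderHash  = 0x1A2B3C4EL;

        int expectedBucket = bucketFor(backendHash, bucketNum); // where the planner looks
        int actualBucket   = bucketFor(loaderHash, bucketNum);  // where the row was written

        // A single-bucket scan misses the row; a plan that scans all
        // buckets (no pruning on the distribution key) still finds it.
        System.out.println(expectedBucket != actualBucket); // prints "true"
    }
}
```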