GoGoWen opened a new pull request, #14148:
URL: https://github.com/apache/doris/pull/14148
# Proposed changes
Enhance broker load for Parquet and ORC files when columns of the target
table are missing from the source files.
## Problem summary
The source file my_file.orc/parquet looks like this:
+------+------+-------------+-------+------+
| name | id   | impressions | click | cost |
+------+------+-------------+-------+------+
|    1 |    1 |           2 |  NULL |    2 |
|    4 |    4 |           8 |     8 |    8 |
|    5 |    5 |          10 |    10 |   10 |
|    3 |    3 |           6 |     6 |    6 |
|    2 |    2 |           4 |     4 |    4 |
|   11 |   11 |          22 |    22 |   22 |
+------+------+-------------+-------+------+
Case 1:
Create table t1 as follows:
CREATE TABLE `t1` (
  `name` bigint(20) NOT NULL,
  `id` bigint(20) NOT NULL,
  `impressions` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user impressions',
  `click` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user clicks',
  `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user cost'
) ENGINE=OLAP
AGGREGATE KEY(`name`, `id`)
COMMENT 'OLAP'
PARTITION BY RANGE(`name`)
(PARTITION p201901 VALUES [("1"), ("100")))
DISTRIBUTED BY HASH(`id`) BUCKETS 16
PROPERTIES (
  "replication_allocation" = "tag.location.default: 3",
  "in_memory" = "false",
  "storage_format" = "V2",
  "disable_auto_compaction" = "false"
);
When we load from the source file, we get:
+------+------+-------------+-------+------+
| name | id   | impressions | click | cost |
+------+------+-------------+-------+------+
|    1 |    1 |           2 |  NULL |    2 |
|    4 |    4 |           8 |     8 |    8 |
|    5 |    5 |          10 |    10 |   10 |
|    3 |    3 |           6 |     6 |    6 |
|    2 |    2 |           4 |     4 |    4 |
|   11 |   11 |          22 |    22 |   22 |
+------+------+-------------+-------+------+
Case 2:
Create table t1 as follows (column id2 is missing from the source file):
CREATE TABLE `t1` (
  `name` bigint(20) NOT NULL,
  `id` bigint(20) NOT NULL,
  `id2` bigint(20) NULL DEFAULT "0",
  `impressions` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user impressions',
  `click` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user clicks',
  `cost` bigint(20) SUM NULL DEFAULT "0" COMMENT 'total user cost'
) ENGINE=OLAP
AGGREGATE KEY(`name`, `id`, `id2`)
COMMENT 'OLAP'
PARTITION BY RANGE(`name`)
(PARTITION p201901 VALUES [("1"), ("100")))
DISTRIBUTED BY HASH(`id`) BUCKETS 16
PROPERTIES (
  "replication_allocation" = "tag.location.default: 3",
  "in_memory" = "false",
  "storage_format" = "V2",
  "disable_auto_compaction" = "false"
);
After a broker load from my_file.orc/parquet, we get:
+------+------+------+-------------+-------+------+
| name | id   | id2  | impressions | click | cost |
+------+------+------+-------------+-------+------+
|    5 |    5 |    0 |           5 |     5 |    5 |
|    4 |    4 |    0 |           4 |     4 |    4 |
|   11 |   11 |    0 |          11 |    11 |   11 |
|    1 |    1 |    0 |           1 |  NULL |    1 |
|    2 |    2 |    0 |           2 |     2 |    2 |
|    3 |    3 |    0 |           3 |     3 |    3 |
+------+------+------+-------------+-------+------+
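The load statement itself is not shown above. A minimal sketch of a broker load that could be used for both cases, assuming a database named example_db, a label named label_t1_orc, a broker named my_broker, and an HDFS path and credentials that are all placeholders (none of these names come from this PR):

```sql
-- Hypothetical names: example_db, label_t1_orc, my_broker, the HDFS path
-- and the credentials are placeholders, not values taken from this PR.
LOAD LABEL example_db.label_t1_orc
(
    -- No column list is given, so columns are matched by name
    -- between the file and the target table.
    DATA INFILE("hdfs://namenode_host:8020/path/to/my_file.orc")
    INTO TABLE `t1`
    FORMAT AS "orc"
)
WITH BROKER "my_broker"
(
    "username" = "hdfs_user",
    "password" = "hdfs_password"
)
PROPERTIES
(
    "timeout" = "3600"
);
```

For a Parquet source, the same sketch would use `FORMAT AS "parquet"` and the corresponding file path. With the case 2 table, `id2` does not exist in the file, so the load is expected to fill it with its default value "0", as in the result above.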
....
Note:
The case where enable_new_load_scan_node=true is not covered by this PR.
## Checklist(Required)
1. Does it affect the original behavior:
- [ ] Yes
- [ ] No
- [ ] I don't know
2. Has unit tests been added:
- [ ] Yes
- [ ] No
- [ ] No Need
3. Has documentation been added or modified:
- [ ] Yes
- [ ] No
- [ ] No Need
4. Does it need to update dependencies:
- [ ] Yes
- [ ] No
5. Are there any changes that cannot be rolled back:
- [ ] Yes (If Yes, please explain WHY)
- [ ] No
## Further comments
If this is a relatively large or complex change, kick off the discussion at
[[email protected]](mailto:[email protected]) by explaining why you
chose the solution you did and what alternatives you considered, etc...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]