[
https://issues.apache.org/jira/browse/FLINK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068393#comment-17068393
]
Jun Zhang commented on FLINK-16818:
-----------------------------------
Hi [~ykt836], I have shared the execution graph at this link:
[https://blog.csdn.net/zhangjun5965/article/details/105143014]
I did not add a shuffle before writing to Hive; my idea is that Flink could
automatically optimize the SQL for users, so they do not have to add this
complexity to their SQL themselves. For my SQL, one partition holds a very
large amount of data. By default, Flink writes each Hive partition with a
single subtask, which ultimately produces a single file of about 10 GB, while
Spark produces many files of a few hundred megabytes each.
In addition, I observed that each Spark subtask reads and writes files of
roughly the same size, so I guess the reason Spark is faster is that, by
default, it automatically mitigates this kind of data skew.
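As a rough way to confirm the skew, here is a minimal diagnostic sketch (not part of the original report; it reuses the sourcetable name and the partition columns type and day from the example quoted below) that counts rows per dynamic partition key:
{code:java}
-- Hypothetical diagnostic query, assuming the sourcetable/type/day names
-- from the example below: count rows per dynamic partition key to see
-- which (type, day) combination carries most of the data.
SELECT type, day, COUNT(*) AS row_cnt
FROM sourcetable
GROUP BY type, day
ORDER BY row_cnt DESC;
{code}
If one combination dominates, the single subtask that Flink assigns to that partition by default becomes the bottleneck described above.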
> Optimize data skew when flink write data to hive dynamic partition table
> ------------------------------------------------------------------------
>
> Key: FLINK-16818
> URL: https://issues.apache.org/jira/browse/FLINK-16818
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / Hive
> Affects Versions: 1.10.0
> Reporter: Jun Zhang
> Priority: Major
> Fix For: 1.11.0
>
>
> I read data from a Hive source table through Flink SQL and write it into a
> Hive target table. The target table is a partitioned table. When one partition
> holds a particularly large amount of data, data skew occurs, resulting in a
> particularly long execution time.
> With the default configuration and the same SQL, Hive on Spark takes five
> minutes, while Flink takes about 40 minutes.
> example:
>
> {code:java}
> -- the schema of myparttable
> CREATE TABLE myparttable (
>   name string,
>   age int
> ) PARTITIONED BY (
>   type string,
>   day string
> );
>
> INSERT OVERWRITE myparttable SELECT name, age, type, day FROM sourcetable;
> {code}
>