[
https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
FengZhou updated SPARK-42694:
-----------------------------
Attachment: image-2023-03-07-15-59-08-818.png
> Data duplication and loss occur after executing 'insert overwrite...' in
> Spark 3.1.1
> ------------------------------------------------------------------------------------
>
> Key: SPARK-42694
> URL: https://issues.apache.org/jira/browse/SPARK-42694
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.1
> Environment: Spark 3.1.1
> Hadoop 3.2.1
> Hive 3.1.2
> Reporter: FengZhou
> Priority: Blocker
> Labels: shuffle, spark
> Attachments: image-2023-03-07-15-59-08-818.png,
> image-2023-03-07-15-59-27-665.png
>
>
> We run Spark 3.1.1 in our production environment. Occasionally, after
> executing 'insert overwrite ... select', the resulting data is inconsistent,
> with some records duplicated or lost. The issue is intermittent and seems to
> be more prevalent on large tables with tens of millions of records.
> We compared the execution plans for a correct run and a faulty run of the
> same SQL and found that they were identical. In the correct run, the amount
> of data written during the shuffle stage matched the amount read. In the
> faulty run, the amounts of data written and read during the shuffle stage
> differed. Please see the attached screenshots for the shuffle write/read
> figures of both runs.
>
> Normal SQL:
> !https://hnzycfc-collaborative.feishu.cn/space/api/box/stream/download/asynccode/?code=NjE1MzU2MDJmZjhlOTMzNDM3YjlkOTU3ZjQ0NjUzMjRfWFl6OVRnWE5wdnhsdWZtQW1hMUxzOUJuQ0tJekRlQ25fVG9rZW46Ym94Y256Nk5pVlNOVzNpN1N1Vk5DNUdLcEhoXzE2NzgxNzU0MDE6MTY3ODE3OTAwMV9WNA!
> SQL with issues:
> !https://hnzycfc-collaborative.feishu.cn/space/api/box/stream/download/asynccode/?code=YzIxODMzYTJlMTM3ZTg2ZDc1ZGZjYzlhZmFkMDJmMWNfM0lSM0FzNE9mdk8ybkJIVm9ucWMwcVJ2b2pCT0FFbzNfVG9rZW46Ym94Y25keWJYVHRjY3VjUWxwMk9mdHZjdUVoXzE2NzgxNzU0MTk6MTY3ODE3OTAxOV9WNA!
>
> Is this problem caused by a known bug in version 3.1.1, specifically
> SPARK-34534 ('New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead
> to data loss or correctness')? Or is it caused by something else? What could
> be the root cause of this problem?
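
To quantify the duplication and loss described above, one option is to compare how often each key appears in the SELECT result with how often it appears in the overwritten table. The sketch below is a minimal illustration in plain Python, not part of the original report: the function name `diff_key_counts` and the idea of using a unique key column are assumptions, and for tables with tens of millions of records the same comparison would have to be done inside Spark rather than on collected lists.

```python
from collections import Counter


def diff_key_counts(expected_keys, actual_keys):
    """Compare two collections of row keys as multisets.

    expected_keys: keys produced by the SELECT (hypothetical input).
    actual_keys:   keys found in the table after the overwrite (hypothetical input).
    Returns (lost, duplicated), each mapping key -> number of rows
    missing from, or in excess in, the actual data.
    """
    expected = Counter(expected_keys)
    actual = Counter(actual_keys)
    # Keys that appear fewer times in the table than in the SELECT result.
    lost = {k: n - actual[k] for k, n in expected.items() if actual[k] < n}
    # Keys that appear more times in the table than in the SELECT result.
    duplicated = {k: n - expected[k] for k, n in actual.items() if n > expected[k]}
    return lost, duplicated


if __name__ == "__main__":
    lost, duplicated = diff_key_counts([1, 2, 3, 3], [1, 1, 3])
    print("lost:", lost)
    print("duplicated:", duplicated)
```

If both results are empty for a run whose shuffle write and read sizes differ, the mismatch is probably not what corrupts the data; if they are non-empty, the affected keys may help narrow down which shuffle partitions were involved.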