[
https://issues.apache.org/jira/browse/FLINK-36267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007227#comment-18007227
]
dylanhz edited comment on FLINK-36267 at 7/16/25 2:47 AM:
----------------------------------------------------------
After some research, I've found that there is no consistent industry-wide
approach to how a {{split}} function treats an empty delimiter.
For example, engines like *Hive* (which uses Java's native
{{String.split()}}), *ClickHouse*, and *Presto* do not account for
Supplementary Multilingual Plane (SMP) characters when splitting by an empty
delimiter. In contrast, *Spark* and *BigQuery* do handle SMP characters
correctly.
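To make the difference concrete, here is a minimal Java sketch (not taken from any of the engines above; the class and method names are made up for illustration). It splits the example string from the issue once per UTF-16 code unit and once per Unicode code point. An SMP character such as U+1F60A is stored as a surrogate pair, so the first approach produces two unpaired halves, which would explain the two "?" entries in the Flink output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these names exist in Flink.
final class SmpSplitDemo {

    // One token per UTF-16 code unit: a supplementary character
    // is torn into two lone surrogates.
    static List<String> splitByCodeUnit(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.valueOf(s.charAt(i)));
        }
        return out;
    }

    // One token per Unicode code point: a surrogate pair stays together.
    static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp); // 2 for SMP characters, else 1
            out.add(s.substring(i, i + len));
            i += len;
        }
        return out;
    }

    public static void main(String[] args) {
        // "123" + U+1F60A (as a surrogate pair) + U+7B11 + U+8138
        String input = "123\uD83D\uDE0A\u7B11\u8138";
        System.out.println(splitByCodeUnit(input).size());  // 7
        System.out.println(splitByCodePoint(input).size()); // 6
    }
}
```

The seven-element result matches the shape of the current Flink output, and the six-element result matches the Spark output quoted in the issue.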
Given this divergence, I would like to clarify: *What is the intended semantics
for this function in Flink?* Does the current implementation align with this
expectation?
was (Author: JIRAUSER305836):
After some research, I've found that this behavior is expected. Several
products that offer a {{split}} function, including Hive, BigQuery, ClickHouse,
and even Java's native {{String.split()}}, handle this scenario
consistently. They do not account for Supplementary Multilingual Plane (SMP)
characters when splitting by an empty string delimiter.
Given that our implementation aligns with this established industry practice, I
will be closing this issue.
> SPLIT doesn't support SMP characters if delimiter is empty
> ----------------------------------------------------------
>
> Key: FLINK-36267
> URL: https://issues.apache.org/jira/browse/FLINK-36267
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / API
> Reporter: dylanhz
> Priority: Major
>
> In Flink:
> {code:sql}
> > SELECT SPLIT('123😊笑脸', '');
> ["1", "2", "3", "?", "?", "笑", "脸"]
> > SELECT SPLIT('123😊笑脸', '😊');
> ["123", "笑脸"]
> > SELECT SPLIT('123😊笑脸', '3');
> ["12", "😊笑脸"]
> {code}
> While in Spark:
> {code:sql}
> > SELECT SPLIT('123😊笑脸', '');
> ["1", "2", "3", "😊", "笑", "脸"]
> {code}
> I think this may be a bug, but I'm not sure of the best way to solve it.
> Here are two ideas:
> # Keep the code of handling empty delimiter separate from normal cases that
> use {{BinaryStringDataUtil#splitByWholeSeparatorPreserveAllTokens()}} as it
> used to do.
> # Modify {{BinaryStringDataUtil#splitByWholeSeparatorPreserveAllTokens()}}
> to align with the SPLIT semantics, meaning that it should separate every
> character when the delimiter is empty. I haven't seen this method used
> elsewhere, so this should be practical.
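As a rough illustration of the semantics described in the second idea, the sketch below works on plain {{java.lang.String}} rather than Flink's binary string representation. The class {{SplitSketch}} and its {{split}} method are hypothetical and are not the actual {{BinaryStringDataUtil}} code. It assumes that an empty delimiter should yield one token per code point and that any other delimiter is matched as a whole, preserving empty tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not Flink's BinaryStringDataUtil.
final class SplitSketch {

    static List<String> split(String str, String delimiter) {
        List<String> tokens = new ArrayList<>();
        if (delimiter.isEmpty()) {
            // Assumed semantics: emit one token per code point,
            // so surrogate pairs are never separated.
            str.codePoints()
               .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
            return tokens;
        }
        // Whole-separator split that keeps empty tokens.
        int start = 0;
        int idx;
        while ((idx = str.indexOf(delimiter, start)) >= 0) {
            tokens.add(str.substring(start, idx));
            start = idx + delimiter.length();
        }
        tokens.add(str.substring(start));
        return tokens;
    }
}
```

With this sketch, an empty input combined with an empty delimiter returns an empty list. Whether that is the desired result for Flink is an open question that the issue does not settle.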
--
This message was sent by Atlassian Jira
(v8.20.10#820010)