[
https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433749#comment-17433749
]
Eugene Nikolaiev commented on BEAM-12730:
-----------------------------------------
This issue is related to: BEAM-2802 (Java SDK TextIO custom delimiter).
Hi, [~echauchot]!
We are adding custom delimiter support into Python SDK's TextIO.
Could you please help us understand why self-overlapping custom delimiters
are disallowed? I am wondering whether we should enforce the same restriction
in the Python reader.
Java code is here:
[https://github.com/apache/beam/blob/1f08d1f3ddc2e7bc7341be4b29bdafaec18de9cc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L385]
Unit tests with examples of forbidden delimiters:
https://github.com/apache/beam/blob/1f08d1f3ddc2e7bc7341be4b29bdafaec18de9cc/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOReadTest.java#L416-L417
Your commit
[https://github.com/apache/beam/commit/1b6cde067ce78e1ce780b66e0cf1c883ce901959]
mentioned this: "Supports only separators that can not self-overlap, because
self-overlapping separators cause ambiguous parsing."
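To make sure we mean the same thing by "self-overlap", here is a minimal Python sketch of the property as I understand it: a delimiter self-overlaps when some proper, non-empty prefix of it is also its suffix. The function name {{is_self_overlapping}} is hypothetical, and this is not a copy of the Java SDK's check.

```python
def is_self_overlapping(delimiter: bytes) -> bool:
    """Return True if a proper, non-empty prefix of delimiter is also its suffix.

    Hypothetical helper for illustration only; not the Java SDK implementation.
    """
    for i in range(1, len(delimiter)):
        # Compare the first i bytes with the last i bytes.
        if delimiter[:i] == delimiter[-i:]:
            return True
    return False
```

Under this definition "aba" self-overlaps (prefix "a" equals suffix "a"), while "\r\n" does not.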
Does this ambiguity arise only during bundle splitting, when we look back to
determine whether we are at a line start
([https://github.com/apache/beam/blob/1f08d1f3ddc2e7bc7341be4b29bdafaec18de9cc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java#L151])?
I.e., by looking back only one delimiter length, we cannot tell whether we see
a full delimiter or whether its beginning belongs to the end of a previous
delimiter. For example, for delimiter "aba":
"{{xxxababa[bundle split here]xxx}}"
If so, maybe we could scan a bit further backward until the first unambiguous byte
and then "fast forward" to the end of all true delimiters? In the example above
we could determine that the previous line ended with "{{xxxaba}}", so we do not
have a line start at the bundle split and need to search forward to the next
delimiter after the bundle split.
As long as this is done only once per bundle, it should not cause performance issues.
Unless there are thousands of crafted consecutive delimiters behind the split...
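A rough Python sketch of the backward-scan idea, to make the proposal concrete. It assumes the whole input is available as an in-memory {{bytes}} object and that "true" delimiters are the ones a greedy left-to-right, non-overlapping scan from the file start would find; a real reader would work on a buffered file stream instead. The function name {{is_record_start}} is hypothetical.

```python
def is_record_start(data: bytes, delimiter: bytes, offset: int) -> bool:
    """Return True if a true delimiter ends exactly at offset (or offset is 0).

    Hypothetical sketch: operates on in-memory bytes, not on a file stream.
    """
    n = len(delimiter)
    if n == 0:
        raise ValueError("delimiter must not be empty")
    if offset == 0:
        return True
    # A record can start here only if a delimiter occurrence ends at offset.
    if offset < n or data[offset - n:offset] != delimiter:
        return False

    # Scan backward through the run of mutually overlapping occurrences.
    # The first occurrence of the run has no occurrence overlapping it
    # from the left, so it must be a true delimiter (the "anchor").
    anchor = offset - n
    while True:
        lo = max(0, anchor - n + 1)
        # Latest occurrence starting in [lo, anchor - 1].
        prev = data.rfind(delimiter, lo, anchor + n - 1)
        if prev == -1:
            break
        anchor = prev

    # Fast forward: greedily consume non-overlapping delimiters from the anchor.
    pos = anchor
    while pos < offset:
        pos = data.find(delimiter, pos, offset)
        if pos == -1:
            return False
        pos += n
    return pos == offset
```

For the example above, {{is_record_start(b"xxxababaxxx", b"aba", 8)}} returns False: the anchor is the occurrence at index 3, the previous line ends with "xxxaba", and the "aba" ending at the split is not a true delimiter. The backward loop runs once per overlapping occurrence, so its cost grows only with the length of the crafted run behind the split.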
> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
> Key: BEAM-12730
> URL: https://issues.apache.org/jira/browse/BEAM-12730
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common, io-py-files
> Reporter: Daniel Oliveira
> Assignee: Dmitrii Kuzin
> Priority: P2
> Labels: beginner, newbie, starter
> Time Spent: 11h 50m
> Remaining Estimate: 0h
>
> A common request by users is to be able to split text files read by
> TextIO on delimiters other than newline. The Java SDK already supports this
> feature.
> The current delimiter code is [located
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
> and defaults to newlines. This function could easily be modified to also
> handle custom delimiters. Changing this would also necessitate changing the
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to
> Beam Python.