[ 
https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433749#comment-17433749
 ] 

Eugene Nikolaiev commented on BEAM-12730:
-----------------------------------------

This issue is related to: BEAM-2802 (Java SDK TextIO custom delimiter).

Hi, [~echauchot]!
We are adding custom delimiter support to the Python SDK's TextIO.

Could you please help us understand why self-overlapping custom delimiters 
are disallowed? I am wondering whether we should impose the same restriction 
in the Python reader.

Java code is here:
[https://github.com/apache/beam/blob/1f08d1f3ddc2e7bc7341be4b29bdafaec18de9cc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L385]

Unit tests with examples of forbidden delimiters:
https://github.com/apache/beam/blob/1f08d1f3ddc2e7bc7341be4b29bdafaec18de9cc/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOReadTest.java#L416-L417

Your commit 
[https://github.com/apache/beam/commit/1b6cde067ce78e1ce780b66e0cf1c883ce901959]
mentioned this: "Supports only separators that can not self-overlap, because 
self-overlapping separators cause ambiguous parsing."
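
To make sure we mean the same thing by "self-overlapping", here is a minimal Python sketch. It assumes the definition "some proper, non-empty prefix of the delimiter equals a suffix of it". The function name is hypothetical and this is not a port of the Java check, which may treat edge cases differently.

```python
def is_self_overlapping(delimiter: bytes) -> bool:
    """Return True if a proper, non-empty prefix of `delimiter` is also a suffix.

    Assumed definition, for discussion only; not taken from the Java SDK.
    """
    for k in range(1, len(delimiter)):
        # Compare the first k bytes with the last k bytes.
        if delimiter[:k] == delimiter[-k:]:
            return True
    return False
```

Under this definition "aba" self-overlaps (prefix "a" equals suffix "a"), while "\r\n" does not.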

Does this happen only during bundle splitting, when we look behind to 
determine whether we are at a line start 
([https://github.com/apache/beam/blob/1f08d1f3ddc2e7bc7341be4b29bdafaec18de9cc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java#L151])?
I.e., looking back only one delimiter length, we cannot be sure whether we see 
a full delimiter or whether its beginning belongs to the end of a previous 
delimiter. For delimiter "aba": {{xxxababa[bundle split here]xxx}}

If so, maybe we could scan a bit further back, to the first unambiguous byte, 
and then "fast forward" past all true delimiters? In the example above we 
could determine that the previous line ended with {{xxxaba}}, so we do not 
have a line start at the bundle split and need to search forward to the next 
delimiter after the bundle split. 

As long as this is done only once per bundle, it should not cause performance 
issues, unless there are thousands of crafted consecutive delimiters behind 
the split...
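
The look-back proposed above could be sketched roughly as follows. This is only an illustration on an in-memory byte string. It assumes the reader matches delimiters greedily from left to right, and both function names are hypothetical and not part of the Beam SDK.

```python
def naive_is_record_start(data: bytes, delimiter: bytes, split: int) -> bool:
    """Look back exactly one delimiter length (the ambiguous check)."""
    m = len(delimiter)
    return split == 0 or (split >= m and data[split - m:split] == delimiter)


def is_record_start(data: bytes, delimiter: bytes, split: int) -> bool:
    """Look back to an unambiguous delimiter, then replay forward to `split`."""
    m = len(delimiter)
    if m == 0:
        raise ValueError("delimiter must not be empty")
    if split == 0:
        return True
    if not naive_is_record_start(data, delimiter, split):
        return False
    # Walk back through the chain of overlapping occurrences. An occurrence
    # with no other occurrence starting within m - 1 bytes before it cannot
    # be overlapped by an earlier match, so it must be a true delimiter.
    start = split - m
    while True:
        prev = data.find(delimiter, max(0, start - m + 1), start + m - 1)
        if prev == -1:
            break
        start = prev
    # Replay the greedy left-to-right matching from that true delimiter.
    pos = start
    while pos + m <= split:
        if data[pos:pos + m] == delimiter:
            pos += m
            if pos == split:
                return True  # a true delimiter ends exactly at the split
        else:
            pos += 1
    return False
```

For {{xxxababaxxx}} with delimiter "aba" and the split at offset 8, the one-length look-behind reports a line start, while the replay finds that the true delimiter ended at offset 6 and so reports none. The walk-back costs time proportional to the length of the overlapping chain, which is where the crafted-input concern comes from.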

> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
>                 Key: BEAM-12730
>                 URL: https://issues.apache.org/jira/browse/BEAM-12730
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-py-common, io-py-files
>            Reporter: Daniel Oliveira
>            Assignee: Dmitrii Kuzin
>            Priority: P2
>              Labels: beginner, newbie, starter
>          Time Spent: 11h 50m
>  Remaining Estimate: 0h
>
> A common request by users is to be able to separate a text files read by 
> TextIO with delimiters other than newline. The Java SDK already supports this 
> feature.
> The current delimiter code is [located 
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
>  and defaults to newlines. This function could easily be modified to also 
> handle custom delimiters. Changing this would also necessitate changing the 
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to 
> Beam Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
