[ 
https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435918#comment-17435918
 ] 

Etienne Chauchot commented on BEAM-12730:
-----------------------------------------

Hi [~EugeneNikolaiev]. It's been years since I coded that, so I had to take a 
look at the code. Yes, a bundle split can happen at any byte, so it can cut a 
multi-byte delimiter in two. That is why we have to go backward when parsing: 
to be sure we read the whole delimiter and keep the data complete.
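
To illustrate why the backward step matters, here is a minimal Python sketch. It is not the Beam implementation; {{first_record_start}} and its {{rewind}} flag are hypothetical names made up for this example. It assumes a record belongs to the bundle in which its first byte lies, and that the delimiter does not overlap itself.

```python
def first_record_start(data: bytes, delim: bytes, start: int, rewind: bool) -> int:
    """Return the offset where a bundle beginning at `start` starts reading records.

    With rewind=True the search begins one delimiter length before `start`,
    so a delimiter that the split point cut in two is still found.
    Returns -1 if no delimiter is found in the searched range.
    """
    if start == 0:
        # The first bundle always begins with a record.
        return 0
    search_from = max(0, start - len(delim)) if rewind else start
    idx = data.find(delim, search_from)
    return -1 if idx == -1 else idx + len(delim)


data = b"ab||cd||ef"
# Offset 3 falls in the middle of the first "||".
print(first_record_start(data, b"||", 3, rewind=False))  # 8: record "cd" is skipped
print(first_record_start(data, b"||", 3, rewind=True))   # 4: record "cd" is kept
```

Without the rewind, the bundle starting at offset 3 misses the delimiter that was cut and resumes at {{ef}}, while the previous bundle only owns {{ab}}, so {{cd}} would be lost.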

Regarding self-overlap, take {{||}} for example: how should we interpret 
{{abc|||xyz}} - as {{abc|}}, {{xyz}} - or as {{abc}}, {{|xyz}}? And how do we 
consistently enforce this interpretation if the runner splits the file into 
bundles differently each time?
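
The ambiguity is easy to reproduce in plain Python, independently of Beam. This sketch only uses the standard {{str.partition}} and {{str.rpartition}} methods, which pick the leftmost and the rightmost match of the delimiter respectively:

```python
data = "abc|||xyz"
delim = "||"

# Leftmost match: the delimiter is taken at offsets 3-4.
left_head, _, left_tail = data.partition(delim)
# Rightmost match: the delimiter is taken at offsets 4-5.
right_head, _, right_tail = data.rpartition(delim)

print([left_head, left_tail])    # ['abc', '|xyz']
print([right_head, right_tail])  # ['abc|', 'xyz']
```

Both results are valid readings of the same bytes, which is why the choice cannot be left to wherever the scan happens to start.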

When parsing, I cannot rewind the offset by {{(separator.size + 1)}} to allow 
only a one-byte overlap, nor can I rewind it by {{(2*separator.size)}} to 
allow maximum overlap, because that might produce duplicate records if 
{{record.size < separator.size}}. I also cannot detect the overlap and report 
that the file format is wrong, because no exception is raised: I just get 
flaky record content depending on the runner / source split point.
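
The flakiness can be sketched with a toy split-aware reader in Python. This is a simplified model written for this comment, not the Beam reader: {{read_bundle}} and {{read_with_split}} are hypothetical helpers, the rewind is one delimiter length, and a record is assumed to belong to the bundle containing its first byte.

```python
def read_bundle(data: bytes, delim: bytes, start: int, end: int) -> list:
    """Toy reader: emit the records whose first byte lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Rewind by one delimiter length so a delimiter cut by the split is seen.
        idx = data.find(delim, max(0, start - len(delim)))
        if idx == -1:
            return []
        pos = idx + len(delim)
    records = []
    while pos < end:
        nxt = data.find(delim, pos)
        if nxt == -1:
            # Last record runs to the end of the data.
            records.append(data[pos:])
            break
        records.append(data[pos:nxt])
        pos = nxt + len(delim)
    return records


def read_with_split(data: bytes, delim: bytes, split: int) -> list:
    """Read the data as two bundles, [0, split) and [split, len(data))."""
    return read_bundle(data, delim, 0, split) + read_bundle(data, delim, split, len(data))


data = b"abc|||xyz"
results = {s: read_with_split(data, b"||", s) for s in range(1, len(data))}
print(results[3])  # [b'abc', b'|xyz']
print(results[6])  # [b'abc', b'|xyz', b'xyz']
```

No error is raised in either case; the output simply changes with the split point, here with an extra {{xyz}} record when the split lands at offset 6.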

So we decided to disallow overlapping delimiters. I advise you to disallow 
overlapping delimiters in Python as well, for the same reasons and for 
consistency with the Java SDK.
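
On the Python side, a validation along these lines could reject such delimiters up front. This is only a sketch: {{is_self_overlapping}} and {{validate_delimiter}} are hypothetical helper names for illustration, not existing Beam Python API, and "self-overlapping" is taken here to mean that some proper prefix of the delimiter is also a suffix of it.

```python
def is_self_overlapping(delim: bytes) -> bool:
    """Return True if a proper, non-empty prefix of `delim` is also its suffix."""
    return any(delim[:k] == delim[-k:] for k in range(1, len(delim)))


def validate_delimiter(delim: bytes) -> bytes:
    """Reject empty and self-overlapping delimiters; return the delimiter otherwise."""
    if not delim:
        raise ValueError("delimiter must not be empty")
    if is_self_overlapping(delim):
        raise ValueError(f"delimiter {delim!r} overlaps itself")
    return delim


print(is_self_overlapping(b"||"))    # True
print(is_self_overlapping(b"\r\n"))  # False
print(is_self_overlapping(b"|"))     # False
```

Failing fast at pipeline construction time avoids the silent, split-dependent record content described above.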

> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
>                 Key: BEAM-12730
>                 URL: https://issues.apache.org/jira/browse/BEAM-12730
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-py-common, io-py-files
>            Reporter: Daniel Oliveira
>            Assignee: Dmitrii Kuzin
>            Priority: P2
>              Labels: beginner, newbie, starter
>          Time Spent: 18h 10m
>  Remaining Estimate: 0h
>
> A common request by users is to be able to separate text files read by 
> TextIO with delimiters other than newline. The Java SDK already supports this 
> feature.
> The current delimiter code is [located 
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
>  and defaults to newlines. This function could easily be modified to also 
> handle custom delimiters. Changing this would also necessitate changing the 
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to 
> Beam Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
