[ 
https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436140#comment-17436140
 ] 

Eugene Nikolaiev edited comment on BEAM-12730 at 10/29/21, 7:13 PM:
--------------------------------------------------------------------

Thanks for the detailed answer, [~echauchot]!

We have implemented a self-overlap check for the Python SDK the same way as in 
Java, because rewinding would be a bit tricky, and for feature parity between 
the SDKs.

In your example, I believe the split would be {{abc}}, {{|xyz}} (i.e. the 
first delimiter found from the left wins). We just need to rewind further to 
the left of the bundle start. We are already rewinding, but only by one 
delimiter length; here is the Java code:
https://github.com/apache/beam/blob/2e448dee58f1ee60551cc47b9aa7df6bc832734a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java#L153-L155

And then, to avoid record duplication, we would need to fast forward as close 
as possible to the bundle start while staying "inside delimiters only". After 
that there are two cases: (1) if we are at the bundle start, it is a line 
start and should be included; (2) if we ended earlier (i.e. we have a dangling 
{{|}} right before the bundle start), then the bundle start belongs to the 
previous bundle's line and we should search for the next delimiter.

But in the extreme case of a file consisting of nothing but millions of 
delimiters, {{|||...}}, each bundle would rewind all the way to the start of 
the file.
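
To make the rewind and fast-forward idea concrete, here is a rough Python 
sketch. It is not the Beam implementation: the names 
{{find_safe_rewind_point}} and {{first_record_start}} are made up for this 
illustration, and it works on an in-memory {{bytes}} object instead of 
buffered file reads. It assumes the rule above, that the first delimiter 
found from the left wins.

```python
from typing import Optional


def find_safe_rewind_point(data: bytes, delimiter: bytes, bundle_start: int) -> int:
    """Rewind left from bundle_start to an offset that no delimiter
    occurrence ends at or straddles (hypothetical helper).

    Such an offset is either the file start or in the middle of a record,
    so a left-to-right scan from it finds the same delimiters as a scan
    from the beginning of the file.
    """
    n = len(delimiter)
    pos = bundle_start
    while pos > 0:
        # Any occurrence starting in [pos - n, pos) ends at pos or covers it.
        touched = any(
            data.startswith(delimiter, s) for s in range(max(0, pos - n), pos)
        )
        if not touched:
            break
        pos -= 1
    return pos


def first_record_start(data: bytes, delimiter: bytes, bundle_start: int) -> Optional[int]:
    """Offset of the first record starting at or after bundle_start,
    or None if no delimiter ends at or after it (hypothetical helper)."""
    if bundle_start == 0:
        return 0
    pos = find_safe_rewind_point(data, delimiter, bundle_start)
    while True:
        match = data.find(delimiter, pos)  # first delimiter from the left wins
        if match == -1:
            return None
        pos = match + len(delimiter)
        if pos >= bundle_start:
            # Case (1): pos == bundle_start, the bundle start is a line start.
            # Case (2): we skipped past bundle_start, so the bundle start
            # belonged to the previous bundle's line.
            return pos


data = b"abc|||xyz"
print(first_record_start(data, b"||", 5))  # bundle starts right after "abc||"
print(first_record_start(data, b"||", 6))  # bundle starts after the dangling "|"
print(find_safe_rewind_point(b"|" * 10, b"||", 7))  # delimiters-only file
```

With the delimiter {{||}}, a bundle starting at offset 5 owns {{|xyz}}, while 
a bundle starting at offset 6 finds no line start at all, because that line 
already belongs to the previous bundle. The last call shows the extreme case: 
the rewind goes all the way to offset 0.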

Anyway, this sounds complicated, and it is :)


> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
>                 Key: BEAM-12730
>                 URL: https://issues.apache.org/jira/browse/BEAM-12730
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-py-common, io-py-files
>            Reporter: Daniel Oliveira
>            Assignee: Dmitrii Kuzin
>            Priority: P2
>              Labels: beginner, newbie, starter
>          Time Spent: 19.5h
>  Remaining Estimate: 0h
>
> A common request by users is to be able to split text files read by TextIO 
> on delimiters other than newline. The Java SDK already supports this 
> feature.
> The current delimiter code is [located 
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
>  and defaults to newlines. This function could easily be modified to also 
> handle custom delimiters. Changing this would also necessitate changing the 
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to 
> Beam Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
