[ https://issues.apache.org/jira/browse/BEAM-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436140#comment-17436140 ]
Eugene Nikolaiev edited comment on BEAM-12730 at 10/29/21, 7:13 PM:
--------------------------------------------------------------------
Thanks for the detailed answer, [~echauchot]!
We implemented the self-overlap check in the Python SDK the same way as in Java,
both because rewinding would be a bit tricky and for feature parity between the SDKs.
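As a rough illustration of what such a check looks for, here is a minimal Python sketch. The function name {{is_self_overlapping}} is hypothetical and this is not the actual Beam code; it assumes that "self-overlapping" means some proper, non-empty prefix of the delimiter is also its suffix.

```python
def is_self_overlapping(delimiter: bytes) -> bool:
    """Hypothetical helper, not Beam's implementation.

    Returns True if some proper, non-empty prefix of the delimiter
    is also its suffix (e.g. b"||" or b"abab").
    """
    return any(
        delimiter[:i] == delimiter[-i:]
        for i in range(1, len(delimiter))
    )


if __name__ == "__main__":
    for candidate in (b"||", b"\r\n", b"abab", b"|"):
        print(candidate, is_self_overlapping(candidate))
```

With a check like this, a delimiter such as {{||}} can be rejected up front, so the reader never has to resolve ambiguous matches at bundle boundaries.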
In your example, I believe the split would be {{abc}}, {{|xyz}} (i.e. the
first delimiter found from the left wins). We would just need to rewind further
to the left of the bundle start. We are already rewinding, but only by one
delimiter length; here is the Java code:
https://github.com/apache/beam/blob/2e448dee58f1ee60551cc47b9aa7df6bc832734a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java#L153-L155
And then, to avoid record duplication, we would need to fast-forward as close as
possible to the bundle start while staying "inside delimiters only". After that
there are two cases: (1) if we land exactly on the bundle start, it is a line
start and should be included; (2) if we stop earlier (i.e. there is a dangling
{{|}} right before the bundle start), then the bundle start belongs to the
previous bundle's line and we should search for the next delimiter.
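The rewind and fast-forward idea could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not code from either SDK: the whole file is held in memory as {{bytes}}, the function name {{first_record_start}} is hypothetical, and a bundle is assumed to own every record whose first byte lies at or after its start offset.

```python
from typing import Optional


def first_record_start(
    data: bytes, delimiter: bytes, bundle_start: int
) -> Optional[int]:
    """Hypothetical sketch: offset of the first record that starts at or
    after bundle_start, or None if no such record exists."""
    if bundle_start == 0:
        return 0  # the file start always begins a record
    n = len(delimiter)

    # Step 1: rewind. Keep moving left while some delimiter occurrence
    # touches or straddles the current position, so that scanning restarts
    # at an offset where no match can be in progress.
    pos = bundle_start
    while True:
        hits = [
            q for q in range(max(0, pos - n), pos)
            if data.startswith(delimiter, q)
        ]
        if not hits:
            break
        pos = hits[0]

    # Step 2: fast-forward. Consume matches left to right (the leftmost
    # match wins) until one ends at or after the bundle start.
    while True:
        idx = data.find(delimiter, pos)
        if idx == -1:
            return None  # no further delimiter, so no record starts here
        pos = idx + n
        if pos >= bundle_start:
            return pos


if __name__ == "__main__":
    data = b"abc|||xyz"
    for start in (0, 4, 5, 6):
        print(start, first_record_start(data, b"||", start))
```

For {{abc|||xyz}} with delimiter {{||}}, a bundle starting at offset 5 lands exactly on the bundle start (case 1), while a bundle starting at offset 6 finds a dangling {{|}} before it and no later delimiter (case 2), so it emits nothing and {{|xyz}} stays with the previous bundle.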
But in the extreme case of a file consisting solely of millions of delimiters
({{|||...}}), each bundle would rewind all the way to the file start.
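To make the cost concrete, here is a small Python sketch of just the rewind step. The helper name {{rewind_start}} is hypothetical, and the rule it uses (keep moving left while any delimiter occurrence touches or straddles the current offset) is an assumption about how such a rewind might work, not existing Beam behaviour.

```python
def rewind_start(data: bytes, delimiter: bytes, bundle_start: int) -> int:
    """Hypothetical sketch: offset from which a bundle would have to
    start scanning to see delimiters the same way a full scan would."""
    pos = bundle_start
    n = len(delimiter)
    while True:
        # Delimiter occurrences that begin before pos and reach at least pos.
        hits = [
            q for q in range(max(0, pos - n), pos)
            if data.startswith(delimiter, q)
        ]
        if not hits:
            return pos
        pos = hits[0]


if __name__ == "__main__":
    only_delimiters = b"|" * 10_000
    ordinary = b"abc||" * 2_000
    for start in (2_500, 5_000, 7_500):
        print(
            start,
            rewind_start(only_delimiters, b"||", start),
            rewind_start(ordinary, b"||", start),
        )
```

On the delimiter-only input every bundle start rewinds to offset 0, whereas on the ordinary input the rewind stops after a single delimiter.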
Anyway, this sounds complicated, and it is :)
> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
> Key: BEAM-12730
> URL: https://issues.apache.org/jira/browse/BEAM-12730
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common, io-py-files
> Reporter: Daniel Oliveira
> Assignee: Dmitrii Kuzin
> Priority: P2
> Labels: beginner, newbie, starter
> Time Spent: 19.5h
> Remaining Estimate: 0h
>
> A common request by users is to be able to separate text files read by
> TextIO with delimiters other than newline. The Java SDK already supports this
> feature.
> The current delimiter code is [located
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
> and defaults to newlines. This function could easily be modified to also
> handle custom delimiters. Changing this would also necessitate changing the
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to
> Beam Python.