[
https://issues.apache.org/jira/browse/BEAM-12730?focusedWorklogId=672167&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-672167
]
ASF GitHub Bot logged work on BEAM-12730:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 29/Oct/21 20:17
Start Date: 29/Oct/21 20:17
Worklog Time Spent: 10m
Work Description: nikie edited a comment on pull request #15775:
URL: https://github.com/apache/beam/pull/15775#issuecomment-955011357
> > Thanks for the snippet. The case I was concerned about is where we use a
> > default separator (evaluates to \n), but split between '\r\n', not a custom
> > separator that is also \r\n. I checked that it also works, since we just extend
> > the buffer, and then still look back 1 character when we encounter \n.
>
> Let's make a test case out of your code snippet, if there isn't one that
> covers this scenario. Thanks.
Nice catch, @tvalentyn, but it works! :)
Here is the test which succeeds:
```
def test_read_crlf_split_by_buffer(self):
  file_name, expected_data = write_data(3, eol=EOL.CRLF)
  assert len(expected_data) == 3
  self._run_read_test(
      file_name, expected_data, buffer_size=6)
```
This works because the buffer is not discarded when its end is reached:
* at the end of the 1st iteration we have `b'line0\r'` in the buffer
* at the beginning of the next iteration,
`self._try_to_ensure_num_bytes_in_buffer` extends the buffer to
`b'line0\r\nline1'`
* `b'\n'` is found as a possible next separator, and the back check picks up
the preceding `b'\r'`
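To make the steps above concrete, here is a minimal standalone sketch of the same idea. It is not the actual `textio.py` code: `read_lines` is a hypothetical helper, and the only assumption carried over from the description is that unconsumed bytes stay in the buffer, so the look-back on `b'\n'` can still see a `b'\r'` that arrived in the previous chunk.

```python
import io


def read_lines(stream, buffer_size):
    """Yield lines from a binary stream, accepting LF and CRLF line endings.

    Hypothetical helper for illustration only; not Beam's implementation.
    """
    buffer = b''
    search_from = 0  # bytes before this offset are known to contain no LF
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        # Extend the buffer; unconsumed bytes (such as a trailing CR) are kept.
        buffer += chunk
        while True:
            pos = buffer.find(b'\n', search_from)
            if pos < 0:
                search_from = len(buffer)
                break
            # Back check: look one byte behind the LF for a CR.
            end = pos - 1 if pos > 0 and buffer[pos - 1:pos] == b'\r' else pos
            yield buffer[:end]
            buffer = buffer[pos + 1:]
            search_from = 0
    if buffer:
        yield buffer  # last line without a trailing separator


data = b'line0\r\nline1\r\nline2\r\n'
# With buffer_size=6 the first chunk is b'line0\r', so CRLF is split in two.
lines = list(read_lines(io.BytesIO(data), buffer_size=6))
print(lines)
```

With `buffer_size=6`, the first read ends exactly between `\r` and `\n`, which is the scenario the test above exercises.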
**upd**: Sorry, I have just seen your other comment where you mentioned the
same.
By the way, do you think we should add support for the `b'\r'` delimiter in
default mode, as the Java SDK has? Now, or at some point later?
This would not be backward compatible, though, and would be less efficient
than the current search.
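A small illustration of the backward-compatibility concern, as a sketch only: the two helpers below are hypothetical and use `re.split` rather than the buffered search in `textio.py`. They just show how the same bytes split differently once a lone `b'\r'` also counts as a line ending.

```python
import re


def split_lf_crlf(data):
    # Default mode as described in this thread: LF, optionally preceded by CR.
    return re.split(rb'\r?\n', data)


def split_lf_crlf_cr(data):
    # Hypothetical extended mode: a lone CR also ends a line.
    # CRLF is listed first so it is consumed as a single separator.
    return re.split(rb'\r\n|\n|\r', data)


data = b'a\rb\r\nc\nd'
print(split_lf_crlf(data))     # the lone CR stays inside the first record
print(split_lf_crlf_cr(data))  # the lone CR starts a new record
```

Any existing input that contains a bare `\r` inside a record would be read as more records under the extended mode.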
Issue Time Tracking
-------------------
Worklog Id: (was: 672167)
Time Spent: 20h (was: 19h 50m)
> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
> Key: BEAM-12730
> URL: https://issues.apache.org/jira/browse/BEAM-12730
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common, io-py-files
> Reporter: Daniel Oliveira
> Assignee: Dmitrii Kuzin
> Priority: P2
> Labels: beginner, newbie, starter
> Time Spent: 20h
> Remaining Estimate: 0h
>
> A common request by users is to be able to separate text files read by
> TextIO with delimiters other than newline. The Java SDK already supports this
> feature.
> The current delimiter code is [located
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
> and defaults to newlines. This function could easily be modified to also
> handle custom delimiters. Changing this would also necessitate changing the
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to
> Beam Python.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)