[
https://issues.apache.org/jira/browse/BEAM-12730?focusedWorklogId=672080&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-672080
]
ASF GitHub Bot logged work on BEAM-12730:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 29/Oct/21 17:29
Start Date: 29/Oct/21 17:29
Worklog Time Spent: 10m
Work Description: tvalentyn commented on a change in pull request #15775:
URL: https://github.com/apache/beam/pull/15775#discussion_r739404366
##########
File path: sdks/python/apache_beam/io/textio.py
##########
@@ -348,19 +345,19 @@ def _read_record(self, file_to_read, read_buffer):
return (read_buffer.data[record_start_position_in_buffer:], -1)
if self._strip_trailing_newlines:
- # Current record should not contain the separator.
+ # Current record should not contain the delimiter.
return (
read_buffer.data[record_start_position_in_buffer:sep_bounds[0]],
sep_bounds[1] - record_start_position_in_buffer)
else:
- # Current record should contain the separator.
+ # Current record should contain the delimiter.
return (
read_buffer.data[record_start_position_in_buffer:sep_bounds[1]],
sep_bounds[1] - record_start_position_in_buffer)
@staticmethod
def _is_self_overlapping(delimiter):
- # delimiter self-overlaps if v exists such as delimiter = vu = wv
+ # A delimiter self-overlaps if it has a prefix that is also its suffix.
# with u and w non-empty
Review comment:
Drop this line (`# with u and w non-empty`); it is left over from the old comment.
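
For reference, the prefix/suffix condition described in the comment above can be checked directly. This is a minimal sketch, not necessarily the code in this pull request; the standalone function name `is_self_overlapping` is an assumption made for illustration.

```python
def is_self_overlapping(delimiter: bytes) -> bool:
    """Return True if some proper, non-empty prefix of delimiter is also its suffix."""
    # Try every proper prefix length, from 1 up to len(delimiter) - 1.
    for i in range(1, len(delimiter)):
        if delimiter[:i] == delimiter[-i:]:
            return True
    return False


if __name__ == "__main__":
    for candidate in (b"||", b"aba", b"\r\n", b"\n"):
        print(candidate, is_self_overlapping(candidate))
```

Under this definition `b'||'` and `b'aba'` self-overlap, while `b'\r\n'` and any single-byte delimiter do not.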
##########
File path: sdks/python/apache_beam/io/textio_test.py
##########
@@ -1077,7 +1077,7 @@ def test_custom_delimiter_must_not_self_overlap_ok(self):
delimiter=delimiter,
)
- def test_custom_delimiter_must_not_self_overlap_error(self):
+ def test_self_overlapping_delimiter_is_rejected(self):
"""Self-overlapping delimiter is rejected."""
Review comment:
nit: redundant docstring; the test name already says this.
##########
File path: sdks/python/apache_beam/io/textio_test.py
##########
@@ -1063,7 +1063,7 @@ def test_custom_delimiter_must_not_empty_bytes(self):
delimiter=delimiter,
)
- def test_custom_delimiter_must_not_self_overlap_ok(self):
+ def test_non_self_overlapping_delimiter_is_accepted(self):
"""Non self-overlapping delimiter is accepted."""
Review comment:
nit: redundant docstring; the test name already says this.
##########
File path: sdks/python/apache_beam/io/textio.py
##########
@@ -272,29 +272,26 @@ def _find_separator_bounds(self, file_to_read,
read_buffer):
# Using find() here is more efficient than a linear scan
# of the byte array.
- next_lf = read_buffer.data.find(delimiter, current_pos)
-
- if next_lf >= 0:
- if self._delimiter is None \
- and read_buffer.data[next_lf - 1:next_lf] == b'\r':
- # Found a '\r\n' if delimiter is not define.
- # Accepting that as the next separator.
- return (next_lf - 1, next_lf + 1)
+ next_delim = read_buffer.data.find(delimiter, current_pos)
+
+ if next_delim >= 0:
+ if (self._delimiter is None and
+ read_buffer.data[next_delim - 1:next_delim] == b'\r'):
+ # Accept both '\r\n' and '\n' as a default delimiter.
+ # Accepting that as the next delimiter.
Review comment:
Drop this line (`# Accepting that as the next delimiter.`); the comment above it already covers this.
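
The default-delimiter rule in the hunk above (accept both '\r\n' and '\n' when no custom delimiter is set) can be illustrated in isolation. This is a simplified sketch that works on a plain `bytes` value rather than Beam's read buffer; the function name `find_delimiter_bounds` and its signature are assumptions for illustration, not the method under review.

```python
from typing import Optional, Tuple


def find_delimiter_bounds(
    data: bytes, start: int, delimiter: Optional[bytes] = None
) -> Optional[Tuple[int, int]]:
    """Return (begin, end) of the next delimiter at or after start, or None."""
    needle = delimiter if delimiter is not None else b"\n"
    pos = data.find(needle, start)
    if pos < 0:
        return None
    # With no custom delimiter, treat a preceding '\r' as part of the delimiter
    # so that both '\r\n' and '\n' end a record.
    if delimiter is None and data[pos - 1:pos] == b"\r":
        return (pos - 1, pos + 1)
    return (pos, pos + len(needle))


if __name__ == "__main__":
    print(find_delimiter_bounds(b"a\r\nb", 0))
    print(find_delimiter_bounds(b"a||b", 0, b"||"))
```

When `pos` is 0 the slice `data[-1:0]` is empty, so the '\r' check is safely false at the start of the data.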
Issue Time Tracking
-------------------
Worklog Id: (was: 672080)
Time Spent: 18h 20m (was: 18h 10m)
> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
> Key: BEAM-12730
> URL: https://issues.apache.org/jira/browse/BEAM-12730
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common, io-py-files
> Reporter: Daniel Oliveira
> Assignee: Dmitrii Kuzin
> Priority: P2
> Labels: beginner, newbie, starter
> Time Spent: 18h 20m
> Remaining Estimate: 0h
>
> A common request by users is to be able to split text files read by
> TextIO on delimiters other than newline. The Java SDK already supports this
> feature.
> The current delimiter code is [located
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
> and defaults to newlines. This function could easily be modified to also
> handle custom delimiters. Changing this would also necessitate changing the
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to
> Beam Python.
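
To make the requested behaviour concrete, here is a small standard-library sketch of splitting a byte stream into records on a custom delimiter while reading in chunks. It is not Beam's implementation and does not use the TextIO API; `read_records` and its parameters are hypothetical names chosen for illustration.

```python
import io
from typing import BinaryIO, Iterator


def read_records(
    stream: BinaryIO, delimiter: bytes = b"\n", chunk_size: int = 8192
) -> Iterator[bytes]:
    """Yield records from stream, split on delimiter (delimiter is stripped)."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        start = 0
        while True:
            pos = buffer.find(delimiter, start)
            if pos < 0:
                break
            yield buffer[start:pos]
            start = pos + len(delimiter)
        # Keep the unmatched tail; it may hold a delimiter split across chunks.
        buffer = buffer[start:]
    if buffer:
        # Final record without a trailing delimiter.
        yield buffer


if __name__ == "__main__":
    data = io.BytesIO(b"a||b||c")
    print(list(read_records(data, delimiter=b"||", chunk_size=2)))
```

The tiny `chunk_size` in the usage example forces a delimiter to straddle two reads, which is the case the retained tail is meant to handle.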