[ 
https://issues.apache.org/jira/browse/BEAM-12730?focusedWorklogId=669489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669489
 ]

ASF GitHub Bot logged work on BEAM-12730:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Oct/21 12:22
            Start Date: 25/Oct/21 12:22
    Worklog Time Spent: 10m 
      Work Description: nikie commented on a change in pull request #15775:
URL: https://github.com/apache/beam/pull/15775#discussion_r735544151



##########
File path: sdks/python/apache_beam/io/textio.py
##########
@@ -262,33 +267,27 @@ def _find_separator_bounds(self, file_to_read, read_buffer):
       next_lf = read_buffer.data.find(delimiter, current_pos)
 
       if next_lf >= 0:
-        if self._delimiter is None and delimiter == b'\n' \
+        if self._delimiter is None \
                 and read_buffer.data[next_lf - 1:next_lf] == b'\r':
-          # Default delimiter
+          # Default b'\n' or user defined delimiter
           # Found a '\r\n'. Accepting that as the next separator.
           return (next_lf - 1, next_lf + 1)
         else:
           # User defined delimiter
           # Found a delimiter. Accepting that as the next separator.
           return (next_lf, next_lf + delimiter_len)
 
-      elif read_buffer.data.find(delimiter[0], current_pos) >= 0:
-        # Corner case: delimiter truncated at the end of the file
-        current_delimiter_pos = read_buffer.data.find(delimiter[0], current_pos)
-
-        i = 0
-        while i < len(delimiter) and read_buffer.data[current_delimiter_pos +
-                                                      i] == delimiter[i]:
-          i += 1
-          if not self._try_to_ensure_num_bytes_in_buffer(
-              file_to_read, read_buffer, current_delimiter_pos + i + 1):
-            break
-
-        if i == delimiter_len:
-          # All bytes of delimiter found
-          return current_delimiter_pos, current_delimiter_pos + delimiter_len
-
-        current_pos += i
+      elif self._delimiter is not None:
+        # Corner case: custom delimiter is truncated at the end of the buffer.
+        next_lf = read_buffer.data.find(

Review comment:
       A unit test covering the issue fixed in the last commit would be nice.
       Alternatively, an existing test could be updated, if possible.
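
As a rough illustration of the kind of case such a test could exercise, here is a minimal, self-contained sketch in plain Python. It does not use Beam's `_TextSource` or its read buffer; `split_records` and its `buffer_size` parameter are hypothetical names introduced only for this example. The idea is to read through deliberately small buffers so that a multi-byte delimiter ends up truncated at the end of a buffer at every possible offset.

```python
import io
from typing import BinaryIO, Iterator


def split_records(stream: BinaryIO,
                  delimiter: bytes,
                  buffer_size: int = 8192) -> Iterator[bytes]:
    """Yield records separated by `delimiter`, reading in fixed-size chunks.

    Hypothetical helper for illustration only; not Beam's implementation.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    pending = b""  # Bytes read so far that are not yet part of a yielded record.
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        pending += chunk
        start = 0
        while True:
            pos = pending.find(delimiter, start)
            if pos < 0:
                # No complete delimiter yet. It may be truncated at the end
                # of the buffer, so keep the tail and read more data.
                break
            yield pending[start:pos]
            start = pos + len(delimiter)
        pending = pending[start:]
    if pending:
        # Trailing data without a closing delimiter is the last record.
        yield pending


# Vary the buffer size so the two-byte delimiter straddles a buffer
# boundary in some runs and falls inside a single buffer in others.
data = b"alpha||beta||gamma"
for size in range(1, len(data) + 2):
    records = list(split_records(io.BytesIO(data), b"||", size))
    assert records == [b"alpha", b"beta", b"gamma"], (size, records)
```

A test against the real source would presumably follow the same pattern: keep the input fixed, vary the buffer size, and assert that the records read do not change.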




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 669489)
    Time Spent: 11h 20m  (was: 11h 10m)

> Add custom delimiters to Python TextIO reads
> --------------------------------------------
>
>                 Key: BEAM-12730
>                 URL: https://issues.apache.org/jira/browse/BEAM-12730
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-py-common, io-py-files
>            Reporter: Daniel Oliveira
>            Assignee: Dmitrii Kuzin
>            Priority: P2
>              Labels: beginner, newbie, starter
>          Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> A common request from users is to be able to split text files read by 
> TextIO on delimiters other than newline. The Java SDK already supports this 
> feature.
> The current delimiter code is [located 
> here|https://github.com/apache/beam/blob/v2.31.0/sdks/python/apache_beam/io/textio.py#L236]
>  and defaults to newlines. This function could easily be modified to also 
> handle custom delimiters. Changing this would also necessitate changing the 
> API for the various TextIO.Read methods and adding documentation.
> This seems like a good starter bug for making more in-depth contributions to 
> Beam Python.
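
For readers unfamiliar with the distinction between the default and a custom delimiter, the following is a simplified sketch of the separator lookup on an in-memory `bytes` value. It is an assumption-laden illustration, not the code in `textio.py`: `separator_bounds` is a hypothetical function, and it ignores buffering entirely. With no delimiter given it splits on `\n` and treats a preceding `\r` as part of the separator; with an explicit delimiter it matches those bytes literally.

```python
from typing import Optional, Tuple


def separator_bounds(data: bytes,
                     start: int,
                     delimiter: Optional[bytes] = None
                     ) -> Optional[Tuple[int, int]]:
    """Return (begin, end) of the next separator at or after `start`.

    Hypothetical, simplified helper; returns None if no separator is found.
    """
    sep = b"\n" if delimiter is None else delimiter
    pos = data.find(sep, start)
    if pos < 0:
        return None
    # Default mode only: accept '\r\n' as a single separator.
    # For pos == 0 the slice data[-1:0] is empty, so the check is safe.
    if delimiter is None and data[pos - 1:pos] == b"\r":
        return (pos - 1, pos + 1)
    # Custom delimiter (or a bare '\n'): match the bytes literally.
    return (pos, pos + len(sep))


assert separator_bounds(b"a\r\nb\n", 0) == (1, 3)         # '\r\n' by default
assert separator_bounds(b"a\r\nb\n", 3) == (4, 5)         # bare '\n'
assert separator_bounds(b"x||y", 0, b"||") == (1, 3)      # custom delimiter
assert separator_bounds(b"a\r\nb", 0, b"\n") == (2, 3)    # explicit '\n' keeps '\r'
assert separator_bounds(b"abc", 0) is None
```

The fourth assertion shows the design choice in this sketch: an explicitly passed `b"\n"` is treated as a user-defined delimiter, so the `\r` stays in the record rather than being absorbed into the separator.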



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
