[GitHub] [beam] nikie commented on a change in pull request #15667: [BEAM-12730] Add custom delimiters to Python TextIO reads

GitBox Sat, 16 Oct 2021 17:31:37 -0700


nikie commented on a change in pull request #15667:
URL: https://github.com/apache/beam/pull/15667#discussion_r730046514




##########
File path: sdks/python/apache_beam/io/textio.py
##########
@@ -561,6 +570,7 @@ def __init__(
         skipped from each source file. Must be 0 or higher. Large number of
         skipped lines might impact performance.
       coder (~apache_beam.coders.coders.Coder): Coder used to decode each line.
+      delimiter (str or bytes): delimiter to split records

Review comment:
       @dmitriikuzinepam 
   What happens if `read_buffer` ends in the middle of a multi-byte delimiter?
   The line `current_pos = len(read_buffer.data)`, together with `next_lf = 
read_buffer.data.find(self._delimiter, current_pos)` on the next iteration, 
might skip such a delimiter, because the search resumes after the bytes that 
hold its first part.
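
A minimal sketch of the concern, assuming a simplified read loop: the helper names `find_delimiter_naive` and `find_delimiter_overlap` are hypothetical and are not the actual `textio.py` code. It only illustrates how resuming the search at the end of the previously read data can miss a delimiter that straddles two reads, and one possible way to avoid that by backing up `len(delimiter) - 1` bytes.

```python
def find_delimiter_naive(chunks, delimiter):
    """Resume each search at the end of the previously read data."""
    data = b""
    current_pos = 0
    for chunk in chunks:
        data += chunk
        pos = data.find(delimiter, current_pos)
        if pos >= 0:
            return pos
        # A delimiter whose first bytes are already in `data` is skipped.
        current_pos = len(data)
    return -1


def find_delimiter_overlap(chunks, delimiter):
    """Back up by len(delimiter) - 1 bytes before searching again."""
    data = b""
    current_pos = 0
    for chunk in chunks:
        data += chunk
        pos = data.find(delimiter, current_pos)
        if pos >= 0:
            return pos
        # Keep the tail that could hold the start of a split delimiter.
        current_pos = max(0, len(data) - len(delimiter) + 1)
    return -1


# The two-byte delimiter b"@@" is split across two reads.
chunks = [b"first@", b"@second"]
naive_result = find_delimiter_naive(chunks, b"@@")
overlap_result = find_delimiter_overlap(chunks, b"@@")
```

With these inputs the naive search never sees the delimiter, while the overlapping search finds it at offset 5.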
   
   Also, it looks like the Java SDK's TextIO behaves differently: it splits on 
both `\n` and `\r\n` only if no delimiter is provided. If one is provided, it 
looks only for the explicit delimiter, see: 
https://github.com/apache/beam/blob/52a178f0a66829bbd1d99fcaf70921a8bd9300f6/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java#L189,
 
https://github.com/apache/beam/blob/52a178f0a66829bbd1d99fcaf70921a8bd9300f6/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java#L208
   
   Actually, the default `delimiter=b'\n'` is a bit misleading, since under 
the hood the reader currently still splits on both `\n` and `\r\n`.
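
To make the difference concrete, here is a rough sketch of the semantics described in this comment for the Java SDK, assuming a hypothetical helper `split_records` (it is not part of either SDK and is modeled only on the behavior described here, not on the Java source). A `None` default means "split on `\n` and `\r\n`", while an explicit delimiter is matched literally.

```python
import re


def split_records(data, delimiter=None):
    """Split bytes into records; hypothetical helper for illustration only."""
    if delimiter is None:
        # No delimiter given: treat both "\r\n" and "\n" as record separators.
        records = re.split(rb"\r\n|\n", data)
    else:
        # Explicit delimiter: match it literally and nothing else.
        records = data.split(delimiter)
    # Drop the empty record produced by a trailing delimiter.
    if records and records[-1] == b"":
        records.pop()
    return records


data = b"a\r\nb\nc\n"
default_records = split_records(data)
explicit_records = split_records(data, b"\n")
```

Under these assumptions the default call yields `a`, `b`, `c`, while the explicit `b"\n"` call keeps the carriage return in the first record (`a\r`). A `None` default would make that distinction visible in the signature.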




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
