[GitHub] [beam] nikie commented on a change in pull request #15667: [BEAM-12730] Add custom delimiters to Python TextIO reads

GitBox Mon, 18 Oct 2021 02:44:49 -0700


nikie commented on a change in pull request #15667:
URL: https://github.com/apache/beam/pull/15667#discussion_r730750739




##########
File path: sdks/python/apache_beam/io/textio.py
##########
@@ -561,6 +570,7 @@ def __init__(
         skipped from each source file. Must be 0 or higher. Large number of
         skipped lines might impact performance.
       coder (~apache_beam.coders.coders.Coder): Coder used to decode each line.
+      delimiter (str or bytes): delimiter to split records

Review comment:
       @dmitriikuzinepam 
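   Here is an example that reproduces the problem.
   With `LINE_LENGTH = 8192` the pipeline prints 3, as expected.
   With `LINE_LENGTH = 8191` the pipeline unexpectedly prints 2.
   8192 is the default buffer size, defined here: 
https://github.com/apache/beam/blob/0cddc44bbc39ef51f94f5f9aaafd613355a170a2/sdks/python/apache_beam/io/textio.py#L57
   With 8191-byte lines, the `\r` of the first delimiter is byte 8192 of the file and the `\n` is byte 8193, so the delimiter straddles the boundary between the first and second buffers. It looks like a delimiter split across two buffers is not detected.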
   
   ```
   from tempfile import NamedTemporaryFile
   import apache_beam as beam
   
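   # Each line is LINE_LENGTH bytes of b'a' followed by the delimiter b'\r\n'.
   # LINE_LENGTH = 8192  # prints 3 (expected)
   LINE_LENGTH = 8191  # prints 2 (unexpected)
   LINE_COUNT = 3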
   
   with NamedTemporaryFile("wb") as temp_file:
     for _ in range(LINE_COUNT):
       temp_file.write(b'a' * LINE_LENGTH + b'\r\n')
     temp_file.flush()
     with beam.Pipeline() as pipeline:
       (
         pipeline
         | beam.io.ReadFromText(
           file_pattern=temp_file.name,
           delimiter=b'\r\n')
         | beam.combiners.Count.Globally()
         | beam.Map(print)
       )
   ```
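
To illustrate how a result of 2 could arise, here is a minimal pure-Python sketch. It is not Beam's code: `count_records`, `make_data`, and the `rescan_overlap` flag are hypothetical names made up for this illustration, and the "naive" branch is only an assumption about the kind of bug involved. The sketch consumes the data in 8192-byte chunks and, in naive mode, searches for the delimiter only in the newly read bytes, so a `\r\n` split across two chunks is never matched.

```python
BUFFER_SIZE = 8192  # same value as the default buffer size mentioned above


def make_data(line_length: int, line_count: int = 3) -> bytes:
    """Hypothetical helper: build the same file contents as the example."""
    return (b"a" * line_length + b"\r\n") * line_count


def count_records(data: bytes, delimiter: bytes, rescan_overlap: bool) -> int:
    """Hypothetical helper (not Beam code): count records read chunk by chunk.

    With rescan_overlap=False the search starts at the first byte of the new
    chunk, so a delimiter straddling two chunks is missed. With
    rescan_overlap=True the search backs up len(delimiter) - 1 bytes first.
    """
    buffer = b""
    count = 0
    for start in range(0, len(data), BUFFER_SIZE):
        new_data_at = len(buffer)  # where the new chunk begins in the buffer
        buffer += data[start:start + BUFFER_SIZE]
        if rescan_overlap:
            search_from = max(0, new_data_at - (len(delimiter) - 1))
        else:
            search_from = new_data_at
        while True:
            idx = buffer.find(delimiter, search_from)
            if idx < 0:
                break
            count += 1
            # Drop the completed record; the remainder is all new data.
            buffer = buffer[idx + len(delimiter):]
            search_from = 0
    if buffer:
        count += 1  # trailing record without a final delimiter
    return count


if __name__ == "__main__":
    for length in (8192, 8191):
        data = make_data(length)
        print(
            length,
            count_records(data, b"\r\n", rescan_overlap=False),
            count_records(data, b"\r\n", rescan_overlap=True),
        )
```

In this sketch the naive mode merges the first two 8191-byte lines into one record, while backing up by `len(delimiter) - 1` bytes before each search recovers all three. Whether the real reader fails for the same reason would need to be confirmed against the Beam source.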



