nikie commented on a change in pull request #15667:
URL: https://github.com/apache/beam/pull/15667#discussion_r730750739
##########
File path: sdks/python/apache_beam/io/textio.py
##########
@@ -561,6 +570,7 @@ def __init__(
skipped from each source file. Must be 0 or higher. Large number of
skipped lines might impact performance.
coder (~apache_beam.coders.coders.Coder): Coder used to decode each line.
+ delimiter (str or bytes): delimiter to split records
Review comment:
@dmitriikuzinepam
Here is an example.
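With `LINE_LENGTH = 8192`, the pipeline prints 3, as expected.
With `LINE_LENGTH = 8191`, the pipeline prints 2, which is a surprise.
In the 8191 case the first `\r` is byte 8192 of the file and the `\n` is byte 8193, so the delimiter sits across an 8192-byte boundary.
8192 is the default buffer size, defined here: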
https://github.com/apache/beam/blob/0cddc44bbc39ef51f94f5f9aaafd613355a170a2/sdks/python/apache_beam/io/textio.py#L57
```
from tempfile import NamedTemporaryFile

import apache_beam as beam

# LINE_LENGTH = 8192
LINE_LENGTH = 8191
LINE_COUNT = 3

with NamedTemporaryFile("wb") as temp_file:
    # Write LINE_COUNT records, each terminated by the two-byte delimiter.
    for _ in range(LINE_COUNT):
        temp_file.write(b'a' * LINE_LENGTH + b'\r\n')
    temp_file.flush()

    # Read the file back while it still exists and count the records.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.io.ReadFromText(
                file_pattern=temp_file.name,
                delimiter=b'\r\n')
            | beam.combiners.Count.Globally()
            | beam.Map(print)
        )
```
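To illustrate the kind of bug I suspect, here is a minimal sketch in plain Python. It is not Beam's reader code: `split_naive` and `split_carry` are hypothetical helpers, and the assumption is that data is consumed in fixed 8192-byte chunks starting at offset 0. A reader that searches each chunk in isolation misses a `\r\n` whose two bytes land in different chunks, which gives the same 3 versus 2 counts as the pipeline above.

```python
BUFFER_SIZE = 8192  # same value as the default buffer size linked above


def split_naive(data, delimiter, buffer_size=BUFFER_SIZE):
    """Hypothetical buggy splitter: searches each chunk on its own."""
    records = []
    pending = b""
    for start in range(0, len(data), buffer_size):
        chunk = data[start:start + buffer_size]
        # Only the new chunk is searched, so a delimiter whose bytes
        # straddle two chunks is never found.
        parts = chunk.split(delimiter)
        parts[0] = pending + parts[0]
        records.extend(parts[:-1])
        pending = parts[-1]
    if pending:
        records.append(pending)
    return records


def split_carry(data, delimiter, buffer_size=BUFFER_SIZE):
    """Hypothetical fix: search the leftover bytes together with the new chunk."""
    records = []
    pending = b""
    for start in range(0, len(data), buffer_size):
        pending += data[start:start + buffer_size]
        parts = pending.split(delimiter)
        records.extend(parts[:-1])
        pending = parts[-1]
    if pending:
        records.append(pending)
    return records


for line_length in (8192, 8191):
    data = (b"a" * line_length + b"\r\n") * 3
    print(line_length,
          len(split_naive(data, b"\r\n")),
          len(split_carry(data, b"\r\n")))
# prints:
# 8192 3 3
# 8191 2 3
```

If the real cause is similar, a regression test with a line length of `buffer_size - 1` and a two-byte delimiter should cover it.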