Eugene Nikolaiev created BEAM-13189:
---------------------------------------
Summary: Add escapechar to Python TextIO reads
Key: BEAM-13189
URL: https://issues.apache.org/jira/browse/BEAM-13189
Project: Beam
Issue Type: New Feature
Components: io-py-common, io-py-files
Reporter: Eugene Nikolaiev
The existing TextIO connector can be used to split CSV or tab-delimited files
into lines, because it can read large files in parallel and rebalance the
work. Each line can then be parsed separately with the {{csv}} library. This
works only if no line delimiters occur inside the lines; otherwise the lines
are split incorrectly.
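To make the failure mode concrete, here is a minimal sketch in plain Python (no Beam involved). The sample data mirrors the example further down in this issue; the variable names are illustrative only.

```python
# Two logical records. The first field of the first record contains a
# line terminator that is escaped with a backslash.
data = 'a\\\na\taa\nbb\tbb\n'

# Splitting on every line terminator ignores the escape character,
# so the first logical record is cut into two physical lines.
physical_lines = data.splitlines()

for line in physical_lines:
    print(repr(line))
```

The split yields three physical lines for two logical records, so a per-line {{csv}} parse downstream never sees the first record as a whole.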
One tab-delimited dialect escapes the line and column delimiters with an
escape character (usually a backslash) instead of quoting the columns. Such
files can be parsed with the Python {{csv}} library using the
[escapechar|https://docs.python.org/3/library/csv.html#csv.Dialect.escapechar]
dialect parameter.
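As a small illustration of that parameter, the sketch below parses single logical lines with the standard {{csv}} module. It assumes each line has already been split correctly, which is exactly the part TextIO cannot do today; the sample strings are made up for the example.

```python
import csv

# One logical line whose first field contains an escaped line terminator.
escaped_newline = 'a\\\na\taa'
# One logical line whose first field contains an escaped column delimiter.
escaped_tab = 'x\\\ty\tz'

# csv.reader accepts any iterable of strings, so each logical line is
# wrapped in a one-element list and the single resulting row is taken.
row_newline = next(csv.reader([escaped_newline], delimiter='\t', escapechar='\\'))
row_tab = next(csv.reader([escaped_tab], delimiter='\t', escapechar='\\'))

print(row_newline)  # ['a\na', 'aa']
print(row_tab)      # ['x\ty', 'z']
```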
The escapechar itself can also be escaped, which allows that character to
appear immediately before a line delimiter.
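The splitting rule described above could be sketched roughly as follows. {{split_escaped_lines}} is a hypothetical helper written for this illustration, not an existing Beam or standard-library function, and it works on an in-memory string rather than on file ranges as a real TextIO source would.

```python
import csv


def split_escaped_lines(text, escapechar='\\', terminator='\n'):
    """Hypothetical helper: split on terminators that are not escaped.

    The character right after an escapechar is always taken literally,
    so an escaped escapechar does not escape what follows it.
    """
    lines, current, escaped = [], [], False
    for ch in text:
        if escaped:
            # Previous character was an escapechar: keep this one as data.
            current.append(ch)
            escaped = False
        elif ch == escapechar:
            current.append(ch)
            escaped = True
        elif ch == terminator:
            lines.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if current:
        # Trailing data without a final terminator.
        lines.append(''.join(current))
    return lines


# Record 1 has an escaped newline. Record 3 ends with an escaped
# backslash, so the newline after it is a real line terminator.
data = 'a\\\na\taa\nbb\tbb\nc\\\\\ndd\n'

lines = split_escaped_lines(data)
rows = [
    next(csv.reader([line], delimiter='\t', escapechar='\\'))
    for line in lines
]
print(lines)
print(rows)
```

Escape sequences are kept in the split lines unchanged, so the later {{csv}} parse with the same escapechar can still interpret them.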
One example of this file format in use is the [Adobe Analytics Data
Feed|https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-spec-chars.html?lang=en].
It would be nice if the TextIO transforms {{ReadFromText}} and
{{ReadAllFromText}} supported {{escapechar}} as follows:
{code:java}
import csv
import tempfile

import apache_beam as beam

with tempfile.NamedTemporaryFile('w') as temp_file:
    # Write CSV lines with escaped line terminator
    temp_file.write('a\\\na\taa\n')
    temp_file.write('bb\tbb\n')
    temp_file.flush()

    # Read and print lines
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
            | beam.Map(lambda x: print(repr(x)))
        )

    # Read lines, parse and print TSV rows
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
            | beam.Map(lambda x: next(csv.reader([x], escapechar='\\',
                                                 delimiter='\t')))
            | beam.Map(lambda x: print(repr(x)))
        )
{code}
This would print:
{code:java}
'a\\\na\taa'
'bb\tbb'
['a\na', 'aa']
['bb', 'bb']
{code}