[
https://issues.apache.org/jira/browse/BEAM-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anonymous updated BEAM-13189:
-----------------------------
Status: Triage Needed (was: Resolved)
> Add escapechar to Python TextIO reads
> -------------------------------------
>
> Key: BEAM-13189
> URL: https://issues.apache.org/jira/browse/BEAM-13189
> Project: Beam
> Issue Type: New Feature
> Components: io-py-common, io-py-files
> Reporter: Eugene Nikolaiev
> Assignee: Eugene Nikolaiev
> Priority: P2
> Fix For: 2.35.0
>
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> The existing TextIO connector can be used to split CSV or tab-delimited
> files into lines, taking advantage of its ability to read large files in
> parallel and rebalance the work. Each line can then be parsed separately
> with the {{csv}} library. This works only if no line delimiters occur
> inside the records; otherwise the lines are split incorrectly.
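To illustrate the failure mode described above, here is a minimal sketch in plain Python (no Beam involved). The sample data is an assumption chosen for illustration: two logical rows, the first containing a backslash-escaped newline. Splitting on every newline, as a line-oriented reader does, cuts the first row in two.

```python
# Two logical TSV rows; the first contains a backslash-escaped newline.
data = 'a\\\na\taa\nbb\tbb\n'

# Naive splitting on every newline, ignoring the escape character.
lines = data.split('\n')[:-1]  # drop the empty string after the final newline

# The first logical row ends up as two separate "lines".
for line in lines:
    print(repr(line))
```

Three lines come out instead of two, and the first one ends with a dangling escape character.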
> One tab-delimited dialect uses an escape character (usually a backslash) to
> escape the line and column delimiters instead of quoting the columns. Such
> files can be parsed with the Python {{csv}} library using the
> [escapechar|https://docs.python.org/3/library/csv.html#csv.Dialect.escapechar]
> dialect parameter.
> The escapechar itself can also be escaped, so that a literal escape
> character can appear immediately before a line delimiter.
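As a rough sketch of how the standard {{csv}} module handles this dialect, the snippet below parses a single record that has already been isolated correctly. The sample record is an assumption made up for illustration: its first column holds an escaped newline, and its second column ends with an escaped backslash.

```python
import csv

# One logical record: 'a', escaped newline, 'a', TAB, 'b', escaped backslash.
record = 'a\\\na\tb\\\\'

# QUOTE_NONE reflects a dialect that escapes delimiters instead of quoting.
parsed = next(csv.reader([record], delimiter='\t', escapechar='\\',
                         quoting=csv.QUOTE_NONE))
print(parsed)
```

The reader drops the escape characters and keeps the characters they protect, so the embedded newline and the trailing backslash survive as column data.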
> An example of this file format in use: [Adobe Analytics Data
> Feed|https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-spec-chars.html?lang=en]
> It would be nice if the TextIO transforms {{ReadFromText}} and
> {{ReadAllFromText}} supported {{escapechar}} as follows:
>
> {code:python}
> import csv
> import tempfile
>
> import apache_beam as beam
>
> with tempfile.NamedTemporaryFile('w') as temp_file:
>     # Write TSV lines; the first contains an escaped line terminator
>     temp_file.write('a\\\na\taa\n')
>     temp_file.write('bb\tbb\n')
>     temp_file.flush()
>
>     # Read and print lines
>     with beam.Pipeline() as pipeline:
>         (
>             pipeline
>             | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
>             | beam.Map(lambda x: print(repr(x)))
>         )
>
>     # Read lines, parse and print TSV rows
>     with beam.Pipeline() as pipeline:
>         (
>             pipeline
>             | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
>             | beam.Map(lambda x: next(csv.reader([x], escapechar='\\',
>                                                  delimiter='\t')))
>             | beam.Map(lambda x: print(repr(x)))
>         )
> {code}
> This would print:
> {code:java}
> 'a\\\na\taa'
> 'bb\tbb'
> ['a\na', 'aa']
> ['bb', 'bb']
> {code}
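For readers who want to see the splitting rule spelled out, here is a minimal pure-Python sketch. It is not Beam's implementation; the helper name {{split_escaped_lines}} is hypothetical. It assumes a single-byte escape character and a newline as line terminator, and treats a newline as a terminator only when it is preceded by an even number of consecutive escape characters (zero included).

```python
def split_escaped_lines(data: bytes, escapechar: bytes = b'\\') -> list:
    """Split data on b'\\n', skipping newlines that are escaped."""
    lines = []
    start = 0
    pos = data.find(b'\n')
    while pos != -1:
        # Count consecutive escape characters directly before the newline,
        # without looking past the start of the current line.
        count = 0
        while pos - count - 1 >= start and \
                data[pos - count - 1:pos - count] == escapechar:
            count += 1
        # An even count means the escape characters escape each other,
        # so the newline itself is a real line terminator.
        if count % 2 == 0:
            lines.append(data[start:pos])
            start = pos + 1
        pos = data.find(b'\n', pos + 1)
    if start < len(data):
        lines.append(data[start:])  # trailing line without a terminator
    return lines


lines = split_escaped_lines(b'a\\\na\taa\nbb\tbb\n')
escaped_escape = split_escaped_lines(b'a\\\\\nb\n')
```

With the file contents from the issue, this yields two lines, the first still carrying its escaped newline. In the second call the backslash is itself escaped, so the newline after it ends the line.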
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)