[ 
https://issues.apache.org/jira/browse/BEAM-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anonymous updated BEAM-13189:
-----------------------------
    Status: Triage Needed  (was: Resolved)

> Add escapechar to Python TextIO reads
> -------------------------------------
>
>                 Key: BEAM-13189
>                 URL: https://issues.apache.org/jira/browse/BEAM-13189
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-py-common, io-py-files
>            Reporter: Eugene Nikolaiev
>            Assignee: Eugene Nikolaiev
>            Priority: P2
>             Fix For: 2.35.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> The existing TextIO connector can be used to split CSV or tab-delimited 
> files into lines, since it can read large files in parallel and rebalance 
> the work. Each line can then be parsed separately with the {{csv}} library. 
> This works only if no line delimiters occur inside the records; otherwise 
> the lines are split incorrectly. 
> One tab-delimited dialect uses an escape character (usually a backslash) 
> to escape line and column delimiters instead of quoting the columns. Such 
> files can be parsed with the Python {{csv}} library using the 
> [escapechar|https://docs.python.org/3/library/csv.html#csv.Dialect.escapechar]
>  dialect parameter.
> The escapechar itself can also be escaped, which allows a literal escape 
> character to appear immediately before a line delimiter.
> An example of this file format in use: [Adobe Analytics Data 
> Feed|https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-spec-chars.html?lang=en]
> It would be nice if the TextIO transforms {{ReadFromText}} and 
> {{ReadAllFromText}} supported {{escapechar}} as follows:
>  
> {code:python}
> import csv
> import tempfile
> import apache_beam as beam
> with tempfile.NamedTemporaryFile('w') as temp_file:
>   # Write CSV lines with escaped line terminator
>   temp_file.write('a\\\na\taa\n')
>   temp_file.write('bb\tbb\n')
>   temp_file.flush()
>   # Read and print lines
>   with beam.Pipeline() as pipeline:
>     (
>       pipeline
>       | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
>       | beam.Map(lambda x: print(repr(x)))
>     )
>   # Read lines, parse and print TSV rows
>   with beam.Pipeline() as pipeline:
>     (
>       pipeline
>       | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
>       | beam.Map(lambda x: next(
>           csv.reader([x], escapechar='\\', delimiter='\t')))
>       | beam.Map(lambda x: print(repr(x)))
>     )
> {code}
> This would print:
> {code}
> 'a\\\na\taa'
> 'bb\tbb'
> ['a\na', 'aa']
> ['bb', 'bb']
> {code}
>  
>  
>  
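
The splitting rule described above can be sketched in plain Python, without Beam. This is only an illustration of the requested semantics, not Beam's implementation: `split_escaped_lines` is a hypothetical helper, it works on an in-memory byte string, and it ignores the harder problem of starting a read at an arbitrary offset in a file that is being read in parallel.

```python
import csv
from typing import List


def split_escaped_lines(data: bytes, escapechar: bytes = b"\\") -> List[bytes]:
    """Split data on b'\\n', ignoring newlines preceded by an active escapechar.

    Hypothetical helper for illustration; not part of apache_beam.
    """
    if len(escapechar) != 1:
        raise ValueError("escapechar must be a single byte")
    esc = escapechar[0]
    newline = ord("\n")
    records = []
    start = 0
    escaped = False  # True when the previous byte was an active escape character
    for i, byte in enumerate(data):
        if escaped:
            # This byte is escaped, so it can never terminate a record.
            # An escaped escapechar lands here too, which deactivates it.
            escaped = False
        elif byte == esc:
            escaped = True
        elif byte == newline:
            records.append(data[start:i])
            start = i + 1
    if start < len(data):
        # Keep a final record that has no trailing newline.
        records.append(data[start:])
    return records


# Same file contents as in the issue description.
raw = b"a\\\na\taa\nbb\tbb\n"
records = split_escaped_lines(raw)
rows = [
    next(csv.reader([record.decode("utf-8")], escapechar="\\", delimiter="\t"))
    for record in records
]
print(records)
print(rows)
```

Tracking a single `escaped` flag, rather than looking back one byte, is what lets an escaped escapechar (two backslashes) be followed by a real line delimiter.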



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
