Eugene Nikolaiev created BEAM-13189:
---------------------------------------

             Summary: Add escapechar to Python TextIO reads
                 Key: BEAM-13189
                 URL: https://issues.apache.org/jira/browse/BEAM-13189
             Project: Beam
          Issue Type: New Feature
          Components: io-py-common, io-py-files
            Reporter: Eugene Nikolaiev


The existing TextIO connector can split CSV or tab-delimited files into lines, 
which is useful because it reads large files in parallel and rebalances the 
work. Each line can then be parsed separately with the {{csv}} library. This 
works only if no field contains a line delimiter; otherwise the lines are 
split incorrectly. 
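
A minimal sketch of the failure, using the sample content from the example further down. It only illustrates plain newline splitting in Python and makes no claim about how TextIO splits lines internally:

```python
# Two logical records; the first contains an escaped newline
# (a backslash followed by a line feed).
data = 'a\\\na\taa\nbb\tbb\n'

# Naive splitting treats every line feed as a record boundary,
# so the first record is cut in two.
naive_lines = data.splitlines()

print(len(naive_lines))
for line in naive_lines:
    print(repr(line))
```

The two records come back as three lines, and the first fragment ends with a dangling escape character.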

One tab-delimited dialect escapes the line and column delimiters with an 
escape character (usually a backslash) instead of quoting the columns. 
Python's {{csv}} library can parse this format using the 
[escapechar|https://docs.python.org/3/library/csv.html#csv.Dialect.escapechar] 
dialect parameter.
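
As a small illustration of that dialect parameter, the sketch below parses single records with the standard {{csv}} module. The helper name {{parse_row}} and the sample strings are made up for this example:

```python
import csv

def parse_row(line):
    # Hypothetical helper: parse one tab-delimited record in which the
    # backslash escapes delimiters and quoting is not used.
    reader = csv.reader(
        [line], delimiter='\t', escapechar='\\', quoting=csv.QUOTE_NONE)
    return next(reader)

# An escaped tab stays inside the first field instead of splitting it.
escaped_tab = parse_row('a\\\tb\tc')

# An escaped backslash becomes a literal backslash; the tab after it
# is a real column delimiter.
escaped_backslash = parse_row('a\\\\\tb')

print(escaped_tab)
print(escaped_backslash)
```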

The escapechar itself can also be escaped, which allows a literal escape 
character to appear immediately before a line delimiter.
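
One way to honor both rules is to count the escape characters directly before each newline: an odd count means the newline is escaped, an even count means it ends the record. The sketch below is only an illustration of that idea on an in-memory buffer. The function name {{split_escaped_lines}} is hypothetical, and this is not the TextIO implementation, which would also have to deal with buffering and split boundaries:

```python
def split_escaped_lines(data, escapechar=b'\\'):
    """Split bytes on b'\\n', skipping newlines that are escaped.

    A newline is escaped when it is preceded by an odd number of
    consecutive escape characters.
    """
    lines = []
    start = 0  # start of the current record
    pos = 0    # where to look for the next newline
    while True:
        nl = data.find(b'\n', pos)
        if nl == -1:
            break
        # Count consecutive escape characters right before the newline.
        count = 0
        i = nl - 1
        while i >= start and data[i:i + 1] == escapechar:
            count += 1
            i -= 1
        if count % 2 == 0:
            # Even count: every escape character is itself escaped,
            # so this newline terminates the record.
            lines.append(data[start:nl])
            start = nl + 1
        pos = nl + 1
    if start < len(data):
        # Trailing record without a final newline.
        lines.append(data[start:])
    return lines

records = split_escaped_lines(b'a\\\na\taa\nbb\tbb\n')
print(records)
```

The escape characters are left in the returned records, so each record can still be passed to {{csv.reader}} with the same {{escapechar}}.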

An example of this file format in use: [Adobe Analytics Data 
Feed|https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-spec-chars.html?lang=en]

It would be nice if the TextIO transforms {{ReadFromText}} and 
{{ReadAllFromText}} supported an {{escapechar}} parameter, as follows:

 
{code:python}
import csv
import tempfile
import apache_beam as beam

with tempfile.NamedTemporaryFile('w') as temp_file:
  # Write CSV lines with escaped line terminator
  temp_file.write('a\\\na\taa\n')
  temp_file.write('bb\tbb\n')
  temp_file.flush()

  # Read and print lines
  with beam.Pipeline() as pipeline:
    (
      pipeline
      | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
      | beam.Map(lambda x: print(repr(x)))
    )

  # Read lines, parse and print TSV rows
  with beam.Pipeline() as pipeline:
    (
      pipeline
      | beam.io.ReadFromText(file_pattern=temp_file.name, escapechar=b'\\')
      | beam.Map(
          lambda x: next(csv.reader([x], escapechar='\\', delimiter='\t')))
      | beam.Map(lambda x: print(repr(x)))
    )
{code}
This would print:
{code:java}
'a\\\na\taa'
'bb\tbb'
['a\na', 'aa']
['bb', 'bb']
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
