damccorm opened a new issue, #20312:
URL: https://github.com/apache/beam/issues/20312

   Apache Beam has a TextIO class which reads text-based files line by line, 
delimited by a carriage return, a newline, or a carriage return followed by a 
newline. This approach does not support CSV files whose records span multiple 
lines, because a field enclosed in double quotes may itself contain a 
newline.
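
   To make the failure mode concrete, here is a minimal sketch in plain Java (no Beam or CSV library involved). The class and method names are hypothetical and only illustrate the idea: a line-oriented reader treats every line terminator as a record boundary, while a CSV-aware reader has to ignore terminators that fall inside double quotes. It assumes RFC 4180 style quoting, where an embedded quote is written as `""`.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplitter {
    /** Line-oriented split: every \r\n, \r, or \n ends a record. */
    static List<String> splitLines(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\r\n|\r|\n")) {
            out.add(line);
        }
        return out;
    }

    /** Quote-aware split: terminators inside double quotes stay in the record. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, so the state is unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (!inQuotes && (c == '\n' || c == '\r')) {
                // Treat \r\n as a single terminator.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"first line\nsecond line\"\n2,plain\n";
        System.out.println("line-oriented: " + splitLines(csv).size() + " pieces");
        System.out.println("quote-aware:   " + splitRecords(csv).size() + " records");
    }
}
```

   With the sample input, the line-oriented split cuts the second record in two, whereas the quote-aware split keeps it whole.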
   
   This Stack Overflow question is relevant to a feature that should be added 
to Apache Beam: 
[https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam](https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam)
   
   I can think of two libraries we could use for handling CSV files. The first 
is the Apache Commons CSV library. Here is some example code that uses its 
CSVRecord class for reading and writing CSV records:
   
   `PipelineOptions options = PipelineOptionsFactory.create();`
   `Pipeline pipeline = Pipeline.create(options);`
   `PCollection<CSVRecord> records = pipeline.apply("ReadCSV", CSVIO.read().from("input.csv"));`
   `records.apply("WriteCSV", CSVIO.write().to("output.csv"));`
   
   Another library we could use is Jackson CSV, which allows users to specify 
schemas for the columns: 
[https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv](https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv)
   
   The crux of the problem is this: can we read and write large CSV files in 
parallel, by splitting the records and distributing them to many workers? If 
so, would it be good to have a feature where Apache Beam supports 
reading/writing CSV files?
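
   One reason the parallel case is hard can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not a proposed Beam implementation, and the class and method names are hypothetical. Whether a newline is a safe place to start a new chunk depends on the quote state carried over from everything before it, so the sketch below finds the safe offsets with a sequential scan from the start of the input.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPoints {
    /** Offsets just past every '\n', ignoring quoting (what a line reader would use). */
    static List<Integer> newlineOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                offsets.add(i + 1);
            }
        }
        return offsets;
    }

    /** Offsets just past each '\n' that lies outside double quotes. */
    static List<Integer> recordStartOffsets(String text) {
        List<Integer> starts = new ArrayList<>();
        boolean inQuotes = false; // state depends on all quotes seen so far
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                starts.add(i + 1);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String csv = "a,\"x\ny\"\nb,z\n";
        System.out.println("newline offsets:      " + newlineOffsets(csv));
        System.out.println("record start offsets: " + recordStartOffsets(csv));
    }
}
```

   For the sample input, the first newline offset falls inside the quoted field, so a worker that started reading there would begin in the middle of a record. A worker that seeks to an arbitrary byte offset does not know the quote state at that point, which is the part a splittable CSV source would have to solve.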
   
   Imported from Jira 
[BEAM-10030](https://issues.apache.org/jira/browse/BEAM-10030). Original Jira 
may contain additional context.
   Reported by: auroranil.


