[ https://issues.apache.org/jira/browse/BEAM-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saurabh Joshi updated BEAM-10030:
---------------------------------
    Description: 
Apache Beam has TextIO, which can read text-based files line by line, delimited 
by a carriage return, a newline, or a carriage return followed by a newline. 
This approach does not support CSV files whose records span multiple lines, 
because a double-quoted field may contain a newline.
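To make the failure mode concrete, here is a minimal JDK-only sketch (no Beam involved; the class name is just for illustration) showing that a record boundary is only a newline that falls *outside* double quotes, which is why naive line splitting miscounts records:

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRecordSplitter {

    // Splits raw CSV text into logical records, treating a newline as a
    // record boundary only when it occurs outside a double-quoted field.
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, which is harmless here
                // because the two quote characters are adjacent.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // One logical record whose second field contains an embedded newline.
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain";
        // Quote-aware splitting yields 3 records...
        System.out.println(CsvRecordSplitter.splitRecords(csv).size()); // 3
        // ...while naive newline splitting yields 4 physical lines.
        System.out.println(csv.split("\n").length); // 4
    }
}
```

This is exactly the state (quote parity across physical lines) that a line-oriented reader like TextIO does not track.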

This Stack Overflow question is relevant for a feature that should be added to 
Apache Beam: 
[https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam]

I can think of two libraries we could use for handling CSV files. The first is 
the Apache Commons CSV library. Here is some example code that uses its 
CSVRecord class for reading and writing CSV records:

{code:java}
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);
PCollection<CSVRecord> records = pipeline.apply("ReadCSV", CSVIO.read().from("input.csv"));
records.apply("WriteCSV", CSVIO.write().to("output.csv"));
{code}
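Note that CSVIO and its read()/write() transforms above are only proposed names; they do not exist in Beam today. For the parsing step itself, Apache Commons CSV already handles quoted fields that span lines. A minimal sketch, assuming the commons-csv artifact is on the classpath:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CommonsCsvDemo {

    // Parses CSV text into records; Commons CSV keeps a quoted field with an
    // embedded newline inside a single CSVRecord.
    static List<CSVRecord> parse(String csv) {
        try (CSVParser parser = CSVFormat.DEFAULT.parse(new StringReader(csv))) {
            return parser.getRecords();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two logical records across three physical lines.
        String csv = "1,\"first line\nsecond line\"\n2,plain\n";
        List<CSVRecord> records = parse(csv);
        System.out.println(records.size());        // 2 records, not 3 lines
        System.out.println(records.get(0).get(1)); // field with embedded newline
    }
}
```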

Another library we could use is Jackson CSV, which allows users to specify 
schemas for the columns: 
[https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv]
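A comparable sketch with Jackson CSV, assuming the jackson-dataformat-csv artifact is on the classpath (the column names here are just for illustration):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class JacksonCsvDemo {

    // Reads CSV rows as maps keyed by an explicit column schema; quoted
    // fields with embedded newlines stay within a single row.
    static List<Map<String, String>> readRows(String csv) {
        try {
            CsvMapper mapper = new CsvMapper();
            CsvSchema schema = CsvSchema.builder()
                    .addColumn("id")
                    .addColumn("comment")
                    .build();
            MappingIterator<Map<String, String>> it = mapper
                    .readerFor(Map.class)
                    .with(schema)
                    .readValues(csv);
            return it.readAll();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One logical row whose "comment" field spans two physical lines.
        List<Map<String, String>> rows = readRows("1,\"first line\nsecond line\"\n");
        System.out.println(rows.size());              // 1
        System.out.println(rows.get(0).get("comment")); // contains the newline
    }
}
```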

The crux of the problem is this: can we read and write large CSV files in 
parallel? If so, would it be good to have a feature where Apache Beam supports 
reading/writing CSV files?



> Add CSVIO for Java SDK
> ----------------------
>
>                 Key: BEAM-10030
>                 URL: https://issues.apache.org/jira/browse/BEAM-10030
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas
>            Reporter: Saurabh Joshi
>            Priority: P2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
