[ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964456#comment-16964456
 ] 

Ethan Guo commented on HUDI-76:
-------------------------------

After some exploration, here are my initial thoughts on how to implement this 
feature.
 * Source type: add a new source type `CSV` in SourceType
 * Create `CSVSource`, `CSVDFSSource`, and `CSVKafkaSource` classes to fetch 
new data
 ** Internally, the classes need to convert CSV text to Avro and Row 
format.  Given that the conversion from Row to Avro is expensive, the design 
choice is to implement the conversion from CSV to Avro directly (the Avro to 
Row conversion already exists in Hudi).
 ** For the conversion from CSV to Avro, I've looked at the following libraries
 *** avro-tools: supports Avro to CSV conversion, not the reverse
 *** Spark: Spark can read CSV files into DataFrames / sets of rows.  It uses 
the [Univocity 
parser|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala]
 to parse CSV text and construct InternalRow.  
[univocity-parsers|https://www.univocity.com/pages/univocity_parsers_tutorial.html]
 is a collection of extremely fast and reliable Java-based parsers for CSV, TSV 
and Fixed Width files.  We can reuse part of Spark's parsing logic to construct 
Avro records from CSV text.
 * For the CSV parsing options, we can provide the same semantics as Spark 
for consistency.  A set of new Hudi CSV options will be added.  
These CSV parsing options can be passed to the Univocity parser directly.
 * Bridge the gap in `SourceFormatAdapter` for the new CSV SourceType
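
To make the proposed CSV-to-record step concrete, here is a minimal, dependency-free sketch. It is not the Hudi implementation: the hand-written splitter stands in for the Univocity parser, the `Map` stands in for an Avro record, and `FieldType` stands in for an Avro schema. Every name in it (`CsvRecordSketch`, `CsvOptions`, `FieldType`, `splitLine`, `convert`, `toRecord`) is hypothetical, and the three options only loosely mirror Spark's `sep`, `quote` and `nullValue`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not part of Hudi, Spark, or Univocity. */
final class CsvRecordSketch {

    /** Hypothetical stand-in for the primitive types of an Avro schema. */
    enum FieldType { STRING, INT, LONG, DOUBLE, BOOLEAN }

    /** Hypothetical parsing options (delimiter, quote character, null marker). */
    static final class CsvOptions {
        final char delimiter;
        final char quote;
        final String nullValue;

        CsvOptions(char delimiter, char quote, String nullValue) {
            this.delimiter = delimiter;
            this.quote = quote;
            this.nullValue = nullValue;
        }
    }

    /** Splits one CSV line into raw fields, honoring quotes and doubled-quote escapes. */
    static List<String> splitLine(String line, CsvOptions opts) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != opts.quote) {
                    current.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == opts.quote) {
                    current.append(opts.quote); // doubled quote is an escaped quote
                    i++;
                } else {
                    inQuotes = false;           // closing quote
                }
            } else if (c == opts.quote) {
                inQuotes = true;
            } else if (c == opts.delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());         // last field has no trailing delimiter
        return fields;
    }

    /** Converts one raw field to the Java type matching the declared field type. */
    static Object convert(String raw, FieldType type, CsvOptions opts) {
        if (raw.equals(opts.nullValue)) {
            return null;
        }
        switch (type) {
            case INT:     return Integer.parseInt(raw.trim());
            case LONG:    return Long.parseLong(raw.trim());
            case DOUBLE:  return Double.parseDouble(raw.trim());
            case BOOLEAN: return Boolean.parseBoolean(raw.trim());
            default:      return raw;
        }
    }

    /** Builds an ordered name-to-value record from one CSV line. */
    static Map<String, Object> toRecord(String line, String[] names,
                                        FieldType[] types, CsvOptions opts) {
        List<String> raw = splitLine(line, opts);
        if (names.length != types.length || raw.size() != names.length) {
            throw new IllegalArgumentException(
                "Expected " + names.length + " fields but found " + raw.size());
        }
        Map<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            record.put(names[i], convert(raw.get(i), types[i], opts));
        }
        return record;
    }

    public static void main(String[] args) {
        CsvOptions opts = new CsvOptions(',', '"', "");
        String[] names = {"id", "name", "score"};
        FieldType[] types = {FieldType.LONG, FieldType.STRING, FieldType.DOUBLE};
        System.out.println(toRecord("1,\"Smith, John\",", names, types, opts));
    }
}
```

In an actual source class, the line splitting would presumably be delegated to Univocity and the typed values written into an Avro record built from the target schema. Note that with an empty `nullValue`, this sketch also treats a quoted empty string as null, which a real option set would need to distinguish.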

> CSV Source support for Hudi Delta Streamer
> ------------------------------------------
>
>                 Key: HUDI-76
>                 URL: https://issues.apache.org/jira/browse/HUDI-76
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: deltastreamer, Incremental Pull
>            Reporter: Balaji Varadarajan
>            Assignee: Ethan Guo
>            Priority: Minor
>
> DeltaStreamer does not have support to pull CSV data from sources (HDFS log 
> files/Kafka). This ticket is to provide support for CSV sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
