[ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975480#comment-16975480
 ] 

Ethan Guo commented on HUDI-76:
-------------------------------

[~vinoth] [~vbalaji] [~xleesf]

Sounds good.  I updated RFC-1 based on the discussion: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC-1+%3A+CSV+Source+Support+for+Delta+Streamer]

For incremental pulls, I'm thinking of assuming that the CSV files are named 
after monotonically increasing timestamps.  The smallest unit of an 
incremental pull would then be one CSV file, and the filename of the last 
ingested CSV file can serve as the checkpoint.  The directory holding all CSV 
files to be ingested will be specified in the config.
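
To make the proposal concrete, here is a rough sketch of the checkpoint logic 
in Python.  It is not Hudi code: the function names (list_new_csv_files, 
next_batch) and the max_files parameter are hypothetical, and it assumes the 
filenames are fixed-width timestamps (e.g. 20191115103000.csv) so that plain 
string ordering matches chronological ordering.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def list_new_csv_files(source_dir: str, last_checkpoint: Optional[str]) -> List[str]:
    """Return the CSV filenames in source_dir that sort after last_checkpoint.

    Assumption: filenames are fixed-width timestamps, so lexicographic
    order equals chronological order.
    """
    names = sorted(
        p.name
        for p in Path(source_dir).iterdir()
        if p.is_file() and p.suffix == ".csv"
    )
    if last_checkpoint is None:
        # No checkpoint yet: every file is new.
        return names
    return [name for name in names if name > last_checkpoint]


def next_batch(
    source_dir: str,
    last_checkpoint: Optional[str],
    max_files: Optional[int] = None,
) -> Tuple[List[str], Optional[str]]:
    """Pick the next files to ingest and the checkpoint to store afterwards."""
    new_files = list_new_csv_files(source_dir, last_checkpoint)
    if max_files is not None:
        # Hypothetical cap on how many files one pull may take.
        new_files = new_files[:max_files]
    # The checkpoint is the name of the last file in this batch; if nothing
    # is new, the previous checkpoint is kept unchanged.
    new_checkpoint = new_files[-1] if new_files else last_checkpoint
    return new_files, new_checkpoint
```

One caveat with this scheme: a file that arrives late with a name that sorts 
before the stored checkpoint would be skipped, which is part of why the 
monotonic-naming assumption matters.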

Do you guys think this is realistic?  I'd also like to know which use cases 
we need to consider, e.g., where the CSV logs can be generated.

> CSV Source support for Hudi Delta Streamer
> ------------------------------------------
>
>                 Key: HUDI-76
>                 URL: https://issues.apache.org/jira/browse/HUDI-76
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: deltastreamer, Incremental Pull
>            Reporter: Balaji Varadarajan
>            Assignee: Ethan Guo
>            Priority: Minor
>
> DeltaStreamer does not have support to pull CSV data from sources (HDFS log 
> files/Kafka). This ticket is to provide support for CSV sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
