[
https://issues.apache.org/jira/browse/NIFI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pierre Villard resolved NIFI-1716.
----------------------------------
Resolution: Duplicate
Fix Version/s: 1.2.0
> Implement a SplitCsv processor, possibly also a GetCSV
> ------------------------------------------------------
>
> Key: NIFI-1716
> URL: https://issues.apache.org/jira/browse/NIFI-1716
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Dmitry Goldenberg
> Fix For: 1.2.0
>
>
> I'm proposing a SplitCSV processor dedicated to splitting CSV content,
> which is assumed to be the content of each incoming FlowFile.
> It appears that the current way to split a CSV file is to use the
> SplitText processor. However, it'd be great to have a CSV splitter that
> reads CSV records one by one and converts each record into a FlowFile,
> with attributes named after the column names in the header row.
> Whether or not the first row is a header should be a boolean configuration
> option. In the absence of a header row, some sensible default column names
> should be used. One convention could be column1, column2, column3, etc.
> (or a naming strategy could be provided by the user in the configuration).
> Another option on the splitter needs to be the delimiter character
> (defaulting to comma).
> Empty lines shall be skipped from processing.
> Extracted cell values shall be (optionally) whitespace-trimmed.
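
The options described so far (header flag, default column names, delimiter, empty-line skipping, optional trimming) can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not NiFi code: the function name `split_csv` and its parameters are hypothetical, and each yielded dict stands in for the attributes of one output FlowFile. Rows with a different cell count than the header are not handled here.

```python
import csv
import io


def split_csv(content, has_header=True, delimiter=",", trim=True):
    """Yield one {column name: cell value} dict per CSV record."""
    # newline="" leaves line endings untouched so the csv module can
    # handle them itself.
    reader = csv.reader(io.StringIO(content, newline=""), delimiter=delimiter)
    names = None
    for row in reader:
        if not row:
            continue  # csv.reader yields [] for an empty line
        cells = [cell.strip() for cell in row] if trim else row
        if names is None:
            if has_header:
                names = cells  # first non-empty row supplies the names
                continue
            # No header row: fall back to column1, column2, ...
            names = [f"column{i}" for i in range(1, len(cells) + 1)]
        # zip() stops at the shorter sequence, so surplus cells are
        # silently ignored in this sketch.
        yield dict(zip(names, cells))
```

Calling `list(split_csv(text, has_header=False, delimiter=";"))` on headerless, semicolon-separated input would produce dicts keyed by `column1`, `column2`, and so on.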
> Jagged rows must have some sensible handling:
> 1) For a given row, if there are fewer cells than in the header row, cells
> shall be assigned to columns left to right, and any missing cells shall be
> considered empty.
> 2) For a given row, if there are more cells than in the header row, a
> (non-fatal) error shall be generated for the row and the row shall be dropped
> from processing.
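
The two jagged-row rules above could look roughly like this in Python. The helper name `align_row` is hypothetical, and returning `None` is just one possible way to signal that the caller should report a non-fatal error and drop the row.

```python
def align_row(header, cells):
    """Map cells to header names, or return None for an over-long row."""
    if len(cells) > len(header):
        # Rule 2: more cells than headers. The caller is expected to
        # report a non-fatal error and drop the row.
        return None
    # Rule 1: fewer cells than headers. Assign left to right and treat
    # the missing trailing cells as empty.
    padded = list(cells) + [""] * (len(header) - len(cells))
    return dict(zip(header, padded))
```

For a header of `["a", "b", "c"]`, a row `["1"]` is padded on the right with empty strings, while a row with four cells is rejected.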
> As is typical with CSV, delimiter characters inside quoted values are not
> treated as delimiters.
> Elements may span multiple lines by having embedded carriage returns; such
> elements must be quoted.
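
As a point of comparison, Python's standard `csv` module already follows these quoting conventions, which makes the expected behavior easy to demonstrate. The sample content below is made up for illustration.

```python
import csv
import io

# Hypothetical sample: one value contains the delimiter, another spans
# two lines. Both are quoted.
content = 'id,comment\r\n1,"Hello, world"\r\n2,"line one\r\nline two"\r\n'

# newline="" keeps the embedded line break intact for the csv module.
rows = list(csv.reader(io.StringIO(content, newline="")))

for row in rows:
    print(row)
```

The quoted comma stays inside a single cell, and the multi-line value is read as one record rather than two.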
> NIFI-1280 asks for a way to specify which columns are to be kept or skipped.
> I'm proposing that instead of a separate processor, this would be implemented
> as a configuration option on SplitCSV (a list of 0-based indices of columns
> that are to be kept).
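
A minimal sketch of the column-selection option, assuming the configuration is a list of 0-based indices as proposed. The name `keep_columns` is hypothetical, and skipping indices that fall beyond the end of a short row is an assumption, not something the proposal specifies.

```python
def keep_columns(cells, keep):
    """Return only the cells at the given 0-based indices, in that order."""
    # Indices past the end of a short row are skipped (an assumption).
    return [cells[i] for i in keep if 0 <= i < len(cells)]
```

With `keep = [0, 2]`, the row `["a", "b", "c"]` is reduced to `["a", "c"]`.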
> It may also make sense to expose a GetCSV ingress component which would share
> most of its functionality with SplitCSV. Perhaps it's easiest if users just
> follow a GetFile with SplitCSV. However, in some cases it makes sense to
> avoid reading the whole file into FlowFile content and instead process all
> CSV data in place, within a GetCSV.
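
The in-place idea behind GetCSV amounts to streaming records straight from the file instead of loading its full content first. A rough Python sketch, with a hypothetical function name `read_csv_in_place` and an assumed UTF-8 encoding:

```python
import csv


def read_csv_in_place(path, delimiter=","):
    """Lazily yield non-empty rows from a CSV file on disk."""
    # The file is read incrementally, so only the current record needs
    # to be held in memory. UTF-8 is an assumption of this sketch.
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if row:
                yield row
```

Because this is a generator, the file stays open only while rows are being consumed.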