[jira] [Commented] (NIFI-1280) Create FilterCSVColumns Processor

Julian Hyde (JIRA) Thu, 12 May 2016 21:42:20 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282393#comment-15282393
 ]


Julian Hyde commented on NIFI-1280:
-----------------------------------

I think using Calcite as an “embedded” SQL engine makes a lot of sense. (Josh 
Wills did something similar in Scrunch.) Calcite embeds very nicely, because it 
is happy to get its metadata (e.g. table definitions) through SPIs and doesn’t 
need any installation.

This change isn't perfect, but it's a good start, so +1 from me.

I saw that [~Toivo Adams] needed to change the CSV adapter to read from a 
general source, not just a file. In CALCITE-884 (in progress) we make a very 
similar change, so that you can use the same parser (say, the CSV parser) on web 
pages, files and potentially other data sources. I hope that when CALCITE-884 is 
complete we can remove CsvEnumerator2 from NiFi, but the copied file is OK for now.
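
To illustrate what reading from "a general source, not just a file" can look like, here is a minimal sketch in plain Java. It is not the Calcite CSV adapter and not CsvEnumerator2; the class and method names are hypothetical, and it assumes simple CSV with no quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the Calcite CSV adapter and not CsvEnumerator2.
class ReaderCsvSketch {

    // Parses comma-separated rows from any Reader (a file, a stream, an
    // in-memory string), so the parser does not care where the data came from.
    // Assumption: no quoted fields or embedded commas.
    static List<String[]> readRows(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                rows.add(line.split(",", -1)); // -1 keeps trailing empty cells
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // The same method would accept a FileReader or an InputStreamReader.
        List<String[]> rows = readRows(new StringReader("id,name\n1,alice\n2,bob\n"));
        System.out.println(rows.size() + " rows, first cell: " + rows.get(0)[0]);
    }
}
```

Taking a Reader rather than a File is the whole point of the sketch: content that is already in memory can be parsed without first being written to disk.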

One question is whether we use Calcite’s adapters (for CSV, JSON etc.) or 
NiFi’s. Does NiFi have parsers for more of the basic file types? I suspect it 
does, but I don't know. If so, we should create a way for a NiFi parser to send 
data to the embedded Calcite.

But Calcite adapters have a capability that I suspect is absent in NiFi 
adapters, namely the ability to push down filters, projections and potentially 
other kinds of processing. So we need to figure out how to surface that 
capability.
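
To make "push down" concrete, here is a minimal sketch in plain Java of a scan that applies a filter and a projection while it reads, instead of returning every column of every row and leaving the rest to the caller. It deliberately does not use Calcite's own interfaces (such as ProjectableFilterableTable); the class name, the scan signature and the sample data are hypothetical, and it assumes simple CSV with no quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of filter/projection push-down; it does not use Calcite's API.
class PushDownSketch {

    // Scans comma-separated lines and applies the filter and the projection
    // inside the scan, so rejected rows and unwanted columns never reach the caller.
    // 'projects' lists the zero-based column indexes to keep, in output order.
    static List<String[]> scan(List<String> lines, Predicate<String[]> filter, int[] projects) {
        List<String[]> out = new ArrayList<>();
        for (String line : lines) {
            String[] row = line.split(",", -1);
            if (!filter.test(row)) {
                continue; // filter pushed into the scan
            }
            String[] projected = new String[projects.length];
            for (int i = 0; i < projects.length; i++) {
                projected[i] = row[projects[i]]; // projection pushed into the scan
            }
            out.add(projected);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,alice,NL", "2,bob,US", "3,carol,NL");
        // Roughly: SELECT name FROM t WHERE country = 'NL'
        List<String[]> result = scan(lines, row -> row[2].equals("NL"), new int[] {1});
        for (String[] r : result) {
            System.out.println(String.join(",", r));
        }
    }
}
```

An adapter that cannot accept the filter and the column list has to hand back full rows, and the engine then does the filtering and projecting itself. That is the capability gap the paragraph above refers to.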

> Create FilterCSVColumns Processor
> ---------------------------------
>
>                 Key: NIFI-1280
>                 URL: https://issues.apache.org/jira/browse/NIFI-1280
>             Project: Apache NiFi
>          Issue Type: Task
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Toivo Adams
>
> We should have a Processor that allows users to easily filter out specific 
> columns from CSV data. For instance, a user would configure two different 
> properties: "Columns of Interest" (a comma-separated list of column indexes) 
> and "Filtering Strategy" (Keep Only These Columns, Remove Only These Columns).
> We can do this today with ReplaceText, but it is far more difficult than it 
> would be with this Processor, as the user has to use Regular Expressions, 
> etc. with ReplaceText.
> Eventually a Custom UI could even be built that allows a user to upload a 
> Sample CSV and choose which columns from there, similar to the way that Excel 
> works when importing CSV by dragging and selecting the desired columns? That 
> would certainly be a larger undertaking and would not need to be done for an 
> initial implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
