[jira] [Commented] (NIFI-1280) Create FilterCSVColumns Processor

Josh Elser (JIRA) Mon, 16 May 2016 09:44:27 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284810#comment-15284810
 ]


Josh Elser commented on NIFI-1280:
----------------------------------

{quote}
 I saw that Toivo Adams needed to change the Csv adapter to read from a general 
source not just a file. In CALCITE-884 (in progress) we make a very similar 
change, so you use the same parser (say CSV parser) on web pages, files and 
potentially other data sources. I hope that when CALCITE-884 is complete we can 
remove CsvEnumerator2 from Nifi, but the copied file is OK for now.
{quote}

I was talking to [~markap14] about this offline: I think ultimately the big 
problem for a "trivial" use of the CSV example was that the InputStream doesn't 
support some kind of a {{clone()}} operation. The {{File}} use in the 
"de-facto" example works around this simply enough. Maybe we could construct 
some sort of base-classes for Enumerators to build on top of that aren't based 
explicitly on a traditional notion of a "File" (and thus be re-used by other 
consumers)? Concretely, we could have some series of bytes; whether that's 
coming from a file on disk, over the network, or just from memory, Calcite 
doesn't really care (nor should it). We could probably make that better with 
some building blocks which would ease adoption in NiFi (as their internal 
representation is a hybrid sort of thing).
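
To make the "series of bytes" idea a bit more concrete, here is a rough sketch of what such a building block might look like. Everything in it ({{ByteSource}}, {{ByteSources}}, {{ofFile}}, {{ofBytes}}, {{readAll}}) is a made-up name for illustration and not an existing Calcite or NiFi API. The point is only that a source which can hand out a fresh {{InputStream}} on demand sidesteps the missing {{clone()}}, regardless of where the bytes live.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical abstraction: returns a new, independent stream on every call. */
interface ByteSource {
    InputStream open() throws IOException;
}

/** Hypothetical factory methods; not part of Calcite or NiFi. */
final class ByteSources {
    private ByteSources() {}

    /** Backed by a file on disk; each open() re-reads from the start. */
    static ByteSource ofFile(Path path) {
        return () -> Files.newInputStream(path);
    }

    /** Backed by memory; a defensive copy keeps the source stable. */
    static ByteSource ofBytes(byte[] data) {
        byte[] copy = data.clone();
        return () -> new ByteArrayInputStream(copy);
    }

    /** Reads the whole source as UTF-8 text, closing the stream afterwards. */
    static String readAll(ByteSource source) {
        try (InputStream in = source.open()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An Enumerator written against something like this could call {{open()}} once per scan instead of assuming a {{File}}, so the same parser would work for disk, network, or in-memory content.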

bq. One question is whether we use Calcite’s adapters (for CSV, JSON etc.) or 
Nifi’s. Does Nifi have parsers for more of the basic file types? I suspect it 
does, but I don't know. If so, we should create a way for a Nifi parser to send 
data to the embedded Calcite.

Yep, there are loads of "processors" in NiFi which can read all kinds of data 
formats. This goes back to my previous point, too :)

bq. Another possible integration with Calcite would be for Nifi to be a source 
for streaming (i.e. continuously executing) Calcite queries. Calcite wouldn't 
be embedded in Nifi, but rather, Calcite (or a streaming engine such as Flink, 
Storm, Samza, Apex, Beam) would continuously read from Nifi. These queries 
would be continuous and would therefore start with the words "select stream 
...".

Agreed, this would be cool. I'll hold my tongue for now to avoid using the 
wrong terminology for how NiFi does batching/streaming now :)

> Create FilterCSVColumns Processor
> ---------------------------------
>
>                 Key: NIFI-1280
>                 URL: https://issues.apache.org/jira/browse/NIFI-1280
>             Project: Apache NiFi
>          Issue Type: Task
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Toivo Adams
>
> We should have a Processor that allows users to easily filter out specific 
> columns from CSV data. For instance, a user would configure two different 
> properties: "Columns of Interest" (a comma-separated list of column indexes) 
> and "Filtering Strategy" (Keep Only These Columns, Remove Only These Columns).
> We can do this today with ReplaceText, but it is far more difficult than it 
> would be with this Processor, as the user has to use Regular Expressions, 
> etc. with ReplaceText.
> Eventually a Custom UI could even be built that allows a user to upload a 
> Sample CSV and choose which columns to use from there, similar to the way that Excel 
> works when importing CSV by dragging and selecting the desired columns. That 
> would certainly be a larger undertaking and would not need to be done for an 
> initial implementation.
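
For reference, the two properties in the description above could behave roughly like the sketch below. This is an assumption-laden illustration, not the processor's actual code: the class and method names ({{CsvColumnFilter}}, {{parseColumns}}, {{filterLine}}) are hypothetical, column indexes are assumed to be zero-based, and the naive comma split ignores quoted fields and embedded commas.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper illustrating the two proposed properties. */
final class CsvColumnFilter {
    /** Mirrors "Keep Only These Columns" / "Remove Only These Columns". */
    enum Strategy { KEEP_ONLY, REMOVE_ONLY }

    private CsvColumnFilter() {}

    /** Parses a "Columns of Interest" value such as "0, 2" (zero-based, assumed). */
    static Set<Integer> parseColumns(String property) {
        Set<Integer> columns = new HashSet<>();
        for (String token : property.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                columns.add(Integer.parseInt(trimmed));
            }
        }
        return columns;
    }

    /** Filters one CSV line; naive split, no support for quoted fields. */
    static String filterLine(String line, Set<Integer> columns, Strategy strategy) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            boolean listed = columns.contains(i);
            // Keep listed columns under KEEP_ONLY, unlisted ones under REMOVE_ONLY.
            if (listed == (strategy == Strategy.KEEP_ONLY)) {
                kept.add(fields[i]);
            }
        }
        return String.join(",", kept);
    }
}
```

With columns "0, 2", the line {{a,b,c,d}} would be reduced to the first and third fields under KEEP_ONLY, and to the second and fourth under REMOVE_ONLY.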



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
