[jira] [Commented] (NIFI-1280) Create FilterCSVColumns Processor

ASF GitHub Bot (JIRA) Wed, 11 May 2016 10:32:42 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280488#comment-15280488
 ]


ASF GitHub Bot commented on NIFI-1280:
--------------------------------------

Github user markap14 commented on the pull request:

    https://github.com/apache/nifi/pull/420#issuecomment-218531630
  
    @ToivoAdams After looking at this a bit more, I've got a few very 
high-level thoughts. First is that what you have built here is incredibly 
awesome and powerful! I think FilterCSVColumns is the wrong name for this 
Processor. Perhaps it should be TransformCSV, as leveraging Calcite allows us 
to run some pretty powerful queries, such as "SELECT * FROM CSV.A WHERE AMOUNT2 
< 99", and I imagine that this type of use case will be very common.
    
    I do have a concern, though: if you attempt to perform a JOIN 
operation, for example "SELECT X.AMOUNT2, X.AMOUNT3 FROM CSV.A AS X JOIN CSV.A 
AS Y ON X.AMOUNT2=Y.AMOUNT2", we end up with an IOException: Stream closed. This 
is because Calcite has to read the data multiple times in order to 
perform the JOIN, but it is handed a single InputStream that can only be 
consumed once. I think we can get around this by changing 
CsvSchemaFactory2 to be something like CsvInputStreamFactory. That class, 
rather than receiving the InputStream directly, would be passed the FlowFile and 
ProcessSession and could create the InputStream on demand. This would allow the 
data to be read multiple times by creating two InputStreams. The nice thing is 
that if this runs on a system with sufficient RAM, the Operating System's disk 
cache will generally mean that we don't even have to hit the disk for the 
second pass unless it's a really massive amount of CSV.
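
    To make the idea concrete, here is a rough sketch of the "open a fresh 
stream per pass" pattern using only the JDK. ReopenableCsvSource, fromBytes and 
readColumn are made-up names for illustration, not code from the PR, and the 
Supplier stands in for whatever would open the FlowFile content through the 
ProcessSession. The comma split is deliberately naive and is not a real CSV 
parser.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the proposed factory: it holds a way to open the
// data rather than one already-open InputStream, so each pass gets a new stream.
class ReopenableCsvSource {
    private final Supplier<InputStream> opener;
    private int opened = 0;

    ReopenableCsvSource(Supplier<InputStream> opener) {
        this.opener = opener;
    }

    // Assumption for the demo: the content lives in memory. In the processor,
    // the supplier would open the FlowFile content instead.
    static ReopenableCsvSource fromBytes(byte[] data) {
        return new ReopenableCsvSource(() -> new ByteArrayInputStream(data));
    }

    // Every call returns a fresh stream positioned at the start of the data.
    InputStream open() {
        opened++;
        return opener.get();
    }

    int timesOpened() {
        return opened;
    }

    // One full pass over the data, collecting a single column (header included).
    // Naive comma split: no quoting or escaping support.
    List<String> readColumn(int index) {
        List<String> values = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                values.add(line.split(",")[index]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }
}
```

    Calling readColumn twice on the same source performs two independent 
passes, each on its own stream, which is what a self-JOIN would need. Closing 
the first stream has no effect on the second.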
    
    Additionally, I think we need to have some sort of validator for the SQL 
Select Statement property. Right now, if the query is invalid, the processor 
still reports itself as valid and just routes everything to failure at runtime.
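
    As a rough sketch of what I mean, the validator could try to parse the 
statement at configuration time and turn any parse error into an explanation. 
Everything below is hypothetical (SqlPropertyValidator, QueryParser, Result, 
toyParser) and uses only the JDK. The toy parser merely checks for a 
"SELECT ... FROM ..." shape so the example runs on its own; the real processor 
would plug in an actual SQL parser (presumably Calcite's) and wire the result 
into NiFi's property validation.

```java
import java.util.Locale;

// Hypothetical design-time validator for the SQL property.
class SqlPropertyValidator {

    // Anything that throws when the SQL cannot be parsed.
    interface QueryParser {
        void parse(String sql) throws Exception;
    }

    // Outcome of validation, with an explanation when invalid.
    static final class Result {
        final boolean valid;
        final String explanation;

        Result(boolean valid, String explanation) {
            this.valid = valid;
            this.explanation = explanation;
        }
    }

    private final QueryParser parser;

    SqlPropertyValidator(QueryParser parser) {
        this.parser = parser;
    }

    // Reject empty input, then let the parser decide.
    Result validate(String sql) {
        if (sql == null || sql.trim().isEmpty()) {
            return new Result(false, "SQL select statement is empty");
        }
        try {
            parser.parse(sql);
            return new Result(true, "");
        } catch (Exception e) {
            return new Result(false, "Not a valid SQL statement: " + e.getMessage());
        }
    }

    // Toy stand-in for a real parser: only checks the overall statement shape.
    static QueryParser toyParser() {
        return sql -> {
            String s = sql.trim().toUpperCase(Locale.ROOT);
            if (!s.startsWith("SELECT ") || !s.contains(" FROM ")) {
                throw new IllegalArgumentException("expected SELECT ... FROM ...");
            }
        };
    }
}
```

    With something like this in place, a typo in the query would mark the 
processor invalid before it is started, rather than showing up later as 
FlowFiles routed to failure.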


> Create FilterCSVColumns Processor
> ---------------------------------
>
>                 Key: NIFI-1280
>                 URL: https://issues.apache.org/jira/browse/NIFI-1280
>             Project: Apache NiFi
>          Issue Type: Task
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Toivo Adams
>
> We should have a Processor that allows users to easily filter out specific 
> columns from CSV data. For instance, a user would configure two different 
> properties: "Columns of Interest" (a comma-separated list of column indexes) 
> and "Filtering Strategy" (Keep Only These Columns, Remove Only These Columns).
> We can do this today with ReplaceText, but it is far more difficult than it 
> would be with this Processor, as the user has to use Regular Expressions, 
> etc. with ReplaceText.
> Eventually a Custom UI could even be built that allows a user to upload a 
> sample CSV and choose the desired columns from it, similar to the way that 
> Excel lets you drag to select columns when importing CSV. That 
> would certainly be a larger undertaking and would not need to be done for an 
> initial implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
