[jira] [Commented] (NIFI-1280) Create FilterCSVColumns Processor

Julian Hyde (JIRA) Mon, 16 May 2016 12:44:39 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285161#comment-15285161
 ]


Julian Hyde commented on NIFI-1280:
-----------------------------------

[~Toivo Adams], Calcite represents the algebra, so it can represent any join 
algorithm as easily as any other. I agree that re-scanning data is a really bad 
idea for large data sets; big analytic databases are my field, and the field of 
Hive and Drill, which use Calcite extensively. That said, there are cases where 
nested-loop joins make sense, if you know your data.

I was actually thinking of cases like self-join (e.g. {{select * from emp join 
emp as mgr on emp.mgr = mgr.id where emp.salary > mgr.salary}}, to find all 
employees who earn more than their manager) where you would like to have two 
scans over the same data set, and it would be hard to do that if you represent 
the {{emp}} table as an InputStream.
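
As a rough illustration of the point (plain Java, not Calcite code), the sketch below reads a one-shot InputStream into a list so that the data can be scanned twice. The class name {{SelfJoinSketch}}, the {{Emp}} type, and the CSV layout {{id,name,mgr,salary}} are all assumptions made for this example; the join itself is a simple hash lookup chosen for brevity, not the algorithm Calcite would pick.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SelfJoinSketch {
    /** Hypothetical row type; assumed CSV layout is id,name,mgr,salary. */
    static final class Emp {
        final int id;
        final String name;
        final Integer mgr;   // null when the employee has no manager
        final int salary;

        Emp(int id, String name, Integer mgr, int salary) {
            this.id = id;
            this.name = name;
            this.mgr = mgr;
            this.salary = salary;
        }
    }

    /** Consumes the stream once and materializes every row in memory. */
    static List<Emp> readAll(InputStream in) {
        List<Emp> rows = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] f = line.split(",", -1);
                Integer mgr = f[2].isEmpty() ? null : Integer.valueOf(f[2]);
                rows.add(new Emp(Integer.parseInt(f[0]), f[1], mgr, Integer.parseInt(f[3])));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    /** Names of employees who earn more than their manager. */
    static List<String> earnMoreThanManager(List<Emp> emps) {
        // First scan: index the "mgr" side of the join by id.
        Map<Integer, Emp> byId = new HashMap<>();
        for (Emp e : emps) {
            byId.put(e.id, e);
        }
        // Second scan: probe the index with each row of the "emp" side.
        List<String> result = new ArrayList<>();
        for (Emp e : emps) {
            Emp m = (e.mgr == null) ? null : byId.get(e.mgr);
            if (m != null && e.salary > m.salary) {
                result.add(e.name);
            }
        }
        return result;
    }
}
```

The second loop is only possible because {{readAll}} kept the rows; with the raw InputStream alone, the data would already be consumed after the first pass.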

Calcite's built-in implementation of join (which generates iterator-style Java 
code, and is suitable for small- to medium-sized data) has both a merge-join 
(sorting the input iterators only if they are not already sorted) and a 
theta-join that materializes just the right side of the join. More algorithms 
are possible, and Calcite can also target your query at a distributed 
engine like Spark, Drill or Flink if your data is large.
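
To make "materializes just the right side" concrete, here is a minimal hand-written sketch of that idea. It is not the code Calcite generates; the class {{ThetaJoinSketch}} and the method {{thetaJoin}} are hypothetical names invented for this example. The left input is streamed once, while the right input is buffered so it can be re-scanned for every left row.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiPredicate;

final class ThetaJoinSketch {
    /**
     * Joins two iterators on an arbitrary condition.
     * Only the right side is held in memory; the left side is read once.
     */
    static <L, R> List<Object[]> thetaJoin(Iterator<L> left,
                                           Iterator<R> right,
                                           BiPredicate<L, R> condition) {
        // Materialize the right side so it can be scanned repeatedly.
        List<R> rightRows = new ArrayList<>();
        while (right.hasNext()) {
            rightRows.add(right.next());
        }

        // Stream the left side, testing each row against every right row.
        List<Object[]> joined = new ArrayList<>();
        while (left.hasNext()) {
            L l = left.next();
            for (R r : rightRows) {
                if (condition.test(l, r)) {
                    joined.add(new Object[] {l, r});
                }
            }
        }
        return joined;
    }
}
```

Memory use grows with the right input only, which is why this shape is reasonable for small- to medium-sized data but not for large data sets.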

> Create FilterCSVColumns Processor
> ---------------------------------
>
>                 Key: NIFI-1280
>                 URL: https://issues.apache.org/jira/browse/NIFI-1280
>             Project: Apache NiFi
>          Issue Type: Task
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Toivo Adams
>
> We should have a Processor that allows users to easily filter out specific 
> columns from CSV data. For instance, a user would configure two different 
> properties: "Columns of Interest" (a comma-separated list of column indexes) 
> and "Filtering Strategy" (Keep Only These Columns, Remove Only These Columns).
> We can do this today with ReplaceText, but it is far more difficult than it 
> would be with this Processor, as the user has to use Regular Expressions, 
> etc. with ReplaceText.
> Eventually a Custom UI could even be built that allows a user to upload a 
> Sample CSV and choose which columns to use, similar to the way that Excel 
> works when importing CSV by dragging and selecting the desired columns. That 
> would certainly be a larger undertaking and would not need to be done for an 
> initial implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
