Re: Re: Re: Filtering large CSV files

Uwe Geercken Tue, 05 Apr 2016 13:07:54 -0700

Dmitry,

this is what I have at the moment:


https://github.com/uwegeercken/nifi_processors

Two processors: one splits a single CSV row and assigns the values to
flowfile attributes; the other merges the attributes with a template
(Apache Velocity) to produce a different output.

I wanted to start with opencsv but ran into problems and have had no time since.

Rgds,

Uwe

> Sent: Tuesday, 05 April 2016 at 21:21
> From: "Dmitry Goldenberg" <[email protected]>
> To: [email protected]
> Subject: Re: Re: Filtering large CSV files
>
> Hi Uwe,
> 
> Yes, that is what I was thinking of using for the CSV processor.  Will you
> be committing your version?
> 
> - Dmitry
> 
> On Tue, Apr 5, 2016 at 1:39 PM, Uwe Geercken <[email protected]> wrote:
> 
> > Dmitry,
> >
> > I was working on a processor for CSV files and one remark came up that we
> > might want to use the opencsv library for parsing the file.
> >
> > Here is the link: http://opencsv.sourceforge.net/
> >
> > Greetings,
> >
> > Uwe
> >
> > > Sent: Tuesday, 05 April 2016 at 13:00
> > > From: "Dmitry Goldenberg" <[email protected]>
> > > To: [email protected]
> > > Subject: Re: Filtering large CSV files
> > >
> > > Hi Eric,
> > >
> > > Thinking about exactly these use-cases, I filed the following JIRA ticket:
> > > NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>. It asks for a
> > > SplitCSV processor, and actually for a GetCSV ingress which would address
> > > the issue of reading out of a large CSV treating it as a "data source". I
> > > was thinking of actually implementing both and committing them.
> > >
> > > NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> is asking for a
> > > way to filter the CSV columns.  I believe this is best achieved as the CSV
> > > is getting parsed, in other words, on the GetCSV/SplitCSV, and not as a
> > > separate step.
> > >
> > > I'm not sure that SplitText is the best way to process CSV data to begin
> > > with, because with a CSV, there's a chance that a given cell may spill over
> > > into multiple lines. Such would be the case of embedded newlines within a
> > > single, quoted cell. I don't think SplitText addresses that and that would
> > > be one reason to implement GetCSV/SplitCSV using proper CSV parsing
> > > semantics, the other reason being efficiency of reading.
> > >
> > > As for the limit on the capturing groups, that seems arbitrary. I think
> > > that on GetCSV/SplitCSV, if you have a way to identify the filtered out
> > > columns by their number (index) that should go a long way; perhaps a regex
> > > is also a good option.  I know it may seem that filtering should be a
> > > separate step in a given dataflow but from the point of view of efficiency,
> > > I believe it belongs right in the GetCSV/SplitCSV processors as the CSV
> > > records are being read and processed.
> > >
> > > - Dmitry
> > >
> > >
> > >
> > >
> > > On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK <[email protected]> wrote:
> > >
> > > > Dear all,
> > > >
> > > > I need to filter large CSV files in a data flow. By filtering I mean
> > > > scaling down the file in terms of columns and looking for a particular
> > > > value that matches a parameter. I looked into the CSV-to-JSON example and
> > > > have a couple of questions:
> > > >
> > > > - First, I use a SplitText processor to get each line of the file. It
> > > > makes things slow, as it seems to generate a flow file for each line. Do
> > > > I have to proceed this way, or is there an alternative? My CSV files are
> > > > really large and can have millions of lines.
> > > >
> > > > - In a second step I extract the values with the (.+),(.+),….,(.+)
> > > > technique, before using a processor to check for a match, on ${csv.146}
> > > > for instance. Now I have a problem: my CSV has 233 fields, so I am getting
> > > > the message: “ReGex is required to have between 1 and 40 capturing groups
> > > > but has 233”. Again, is there another way to proceed, or am I missing
> > > > something?
> > > >
> > > > Best regards,
> > > > Eric
> > >
> >
>
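
The idea discussed in this thread (parse with real CSV semantics and drop
unwanted columns while reading, rather than splitting on line breaks and
matching hundreds of regex groups) can be sketched roughly as follows. This
is only an illustration in Python using its standard csv module; it is not
code from NiFi, opencsv, or the processors linked above. The function name
filter_csv, its parameters, and the sample data are all hypothetical, and
the sketch assumes every row has at least as many fields as the highest
index used.

```python
import csv
import io


def filter_csv(source, keep_columns, match_column=None, match_value=None):
    """Yield each record reduced to keep_columns (zero-based indexes).

    If match_column is given, only records whose field at that index
    equals match_value are kept. Hypothetical helper for illustration.
    """
    # csv.reader applies CSV quoting rules, so a quoted cell containing a
    # newline stays a single field instead of becoming two records.
    for row in csv.reader(source):
        if match_column is not None and row[match_column] != match_value:
            continue
        # Columns are selected by index while reading; no regex is needed.
        yield [row[i] for i in keep_columns]


# Hypothetical sample: the first data record has a newline inside a quoted cell.
sample = 'id,note,status\n1,"first line\nsecond line",open\n2,plain,closed\n'

rows = list(
    filter_csv(io.StringIO(sample), keep_columns=[0, 1],
               match_column=2, match_value="open")
)
```

Because records are produced one at a time by a generator, the whole file
never has to be held in memory, and a check such as the one on ${csv.146}
becomes a plain index lookup (match_column=145 with zero-based indexes)
instead of a 233-group regular expression.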
