Uwe,

The Velocity-based transformer sounds like a cool feature.  As for the
splitter, I'm not quite grokking why it treats its input as a single row
to split.  Shouldn't the input be a full CSV which you'd want to split?  I
guess you already have a splitter, perhaps based on SplitText.  What I want
to do is implement a SplitCSV (and a GetCSV) which uses OpenCSV to split a
full CSV into individual rows.
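Something along these lines is what I have in mind. This is just a rough,
stdlib-only sketch of the record-splitting logic (the class and method names
are illustrative, not the opencsv API), mainly to show why quote-aware
parsing matters: a quoted cell with an embedded newline must stay inside
one record, which a line-based split can't do.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of quote-aware CSV record splitting (RFC 4180-style rules:
// fields separated by commas, quoted fields may contain commas and
// newlines, "" inside quotes is an escaped quote).
public class CsvRecordSplitter {

    // Split raw CSV text into records; a newline inside quotes does NOT
    // end the record.
    public static List<String[]> splitRecords(String csv) {
        List<String[]> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < csv.length() && csv.charAt(i + 1) == '"') {
                        cell.append('"');      // escaped quote
                        i++;
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    cell.append(c);            // may be a newline!
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cell.toString());
                cell.setLength(0);
            } else if (c == '\n') {
                fields.add(cell.toString());
                cell.setLength(0);
                records.add(fields.toArray(new String[0]));
                fields.clear();
            } else if (c != '\r') {
                cell.append(c);
            }
        }
        // Flush a final record that has no trailing newline.
        if (cell.length() > 0 || !fields.isEmpty()) {
            fields.add(cell.toString());
            records.add(fields.toArray(new String[0]));
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain\n";
        // 4 physical lines of text, but only 3 logical records.
        System.out.println(splitRecords(csv).size());
    }
}
```

With records split properly, column filtering along the lines of NIFI-1280
would then just be picking indices out of the String[] per record, rather
than a regex with one capturing group per column.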

- Dmitry

On Tue, Apr 5, 2016 at 4:06 PM, Uwe Geercken <[email protected]> wrote:

> Dmitry,
>
> what I have at the moment is this:
>
> https://github.com/uwegeercken/nifi_processors
>
> Two processors: one that splits one CSV row and assigns the values to
> flowfile attributes, and one that merges the attributes with a template
> (Apache Velocity) to produce a different output.
>
> I wanted to start with opencsv but ran into problems and haven't had
> time since.
>
> Rgds,
>
> Uwe
>
> > Sent: Tuesday, 05 April 2016 at 21:21
> > From: "Dmitry Goldenberg" <[email protected]>
> > To: [email protected]
> > Subject: Re: Re: Filtering large CSV files
> >
> > Hi Uwe,
> >
> > Yes, that is what I was thinking of using for the CSV processor.
> > Will you be committing your version?
> >
> > - Dmitry
> >
> > On Tue, Apr 5, 2016 at 1:39 PM, Uwe Geercken <[email protected]>
> > wrote:
> >
> > > Dmitry,
> > >
> > > I was working on a processor for CSV files and one remark came up
> > > that we might want to use the opencsv library for parsing the file.
> > >
> > > Here is the link: http://opencsv.sourceforge.net/
> > >
> > > Greetings,
> > >
> > > Uwe
> > >
> > > > Sent: Tuesday, 05 April 2016 at 13:00
> > > > From: "Dmitry Goldenberg" <[email protected]>
> > > > To: [email protected]
> > > > Subject: Re: Filtering large CSV files
> > > >
> > > > Hi Eric,
> > > >
> > > > Thinking about exactly these use cases, I filed the following
> > > > JIRA ticket: NIFI-1716
> > > > <https://issues.apache.org/jira/browse/NIFI-1716>. It asks for a
> > > > SplitCSV processor, and also for a GetCSV ingress which would
> > > > address the issue of reading from a large CSV, treating it as a
> > > > "data source". I was thinking of implementing both and committing
> > > > them.
> > > >
> > > > NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> asks
> > > > for a way to filter the CSV columns. I believe this is best
> > > > achieved as the CSV is being parsed, in other words in
> > > > GetCSV/SplitCSV, and not as a separate step.
> > > >
> > > > I'm not sure that SplitText is the best way to process CSV data
> > > > to begin with, because with a CSV there's a chance that a given
> > > > cell may spill over into multiple lines. Such would be the case
> > > > with embedded newlines within a single, quoted cell. I don't
> > > > think SplitText addresses that, and that would be one reason to
> > > > implement GetCSV/SplitCSV with proper CSV parsing semantics, the
> > > > other reason being efficiency of reading.
> > > >
> > > > As far as the limit on capturing groups, that seems arbitrary. I
> > > > think that if GetCSV/SplitCSV gave you a way to identify the
> > > > filtered-out columns by their number (index), that would go a
> > > > long way; perhaps a regex is also a good option. I know it may
> > > > seem that filtering should be a separate step in a given
> > > > dataflow, but from the point of view of efficiency I believe it
> > > > belongs right in the GetCSV/SplitCSV processors, as the CSV
> > > > records are being read and processed.
> > > >
> > > > - Dmitry
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK <[email protected]> wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > > I need to filter large csv files in a data flow. By filtering
> > > > > I mean: scaling down the file in terms of columns, and looking
> > > > > for a particular value to match a parameter. I looked into the
> > > > > csv-to-JSON example. I do have a couple of questions:
> > > > >
> > > > > -First, I use a SplitText processor to get each line of the
> > > > > file. It makes things slow, as it seems to generate a flow file
> > > > > for each line. Do I have to proceed this way, or is there an
> > > > > alternative? My csv files are really large and can have
> > > > > millions of lines.
> > > > >
> > > > > -In a second step I am extracting the values with the
> > > > > (.+),(.+),….,(.+) technique, before using a processor to check
> > > > > for a match, on ${csv.146} for instance. Now I have a problem:
> > > > > my csv has 233 fields, so I am getting the message: “RegEx is
> > > > > required to have between 1 and 40 capturing groups but has
> > > > > 233”. Again, is there another way to proceed, or am I missing
> > > > > something?
> > > > >
> > > > > Best regards,
> > > > > Eric
> > > >
> > >
> >
>
