Uwe,

The Velocity-based transformer sounds like a cool feature. As for the splitter, I'm not quite grokking why it treats its input as a single row to split? Shouldn't the input be a full CSV which you'd want to split? I guess you already have a splitter, perhaps based on SplitText. What I want to do is implement a SplitCSV (and GetCSV) which uses OpenCSV to split a full CSV into individual rows.
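(As a rough illustration, not NiFi code: the row-splitting behavior described above — a real CSV parser emitting one logical record per row, even when a quoted cell contains embedded newlines — can be sketched with Python's standard csv module standing in for OpenCSV; the sample data is made up.)

```python
import csv
import io

# Sample CSV where the last record contains an embedded newline inside
# a quoted cell -- a naive line-based split would break that record.
data = 'id,comment\n1,"plain"\n2,"line one\nline two"\n'

# csv.reader honors quoting, so each iteration yields one logical
# record regardless of embedded newlines.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# -> [['id', 'comment'], ['1', 'plain'], ['2', 'line one\nline two']]
# Note: 3 records, even though the raw text spans 4 physical lines.
```

The same idea applies to splitting on the processor side: iterate records, not lines, and emit one flowfile per record.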
- Dmitry

On Tue, Apr 5, 2016 at 4:06 PM, Uwe Geercken <[email protected]> wrote:

> Dmitry,
>
> what I have is at the moment this:
>
> https://github.com/uwegeercken/nifi_processors
>
> Two processors: one that splits one CSV row and assigns the values to
> flowfile attributes, and one that merges the attributes with a template
> (Apache Velocity) to produce a different output.
>
> I wanted to start with opencsv but ran into problems and got no time
> afterwards.
>
> Rgds,
>
> Uwe
>
> > Sent: Tuesday, 05 April 2016, 21:21
> > From: "Dmitry Goldenberg" <[email protected]>
> > To: [email protected]
> > Subject: Re: Re: Filtering large CSV files
> >
> > Hi Uwe,
> >
> > Yes, that is what I was thinking of using for the CSV processor. Will you
> > be committing your version?
> >
> > - Dmitry
> >
> > On Tue, Apr 5, 2016 at 1:39 PM, Uwe Geercken <[email protected]> wrote:
> >
> > > Dmitry,
> > >
> > > I was working on a processor for CSV files, and one remark came up that
> > > we might want to use the opencsv library for parsing the file.
> > >
> > > Here is the link: http://opencsv.sourceforge.net/
> > >
> > > Greetings,
> > >
> > > Uwe
> > >
> > > > Sent: Tuesday, 05 April 2016, 13:00
> > > > From: "Dmitry Goldenberg" <[email protected]>
> > > > To: [email protected]
> > > > Subject: Re: Filtering large CSV files
> > > >
> > > > Hi Eric,
> > > >
> > > > Thinking about exactly these use cases, I filed the following JIRA
> > > > ticket: NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>.
> > > > It asks for a SplitCSV processor, and actually for a GetCSV ingress
> > > > which would address the issue of reading out of a large CSV, treating
> > > > it as a "data source". I was thinking of actually implementing both
> > > > and committing them.
> > > >
> > > > NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> is asking
> > > > for a way to filter the CSV columns. 
> > > > I believe this is best achieved as the CSV is getting parsed, in
> > > > other words, on the GetCSV/SplitCSV, and not as a separate step.
> > > >
> > > > I'm not sure that SplitText is the best way to process CSV data to
> > > > begin with, because with a CSV there's a chance that a given cell may
> > > > spill over into multiple lines. Such would be the case of embedded
> > > > newlines within a single, quoted cell. I don't think SplitText
> > > > addresses that, and that would be one reason to implement
> > > > GetCSV/SplitCSV using proper CSV parsing semantics, the other reason
> > > > being efficiency of reading.
> > > >
> > > > As far as the limit on the capturing groups, that seems arbitrary. I
> > > > think that on GetCSV/SplitCSV, if you have a way to identify the
> > > > filtered-out columns by their number (index), that should go a long
> > > > way; perhaps a regex is also a good option. I know it may seem that
> > > > filtering should be a separate step in a given dataflow, but from the
> > > > point of view of efficiency, I believe it belongs right in the
> > > > GetCSV/SplitCSV processors as the CSV records are being read and
> > > > processed.
> > > >
> > > > - Dmitry
> > > >
> > > > On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK <[email protected]> wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > > I need to filter large csv files in a data flow. By filtering I
> > > > > mean: scaling the file down in terms of columns, and looking for a
> > > > > particular value to match a parameter. I looked into the example of
> > > > > csv to JSON. I do have a couple of questions:
> > > > >
> > > > > - First, I use a SplitText control to get each line of the file. It
> > > > > makes things slow, as it seems to generate a flow file for each
> > > > > line. Do I have to proceed this way, or is there an alternative? 
> > > > > My csv files are really large and can have millions of lines.
> > > > >
> > > > > - In a second step I am extracting the values with the
> > > > > (.+),(.+),….,(.+) technique, before using a processor to check for a
> > > > > match, on ${csv.146} for instance. Now I have a problem: my csv has
> > > > > 233 fields, so I am getting the message: "ReGex is required to have
> > > > > between 1 and 40 capturing groups but has 233". Again, is there
> > > > > another way to proceed, or am I missing something?
> > > > >
> > > > > Best regards,
> > > > > Eric
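(For illustration of the index-based filtering discussed in this thread: instead of one regex capturing group per column — which runs into the 1-to-40 group limit on a 233-field file — columns can be projected and rows matched while the CSV is parsed. A minimal sketch using Python's stdlib csv module as a stand-in; the sample data, the `keep` indices, and the match condition are all made-up assumptions.)

```python
import csv
import io

# Hypothetical sample; a real file could have 233 columns and millions
# of rows, in which case you would stream rather than build a list.
data = "c0,c1,c2,c3\n10,20,30,40\n11,21,31,41\n"

keep = [0, 2]                   # column indices to retain (assumed config)
match_col, match_val = 2, "30"  # filter: keep rows where column 2 == "30"

reader = csv.reader(io.StringIO(data))
header = next(reader)
result = [[header[i] for i in keep]]  # projected header row
for row in reader:
    if row[match_col] == match_val:
        result.append([row[i] for i in keep])

print(result)
# -> [['c0', 'c2'], ['10', '30']]
```

Doing the projection and the match in the same pass is what makes it cheap: each record is touched once, with no per-column regex work.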
