Dmitry, I was working on a processor for CSV files, and one suggestion that came up was to use the opencsv library for parsing the file.
Here is the link: http://opencsv.sourceforge.net/

Greetings,
Uwe

> Sent: Tuesday, 05 April 2016 at 13:00
> From: "Dmitry Goldenberg" <[email protected]>
> To: [email protected]
> Subject: Re: Filtering large CSV files
>
> Hi Eric,
>
> Thinking about exactly these use-cases, I filed the following JIRA ticket:
> NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>. It asks for a
> SplitCSV processor, and also for a GetCSV ingress which would address
> the issue of reading out of a large CSV, treating it as a "data source". I
> was thinking of actually implementing both and committing them.
>
> NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> is asking for a
> way to filter the CSV columns. I believe this is best achieved as the CSV
> is getting parsed, in other words, in GetCSV/SplitCSV, and not as a
> separate step.
>
> I'm not sure that SplitText is the best way to process CSV data to begin
> with, because with a CSV, there's a chance that a given cell may spill over
> into multiple lines. Such would be the case of embedded newlines within a
> single, quoted cell. I don't think SplitText addresses that, and that would
> be one reason to implement GetCSV/SplitCSV using proper CSV parsing
> semantics, the other reason being efficiency of reading.
>
> As for the limit on the capturing groups, that seems arbitrary. I think
> that in GetCSV/SplitCSV, if you have a way to identify the filtered-out
> columns by their number (index), that should go a long way; perhaps a regex
> is also a good option. I know it may seem that filtering should be a
> separate step in a given dataflow, but from the point of view of efficiency,
> I believe it belongs right in the GetCSV/SplitCSV processors, as the CSV
> records are being read and processed.
>
> - Dmitry
>
> On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK <[email protected]> wrote:
>
> > Dear all,
> >
> > I need to filter large CSV files in a data flow. By filtering I
> > mean: reducing the file to a subset of its columns, and looking for a
> > particular value that matches a parameter. I looked into the CSV-to-JSON
> > example. I do have a couple of questions:
> >
> > - First, I use a SplitText processor to get each line of the file. It makes
> > things slow, as it seems to generate a flow file for each line. Do I have
> > to proceed this way, or is there an alternative? My CSV files are really
> > large and can have millions of lines.
> >
> > - In a second step, I am extracting the values with the (.+),(.+),….,(.+)
> > technique, before using a processor to check for a match, on ${csv.146} for
> > instance. Now I have a problem: my CSV has 233 fields, so I am getting the
> > message: “ReGex is required to have between 1 and 40 capturing groups but
> > has 233”. Again, is there another way to proceed, or am I missing something?
> >
> > Best regards,
> > Eric
