2013/7/31 Gary Gregory <garydgreg...@gmail.com>

> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <ebo...@apache.org> wrote:
>
> > On 30/07/2013 23:26, Gary Gregory wrote:
> > > And another thing: internally, the header should be a Set<String>, not
> > > a String[]. I plan on fixing that later too.
> >
> > Why should it be a set? Is there an impact on the performance?
> >
>
> Well, I did not finish my thought on that one, sorry about that, please
> allow me to walk through my use cases. The issue is about the feature, not
> performance.
>
> At first glance, using a set avoids an inherent problem with any non-set
> data structure: defining duplicates. What does the following mean?
>
> withHeader("A", "B", "C", "A");
>
> It's a recipe for garbage results: record.get("A") returns what?
>
> Today, I added some CSVFormat validation code that checks for duplicate
> column names. If you build a format with withHeader("A", "B", "C", "A");
> you will get an ISE when validate() is called.
>
> If we had withHeader(Set) and documented it as the 'main' way to specify
> column names, then we could say that withHeader(String...) is just a
> syntactical convenience and turn the String[] into a Set. But that will not
> work.
>
> The problem with a Java Set is that it is not ordered and the current
> implementation relies on the order of the String[]. But why? What the
> current implementation says is: ignore what the header line of the file is
> and use the given column names at the given positions. A perfectly good
> user story. So for withHeader("A", "B", "C"), "A" is column 0, "B" is
> column 1, and so on. Ok, that's one usage.
>
> Taking a step back, I want to talk about why the column name order should
> matter when you are calling withHeader(). I would like to be able to tell
> the parser that I want to use a Set of column names and have it figure out,
> based on the header line, the column indices. This is quite different from
> what we have now.
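
To make the two behaviours quoted above concrete (positional names, and an ISE on a duplicate name), here is a minimal sketch in plain Java. It does not use the Commons CSV classes; PositionalHeaderSketch and toIndexMap are hypothetical names chosen only for illustration, and the real validation code may well look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Commons CSV.
final class PositionalHeaderSketch {

    /**
     * Maps each given column name to its position in the array, so that
     * toIndexMap("A", "B", "C") puts "A" at column 0, "B" at column 1, etc.
     * A repeated name is rejected with an IllegalStateException, because a
     * lookup by that name would be ambiguous.
     */
    static Map<String, Integer> toIndexMap(String... header) {
        Map<String, Integer> indexByName = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            // put() returns the previous index if the name was already seen.
            Integer previous = indexByName.put(header[i], i);
            if (previous != null) {
                throw new IllegalStateException("Duplicate column name '" + header[i]
                        + "' at positions " + previous + " and " + i);
            }
        }
        return indexByName;
    }

    public static void main(String[] args) {
        System.out.println(toIndexMap("A", "B", "C")); // {A=0, B=1, C=2}
        try {
            toIndexMap("A", "B", "C", "A");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The LinkedHashMap keeps the names in the order they were given, which is the ordering a plain Set would not guarantee.
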
>
> A use case I have now is a CSV file with a lot of columns (~90) but I only
> care about a small subset of the columns (~10). I'd like to be able to say
> withHeader(Set) where the Set may be a subset of the actual column names in
> the header line. This is different from withHeader(String[]) because the
> names in the Set must match the names in the header record.
>
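
For illustration, the subset matching described above could be sketched as follows, again in plain Java (9 or later, for Set.of and List.of) and without the Commons CSV classes. HeaderSubsetSketch and resolve are hypothetical names, and failing on a requested name that is absent from the header record is an assumption about the desired behaviour, not something the proposal states.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Commons CSV.
final class HeaderSubsetSketch {

    /**
     * Looks up each wanted column name in the header record read from the
     * file and returns name -> column index, in header order. Columns that
     * were not asked for are ignored. If the header record repeats a wanted
     * name, the last occurrence wins in this sketch.
     */
    static Map<String, Integer> resolve(Set<String> wanted, List<String> headerRecord) {
        Map<String, Integer> indexByName = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.size(); i++) {
            String name = headerRecord.get(i);
            if (wanted.contains(name)) {
                indexByName.put(name, i);
            }
        }
        // Assumption: a wanted name that is missing from the file is an error.
        if (indexByName.size() != wanted.size()) {
            Set<String> missing = new TreeSet<>(wanted);
            missing.removeAll(indexByName.keySet());
            throw new IllegalArgumentException("Columns not found in header record: " + missing);
        }
        return indexByName;
    }

    public static void main(String[] args) {
        List<String> headerRecord = List.of("id", "name", "email", "phone");
        System.out.println(resolve(Set.of("id", "email"), headerRecord)); // {id=0, email=2}
    }
}
```

The caller never states an order; the indices come entirely from the header record, which is the difference from the positional String[] variant.
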
I'm not sure we should try to build all these different cases (guessing
headers, using the first record as headers, using only a subset of the
available headers) into one implementation. What you are talking about
sounds more like a view or a projection of the actual content being parsed.
Do we really need this for 1.0 or can it be postponed?

> So I think it boils down to ignoring my comment about using a Set
> internally and adding a feature where I can tell the parser that I want to
> use a set of column names and not worry about the order, because the parser
> will match up the column names when it reads the header line.
>
> Gary
>
> >
> > Emmanuel Bourg
> >

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter