Re: [CSV] Headers and the first record

Benedikt Ritter Wed, 31 Jul 2013 00:39:10 -0700

2013/7/31 Gary Gregory <garydgreg...@gmail.com>

> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <ebo...@apache.org> wrote:
>
> > On 30/07/2013 23:26, Gary Gregory wrote:
> > > And another thing: internally, the header should be a Set<String>, not a
> > > String[]. I plan on fixing that later too.
> >
> > Why should it be a set? Is there an impact on the performance?
> >
>
> Well, I did not finish my thought on that one, sorry about that. Please
> allow me to walk through my use cases. The issue is about the feature, not
> performance.
>
> At first glance, using a set avoids an inherent problem with any non-set
> data structure: defining duplicates. What does the following mean?
>
> withHeader("A", "B", "C", "A");
>
> It is a recipe for garbage results: what does record.get("A") return?
>
> Today, I added some CSVFormat validation code that checks for duplicate
> column names. If you build a format with withHeader("A", "B", "C", "A");
> you will get an ISE when validate() is called.
>
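
For reference, a minimal sketch of the kind of duplicate check described
above. It assumes "ISE" means IllegalStateException; the class and method
names are hypothetical and are not the actual CSVFormat code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Commons CSV. */
final class HeaderValidationSketch {

    /** Throws IllegalStateException if any column name occurs more than once. */
    static void validateHeader(String... header) {
        Set<String> seen = new HashSet<>();
        for (String name : header) {
            // Set.add returns false when the element is already present.
            if (!seen.add(name)) {
                throw new IllegalStateException(
                        "Duplicate column name in header: " + name);
            }
        }
    }

    public static void main(String[] args) {
        validateHeader("A", "B", "C"); // accepted
        try {
            validateHeader("A", "B", "C", "A"); // rejected: "A" appears twice
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```
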
> If we had withHeader(Set) and documented it as the 'main' way to specify
> column names, then we could say that withHeader(String...) is just a
> syntactical convenience and turn the String[] into a Set. But that will not
> work.
>
> The problem with a Java Set is that it is not ordered and the current
> implementation relies on order of the String[]. But why? What the current
> implementation says is: ignore what the header line of the file is and use
> the given column names at the given positions. A perfectly good user story.
> So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and so
> on. Ok, that's one usage.
>
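
The positional behaviour described above could be sketched roughly as
follows. This is only an illustration with hypothetical names, not the
parser's actual code: each given name is mapped to its position in the
array, and the file's own header line is never consulted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Commons CSV. */
final class PositionalHeaderSketch {

    /** Maps each given column name to its position in the argument list. */
    static Map<String, Integer> indexByPosition(String... header) {
        // LinkedHashMap keeps the names in the order they were given.
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            indices.put(header[i], i);
        }
        return indices;
    }

    public static void main(String[] args) {
        // "A" is column 0, "B" is column 1, "C" is column 2.
        System.out.println(indexByPosition("A", "B", "C"));
    }
}
```
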
> Taking a step back, I want to talk about why the column name order should
> matter when you are calling withHeader(). I would like to be able to tell
> the parser that I want to use a Set of column names and have it figure out,
> based on the header line, the column indices. This is quite different from
> what we have now.
>
> A use case I have now is a CSV file with a lot of columns (~90) but I only
> care about a small subset of the columns (~10). I'd like to be able to say
> withHeader(Set) where the Set may be a subset of the actual column names in
> the header line. This is different from withHeader(String[]) because the
> names in the Set must match the names in the header record.
>
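
To make the proposal concrete, here is a rough sketch of how a Set of wanted
names might be matched against the header record. All names are hypothetical,
and rejecting names that are missing from the header record is my assumption,
not something decided in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Commons CSV. */
final class HeaderSubsetSketch {

    /**
     * Finds the column index of every wanted name in the header record.
     * Order of the wanted names does not matter; the indices come from
     * the header record itself.
     */
    static Map<String, Integer> resolve(Set<String> wanted, String[] headerRecord) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            if (wanted.contains(headerRecord[i])) {
                indices.put(headerRecord[i], i);
            }
        }
        // Assumption: a wanted name that is absent from the header is an error.
        if (indices.size() != wanted.size()) {
            Set<String> missing = new LinkedHashSet<>(wanted);
            missing.removeAll(indices.keySet());
            throw new IllegalArgumentException(
                    "Columns not found in header record: " + missing);
        }
        return indices;
    }

    public static void main(String[] args) {
        String[] headerRecord = {"A", "B", "C", "D"};
        Set<String> wanted = new HashSet<>(Arrays.asList("C", "A"));
        System.out.println(resolve(wanted, headerRecord));
    }
}
```
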


I'm not sure we should try to build all these different cases (guessing
headers, using the first record as headers, using only a subset of the
available headers) into one implementation.

What you are talking about sounds more like a view or a projection of the
actual content being parsed.
Do we really need this for 1.0 or can it be postponed?


>
> So I think it boils down to ignoring my comment about using a Set
> internally and adding a feature where I can tell the parser that I want to
> use a set of column names and not worry about the order, because the parser
> will match up the column names when it reads the header line.
>
> Gary
>
>
> >
> >
> > Emmanuel Bourg



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter