
Re: [CSV] Headers and the first record

2013-07-31 Thread Benedikt Ritter

2013/7/31 Gary Gregory garydgreg...@gmail.com

 On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg ebo...@apache.org wrote:

  Le 30/07/2013 23:26, Gary Gregory a écrit :
   And another thing: internally, the header should be a Set<String>, not a
   String[]. I plan on fixing that later too.
 
  Why should it be a set? Is there an impact on the performance?
 

 Well, I did not finish my thought on that one, sorry about that, please
 allow me to walk through my use cases. The issue is about the feature, not
 performance.

 At first glance, using a set avoids an inherent problem with any non-set
 data structure: defining duplicates. What does the following mean?

 withHeader(A, B, C, A);

 It is a recipe for garbage results: record.get(A) returns what?

 Today, I added some CSVFormat validation code that checks for duplicate
 column names. If you build a format with withHeader(A, B, C, A);
 you will get an ISE when validate() is called.
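A minimal sketch of the kind of duplicate check described above, assuming a hypothetical helper class named HeaderValidation. It is not the Commons CSV source; it only illustrates how a Set can detect a repeated column name and report it with an IllegalStateException (the "ISE" mentioned above).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not the Commons CSV implementation.
final class HeaderValidation {

    /** Throws IllegalStateException if any column name occurs more than once. */
    static void validate(String... header) {
        Set<String> seen = new HashSet<>();
        for (String name : header) {
            // Set.add returns false when the element is already present.
            if (!seen.add(name)) {
                throw new IllegalStateException("Duplicate column name in header: " + name);
            }
        }
    }

    public static void main(String[] args) {
        validate("A", "B", "C"); // no duplicates, returns normally
        try {
            validate("A", "B", "C", "A");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```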

 If we had withHeader(Set) and document it as the 'main' way to specify
 column names, then we can say that withHeader(String...) is just a
 syntactical convenience and turn the String[] into a Set. But that will not
 work.

 The problem with a Java Set is that it does not guarantee any order and the
 current implementation relies on the order of the String[]. But why? What the current
 implementation says is: ignore what the header line of the file is and use
 the given column names at the given positions. A perfectly good user story.
 So for withHeader(A, B, C), A is column 0, B is column 1, and so
 on. Ok, that's one usage.

 Taking a step back, I want to talk about why the column name order should
 matter when you are calling withHeader(). I would like to be able to tell
 the parser that I want to use a Set of column names and have it figure out,
 based on the header line, the column indices. This is quite different from
 what we have now.

 A use case I have now is a CSV file with a lot of columns (~90) but I only
 care about a small subset of the columns (~10). I'd like to be able to say
 withHeader(Set) where the Set may be a subset of the actual column names in
 the header line. This is different from withHeader(String[]) because the
 names in the Set must match the names in the header record.
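The subset use case can be sketched as follows, assuming a hypothetical helper named HeaderSubset that is not part of Commons CSV. It takes the header record as read from the file and a Set of wanted names, and returns the index of each wanted name; the order of the Set plays no role, and a wanted name that is absent from the header record is reported.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Commons CSV.
final class HeaderSubset {

    /**
     * Maps each wanted column name to its index in the header record read
     * from the file. The order of the wanted names is irrelevant.
     */
    static Map<String, Integer> resolve(String[] headerRecord, Set<String> wanted) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            if (wanted.contains(headerRecord[i])) {
                indices.put(headerRecord[i], i);
            }
        }
        // Every wanted name must match a name in the header record.
        Set<String> missing = new TreeSet<>(wanted);
        missing.removeAll(indices.keySet());
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("Columns not found in header: " + missing);
        }
        return indices;
    }

    public static void main(String[] args) {
        String[] headerRecord = {"id", "name", "city", "zip"};
        Set<String> wanted = new TreeSet<>();
        wanted.add("zip");
        wanted.add("name");
        System.out.println(resolve(headerRecord, wanted));
    }
}
```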


I'm not sure if we should try to build all these different cases
(guessing headers, using the first record as headers, using only a subset of
the available headers) into one implementation.

What you are talking about sounds more like a view or a projection of the
actual content being parsed.
Do we really need this for 1.0 or can it be postponed?



 So I think it boils down to ignoring my comment about using a Set
 internally and adding a feature where I can tell the parser that I want to
 use a set of column names and not worry about the order, because the parser
 will match up the column names when it reads the header line.

 Gary


 
 
  Emmanuel Bourg
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
  For additional commands, e-mail: dev-h...@commons.apache.org
 
 


 --
 E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
 Java Persistence with Hibernate, Second Edition
 http://www.manning.com/bauer3/
 JUnit in Action, Second Edition http://www.manning.com/tahchiev/
 Spring Batch in Action http://www.manning.com/templier/
 Blog: http://garygregory.wordpress.com
 Home: http://garygregory.com/
 Tweet! http://twitter.com/GaryGregory




-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [CSV] Headers and the first record

2013-07-31 Thread sebb

On 31 July 2013 08:38, Benedikt Ritter brit...@apache.org wrote:

 I'm not sure if we should try to build all these different cases
 (guessing headers, using the first record as headers, using only a subset of
 the available headers) into one implementation.

 What you are talking about sounds more like a view or a projection of the
 actual content being parsed.
 Do we really need this for 1.0 or can it be postponed?

Agreed, this is something that needs more work before it could be included.

There will always be some extra item that would be nice to have; this
seems non-essential to me.




Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Jul 31, 2013, at 3:38, Benedikt Ritter brit...@apache.org wrote:


 I'm not sure if we should try to build all these different cases
 (guessing headers, using the first record as headers, using only a subset of
 the available headers) into one implementation.

 What you are talking about sounds more like a view or a projection of the
 actual content being parsed.
 Do we really need this for 1.0 or can it be postponed?

This is a real scenario and a real need, not some imaginary complication ;)

Even if it is not implemented for 1.0, we should talk about how it
should be done such that it fits in and does not cause API problems
later. And if I can get it done by then, then that much the better.

Gary





Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Tue, Jul 30, 2013 at 5:47 PM, Emmanuel Bourg ebo...@apache.org wrote:

 Le 30/07/2013 23:24, Gary Gregory a écrit :

  Yeah, that's too clever IMO. I expected the same behavior WRT record
  reading with the only difference being if I let the parser guess or not.

 Too clever? I didn't feel like I designed a rocket with this feature
 though :) That's an important feature to me and I'd like to preserve it.

 If the header is defined in the file I don't want to skip the first
 record manually, the parser should take care of it.


But that is exactly what _was_ happening! ;)

If I called withHeader(A, B, C) the header was not skipped.
If I called withHeader(new String[]{}) the header was skipped.
If I called withHeader() the header was skipped (same as line above).

In every case, I am telling the parser that there is a header, but the
header record is skipped only in some of them. That's the inconsistency I fixed.

What I am asking is: should we have a saveHeader setting such that, IF you
ask for headers, we save that record in the parser? It is currently
lost or, more precisely, transformed into the header map.

Gary


 That also means the
 user code can remain the same, whether the header is defined in the code
 or in the file.


  The current code now always reads the header line if you set any non-null
  header. If you call withHeader() with no args it is a non-null call with an
  empty String[].

 I guess a null header or an empty header is just the same and means the
 first record must be used as the header.

 Emmanuel Bourg



Re: [CSV] Headers and the first record

2013-07-31 Thread Emmanuel Bourg

Le 31/07/2013 15:08, Gary Gregory a écrit :

 But that is exactly what _was_ happening! ;)
 
 If I called withHeader(A, B, C) the header was not skipped.

Sounds good. The header is defined in the code, we don't expect to see
the header in the file so nothing is skipped.

 If I called withHeader(new String[]{}) the header was skipped.

Correct. The header is not defined in the code, the parser uses the
first record as header and doesn't return it when iterating.

 If I called withHeader() the header was skipped (same as line above).

Sounds good too.


What was the issue again? ;)


 What I am asking is: should we have a saveHeader setting such that IF you
 ask for headers, then we save that record in the parser, it is currently
 lost, or, actually transformed into the header map.

Keeping the header around might be useful, I wouldn't create a format
parameter for this though. It could be made available at the record
level, much like ResultSet.getMetaData().
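One way to picture header access at the record level is the sketch below. The class RecordWithHeader and its methods are hypothetical names chosen for illustration, not the Commons CSV API; the assumption is that every record of one parse shares a single name-to-index map and exposes it read-only, loosely in the spirit of ResultSet.getMetaData().

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record type for illustration; not the Commons CSV CSVRecord.
final class RecordWithHeader {
    private final Map<String, Integer> headerMap; // shared by all records of one parse
    private final String[] values;

    RecordWithHeader(Map<String, Integer> headerMap, String... values) {
        this.headerMap = headerMap;
        this.values = values;
    }

    /** Read-only view of column name -> column index. */
    Map<String, Integer> getHeaderMap() {
        return Collections.unmodifiableMap(headerMap);
    }

    /** Name-based lookup; an unknown name is rejected immediately. */
    String get(String name) {
        Integer index = headerMap.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return values[index];
    }

    public static void main(String[] args) {
        Map<String, Integer> header = new LinkedHashMap<>();
        header.put("id", 0);
        header.put("name", 1);
        RecordWithHeader record = new RecordWithHeader(header, "42", "Ada");
        System.out.println(record.getHeaderMap().keySet() + " -> " + record.get("name"));
    }
}
```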

Emmanuel Bourg



Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 8:58 AM, Gary Gregory garydgreg...@gmail.com wrote:


 This is a real scenario and a real need, not some imaginary complication ;)


But I could work with the current framework and use withHeader(new String[]{})
and let the parser find the headers. Then I can just do record.get(A)
with the columns I care about. It just feels a little more mysterious.

I think the only wrinkle left for me is that I want validation that the
columns I care about are there. Right now get(String) throws
IllegalArgumentException if you give it an unknown column, which will fail
fast enough on the first record.

So I'll go down that road until the next speed bump...

Gary




[CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)

2013-07-31 Thread Benedikt Ritter

snip

 A use case I have now is a CSV file with a lot of columns (~90) but I only
 care about a small subset of the columns (~10). I'd like to be able to say
 withHeader(Set) where the Set may be a subset of the actual column names in
 the header line. This is different from withHeader(String[]) because the
 names in the Set must match the names in the header record.

 
  What you are talking about sounds more like a view or a projection of the
  actual content being parsed.
  Do we really need this for 1.0 or can it be postponed?

 This is a real scenario and a real need, not some imaginary complication ;)

 Even if it is not implemented for 1.0, we should talk about how it
 should be done such that it fits in and does not cause API problems
 later. And if I can get it done by then, then that much the better.


Okay, then let's discuss this on a new thread :-)

As I've said, I think we should not push too much into
withHeader(String...). Maybe this is some sort of view, where you can pass
a parser and the headers you are interested in and it will return an
Iterable<CSVRecord> (or CSVParser) that just gives access to the specified
headers you are interested in?

Would it be possible to give a code example of what you have to do with the
current API in your use case and what you want?

Benedikt




Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 9:34 AM, Emmanuel Bourg ebo...@apache.org wrote:

 Le 31/07/2013 15:08, Gary Gregory a écrit :

  But that is exactly what _was_ happening! ;)
 
  If I called withHeader(A, B, C) the header was not skipped.

 Sounds good. The header is defined in the code, we don't expect to see
 the header in the file so nothing is skipped.


NOT good! ;) This is where we disagree. The parser used to behave
differently depending on the contents of the String[].
- From an API design standpoint, it's smelly to me.
- The feature is hard to understand. If we want that, we need two APIs for
two behaviors.

The withHeader API currently serves two purposes:
- The file has a header record, but I am overriding it with my own names.
- There is no header record, so I am telling you what the header names are.

These two features clash because in one case the file has a header line and
in the other the file does not. This is why we need settings with different
names.

That or a setting that says 'skip the first record, it's the header, I do
not want to see it as a data record'

I see three scenarios:

1) I set the headers (the file does not have one), do not skip the first
record
2) I override the existing header record, skip the first record
3) The parser guesses the headers based on reading the first record, which
skips the first record as a data record

This can be accommodated with a skipHeaderRecord boolean setting.

I do not care what the default behavior is as long as I can say "this file
has headers, guess them please, and skip record 0" and "this file has a
header record, but I'm telling you to call them A, B, and C, so skip record
0".

1) withHeader(A, B, C).skipHeaderRecord(false);
2) withHeader(A, B, C).skipHeaderRecord(true);
3) withHeader()
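The three scenarios can be modeled with a small sketch. HeaderScenarios and its apply method are hypothetical and work on rows that are already split into fields; they do not reproduce the Commons CSV parser, only the decision about which names are used and whether the first row is returned as data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the three scenarios; not the Commons CSV parser.
final class HeaderScenarios {

    /** Header names plus the rows that are returned as data records. */
    static final class Parsed {
        final String[] header;
        final List<String[]> records;

        Parsed(String[] header, List<String[]> records) {
            this.header = header;
            this.records = records;
        }
    }

    /**
     * @param rows             all rows of the file, in order
     * @param header           empty: take the names from the first row;
     *                         non-empty: use exactly these names
     * @param skipHeaderRecord consulted only when header is non-empty
     */
    static Parsed apply(List<String[]> rows, String[] header, boolean skipHeaderRecord) {
        if (rows.isEmpty()) {
            return new Parsed(header, new ArrayList<>());
        }
        if (header.length == 0) {
            // Scenario 3: the first row supplies the names and is not a data record.
            return new Parsed(rows.get(0), new ArrayList<>(rows.subList(1, rows.size())));
        }
        // Scenario 1 (false): every row is data. Scenario 2 (true): drop row 0.
        int firstData = skipHeaderRecord ? 1 : 0;
        return new Parsed(header, new ArrayList<>(rows.subList(firstData, rows.size())));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"x", "y"}, new String[] {"1", "2"});
        System.out.println(apply(rows, new String[] {"A", "B"}, false).records.size());
        System.out.println(apply(rows, new String[] {"A", "B"}, true).records.size());
        System.out.println(Arrays.toString(apply(rows, new String[0], false).header));
    }
}
```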

Is there a better name for skipHeaderRecord? Maybe:

1b) withHeader(A, B, C).firstRecordIsHeader(false);
2b) withHeader(A, B, C).firstRecordIsHeader(true);

Here the difference is that the API does not describe behavior, instead it
describes the data, and behavior is implied.

There is also:

1c) withHeader(A, B, C)
2c) withHeaderOverride(A, B, C)

Thoughts?

Gary




Re: [CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 10:42 AM, Benedikt Ritter brit...@apache.org wrote:


 Would it be possible to give a code example of what you have to do with the
 current API in your use case and what you want?


I am switching to withHeader() with no args (same as new String[]{}),
letting the parser guess the headers and then praying that the names match
between the app and the files. That is just as unsafe as forcing the headers
on the parser in a fixed order, because the column order might have changed.
Ideally, the column order should not matter, which it does not when you do
a record.get(String), which is nice.

Calling withHeader() with no args is less brittle than calling it with 90
args. The benefit is that the column order in the file can change without
affecting the app, which is good. I could use a little more bullet-proofing
by making the column names optionally case-insensitive, but that's a
different feature.

Ideally, I want to define the column names in the app as a simple Java
enum, then use an enum as a record key. That does not work for column names
that have spaces in them as mine do, so it's back to classic static final
Strings as keys. I could create a fancier custom enum but it's not worth it
for now.
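A "fancier custom enum" could look like the sketch below, where each constant carries the real header text, spaces included. The class EnumColumns, the Column constants, and the column names are all invented for illustration; nothing here is part of Commons CSV.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the column names are invented for illustration.
final class EnumColumns {

    /** Enum constants are legal identifiers; the label is the real header text. */
    enum Column {
        FIRST_NAME("First Name"),
        ORDER_DATE("Order Date");

        private final String label;

        Column(String label) {
            this.label = label;
        }

        String label() {
            return label;
        }
    }

    /** Looks up a value by enum key via the header name -> index map. */
    static String get(Map<String, Integer> headerMap, String[] record, Column column) {
        Integer index = headerMap.get(column.label());
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + column.label());
        }
        return record[index];
    }

    public static void main(String[] args) {
        Map<String, Integer> headerMap = new HashMap<>();
        headerMap.put("Order Date", 0);
        headerMap.put("First Name", 1);
        String[] record = {"2013-07-31", "Ada"};
        System.out.println(get(headerMap, record, Column.FIRST_NAME));
    }
}
```

The trade-off is one extra field per constant in exchange for compile-time checking of the keys used by the app.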

Gary



Re: [CSV] Headers and the first record

2013-07-31 Thread Mark Fortner

I took a brief look at the API for CSV, and thought I would share a typical
use case from the biotech industry.  We deal with a lot of instruments that
produce a multiline header.  The header usually contains experiment
conditions.  You can think of this as metadata for the columnar data.  The
experiment conditions usually contain things like the name of the scientist
using the instrument, the time of day the experiment was run, and some
instrument configuration settings.  Usually when we parse CSV files, we
have to parse the header first, extract all relevant data, and then parse
the rows of data.

In addition to the experiment conditions header, there are also column
headers.  The column headers can be multi-lined as well.  For example, you
might have a column header whose first line contains chemical compound IDs
or names, and the second line of the column header contains the
concentrations for those compounds. The data values represent the percent
inhibition at those concentrations. Like this:

Erlotinib
1uM 10 uM 100 uM 1nM
0.01  0.001  0.0001 0.1
...

Since the position and types of header and body data vary, we typically use
parse configuration files that describe what data can be found where.
The parse configuration varies not only per instrument but also per
experimental protocol. So there are usually numerous configuration files in
your typical lab.  The configuration files can also be stored in a
database.  This is usually part of a file-watching web app.  It allows
scientists to add support for new experiments or instruments without having
to get a developer to write more code.

In the API I saw support for hard-coded configurations via the CSVFormat
object, but I didn't see any support for creating and using persistable
configurations.  You may want to consider that as you move forward.
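As a rough illustration of what a persistable configuration might look like, the sketch below loads parse settings from java.util.Properties text, which could live in a file or a database column. The class ParseConfig and the keys delimiter, metadataLines, and columnHeaderLines are assumptions made up for this example; they are not part of the CSV API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical configuration holder; the property keys are invented here.
final class ParseConfig {
    final char delimiter;
    final int metadataLines;     // experiment-condition lines before the table
    final int columnHeaderLines; // lines that make up the column header

    ParseConfig(char delimiter, int metadataLines, int columnHeaderLines) {
        this.delimiter = delimiter;
        this.metadataLines = metadataLines;
        this.columnHeaderLines = columnHeaderLines;
    }

    /** Builds a configuration from text in java.util.Properties format. */
    static ParseConfig load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String delimiter = props.getProperty("delimiter", ",");
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        return new ParseConfig(
                delimiter.charAt(0),
                Integer.parseInt(props.getProperty("metadataLines", "0")),
                Integer.parseInt(props.getProperty("columnHeaderLines", "1")));
    }

    public static void main(String[] args) {
        ParseConfig config = load("delimiter=;\nmetadataLines=3\ncolumnHeaderLines=2\n");
        System.out.println(config.delimiter + " " + config.metadataLines
                + " " + config.columnHeaderLines);
    }
}
```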

Hope this helps,

Mark




Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 11:14 AM, Mark Fortner phidia...@gmail.com wrote:

 I took a brief look at the API for CSV, and thought I would share a typical
 use case from the biotech industry.  We deal with a lot of instruments that
 produce a multiline header.  The header usually contains experiment
 conditions.  You can think of this as metadata for the columnar data.  The
 experiment conditions usually contain things like the name of the scientist
 using the instrument, the time of day the experiment was run, and some
 instrument configuration settings.  Usually when we parse CSV files, we
 have to parse the header first, extract all relevant data, and then parse
 the rows of data.

 In addition to the experiment conditions header, there are also column
 headers.  The column headers can be multi-lined as well.  For example, you
 might have a column header whose first line contains chemical compound IDs
 or names, and the second line of the column header contains the
 concentrations for those compounds. The data values represent the percent
 inhibition at those concentrations. Like this:

 Erlotinib
 1uM 10 uM 100 uM 1nM
 0.01  0.001  0.0001 0.1
 ...

 Since the position and types of header and body data vary, we typically use
  parse configuration files that describe what data can be found where.
  The parse configuration varies not only per instrument but also per
 experimental protocol. So there are usually numerous configuration files in
 your typical lab.  The configuration files can also be stored in a
 database.  This is usually part of a file-watching web app.  It allows
 scientists to add support for new experiments or instruments without having
 to get a developer to write more code.

 In the API I saw support for hard-coded configurations via the CSVFormat
 object, but I didn't see any support for creating and using persistable
 configurations.  You may want to consider that as you move forward.


Thank you for taking the time to offer your point of view here.

CSVFormat implements Serializable, so you can use plain old Java
serialization, it's not human readable, but it's something.
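
To make the "plain old Java serialization" route concrete, here is a minimal sketch. FormatConfig is a hypothetical stand-in for a serializable format object (it is not CSVFormat, and its fields are assumptions); only the java.io calls are real. It shows a round trip through a byte array, which is binary and not human readable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

class SerializationSketch {
    // Hypothetical stand-in for a serializable format object; this is NOT CSVFormat.
    static class FormatConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        final char delimiter;
        final String[] header;

        FormatConfig(char delimiter, String... header) {
            this.delimiter = delimiter;
            this.header = header.clone();
        }
    }

    // Writes the object to a byte array using standard Java serialization.
    static byte[] save(FormatConfig config) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(config);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the object back from the bytes produced by save().
    static FormatConfig load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (FormatConfig) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        FormatConfig restored = load(save(new FormatConfig(',', "A", "B", "C")));
        System.out.println(restored.delimiter + " " + Arrays.toString(restored.header));
    }
}
```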

If we moved to Java 6, we could annotate CSVFormat with JAXB so you can
have XML IO. Personally, I do not think we should do our own XML IO, so
JAXB is the best path IMO since it is built-in Java 6.

What do you currently use to parse your CSV files?

Would Commons-CSV work for you as well? If not, how so?

Would you be willing to experiment with the current code?

Thank you,
Gary


 Hope this helps,

 Mark




Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 10:48 AM, Gary Gregory garydgreg...@gmail.com wrote:

 On Wed, Jul 31, 2013 at 9:34 AM, Emmanuel Bourg ebo...@apache.org wrote:

 Le 31/07/2013 15:08, Gary Gregory a écrit :

  But that is exactly what _was_ happening! ;)
 
  If I called withHeader("A", "B", "C") the header was not skipped.

 Sounds good. The header is defined in the code, we don't expect to see
 the header in the file so nothing is skipped.


 NOT good! ;) This is where we disagree. The parser used to behave
 differently depending on the contents of the String[].
 - From an API design standpoint, it's smelly to me.
 - The feature is hard to understand. If we want that, we need two APIs for
 two behaviors.

 Using the withHeader API, I can tell the parser to:
 - Ignore the fact that there is a header record, I am overriding it with
 my own names
 - There is no header record, so I am telling you what the header names are.

 These two features clash because in one case the file has a header line
 and in the other the file does not. This is why we need settings with
 different names.

 That or a setting that says 'skip the first record, it's the header, I do
 not want to see it as a data record'

 I see three scenarios:

 1) I set the headers (the file does not have one), do not skip the first
 record
 2) I override the existing header record, skip the first record
 3) The parser guesses the headers based on reading the first record, which
 skips the first record as a data record

 This can be accommodated with a skipHeaderRecord boolean setting.

 I do not care what the default behavior is as long as I can say "this file
 has headers, guess them please, and skip record 0" and "this file has a
 header record, but I'm telling you to call them A, B, and C, so skip record
 0".

 1) withHeader("A", "B", "C").skipHeaderRecord(false);
 2) withHeader("A", "B", "C").skipHeaderRecord(true);
 3) withHeader()

 Is there a better name for skipHeaderRecord? Maybe:

 1b) withHeader("A", "B", "C").firstRecordIsHeader(false);
 2b) withHeader("A", "B", "C").firstRecordIsHeader(true);

 Here the difference is that the API does not describe behavior, instead it
 describes the data, and behavior is implied.

 There is also:

 1c) withHeader("A", "B", "C")
 2c) withHeaderOverride("A", "B", "C")

 Thoughts?


I reverted back to NOT skipping a record when withHeader is called with a
non-empty array; and added a skipHeaderRecord setting to CSVFormat to use
when headers are initialized.
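
The three scenarios quoted above can be sketched with a toy model. This is not the Commons CSV implementation; HeaderModel and its fields are hypothetical names, and the rows are already split into fields. It only illustrates which row ends up as a data record in each case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the header handling discussed in this thread; NOT Commons CSV code.
class HeaderModel {
    final Map<String, Integer> headerMap = new LinkedHashMap<>();
    final List<String[]> records = new ArrayList<>();

    // header == null: no header mapping at all.
    // header.length == 0: names come from the first row, which is consumed.
    // header.length > 0: the given names are used; the first row is dropped
    // only when skipHeaderRecord is true.
    HeaderModel(List<String[]> rows, String[] header, boolean skipHeaderRecord) {
        int firstData = 0;
        String[] names = header;
        if (header != null && header.length == 0) {
            names = rows.isEmpty() ? new String[0] : rows.get(0);
            firstData = rows.isEmpty() ? 0 : 1;
        } else if (header != null && skipHeaderRecord && !rows.isEmpty()) {
            firstData = 1;
        }
        if (names != null) {
            for (int i = 0; i < names.length; i++) {
                headerMap.put(names[i], i);
            }
        }
        records.addAll(rows.subList(firstData, rows.size()));
    }

    // Looks up a field by column name in the given data record.
    String get(int recordIndex, String name) {
        return records.get(recordIndex)[headerMap.get(name)];
    }

    public static void main(String[] args) {
        List<String[]> headerFile = Arrays.asList(new String[] {"x", "y"}, new String[] {"1", "2"});
        // Scenario 2: override the header record in the file and skip it.
        HeaderModel overridden = new HeaderModel(headerFile, new String[] {"A", "B"}, true);
        System.out.println(overridden.records.size() + " record(s), A=" + overridden.get(0, "A"));
        // Scenario 3: let the first record define the names.
        HeaderModel guessed = new HeaderModel(headerFile, new String[0], false);
        System.out.println(guessed.records.size() + " record(s), y=" + guessed.get(0, "y"));
    }
}
```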

Gary



 Gary



  If I called withHeader(new String[]{}) the header was skipped.

 Correct. The header is not defined in the code, the parser uses the
 first record as header and doesn't return it when iterating.

  If I called withHeader() the header was skipped (same as line above).

 Sounds good too.


 What was the issue again ? ;)


  What I am asking is: should we have a saveHeader setting such that IF
 you
  ask for headers, then we save that record in the parser, it is currently
  lost, or, actually transformed into the header map.

 Keeping the header around might be useful, I wouldn't create a format
 parameter for this though. It could be made available at the record
 level, much like ResultSet.getMetaData().

 Emmanuel Bourg


 -
 To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
 For additional commands, e-mail: dev-h...@commons.apache.org




 --
 E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
 Java Persistence with Hibernate, Second 
 Editionhttp://www.manning.com/bauer3/
 JUnit in Action, Second Edition http://www.manning.com/tahchiev/
 Spring Batch in Action http://www.manning.com/templier/
 Blog: http://garygregory.wordpress.com
 Home: http://garygregory.com/
 Tweet! http://twitter.com/GaryGregory




-- 
E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
Java Persistence with Hibernate, Second Edition http://www.manning.com/bauer3/
JUnit in Action, Second Edition http://www.manning.com/tahchiev/
Spring Batch in Action http://www.manning.com/templier/
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory

Re: [CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)

2013-07-31 Thread Benedikt Ritter

2013/7/31 Gary Gregory garydgreg...@gmail.com

 On Wed, Jul 31, 2013 at 10:42 AM, Benedikt Ritter brit...@apache.org
 wrote:

  snip
 
   A use case I have now is a CSV file with a lot of columns (~90) but I
  only
   care about a small subset of the columns (~10). I'd like to be able to
  say
   withHeader(Set) where the Set may be a subset of the actual column
 names
  in
   the header line. This is different from withHeader(String[]) because
 the
   names in the Set must match the names in the header record.
 
   
What you are talking about sounds more like a view or a projection of
  the
actual content being parsed.
Do we really need this for 1.0 or can it be postponed?
  
   This is a real scenario and a real need, not some imaginary
 complication
  ;)
  
   Even if it is not implemented for 1.0, we should talk about how it
   should be done such that it fits in and does not cause API problems
   later. And if I can get it done by then, then that much the better.
  
 
  Okay, then let's discuss this on a new thread :-)
 
  As I've said, I think we should not push too much into
  withHeader(String...). Maybe this is some sort of view, where you can
  pass a parser and the headers you are interested in and it will return an
  Iterable<CSVRecord> (or CSVParser) that just gives access to the
  specified headers?
 
  Would it be possible to give a code example of what you have to do with
  the current API in your use case and what you want?
 

 I am switching to withHeader() with no arg (same as a new String[]{}) and
 let the parser guess the headers and then pray that the names match between
 the app and the files. Which is just as unsafe as forcing the headers in
 fixed order on the parser because the column order might have changed.
 Ideally, the column order should not matter, which it does not when you do
 a record.get(String), which is nice.

 Calling withHeader() with no args is less brittle than calling it with 90
 args. The benefit is that the column order in the file can change without
 affecting the app, which is good. I could use a little more bullet-proofing
 by making the column names optionally case-insensitive, but that's a
 different feature.

 Ideally, I want to define the column names in the app as a simple Java
 enum, then use an enum as a record key. That does not work for column names
 that have spaces in them as mine do, so it's back to classic static final
 Strings as keys. I could create a fancier custom enum but it's not worth it
 for now.


Hey Gary,

I still don't understand what you are suggesting. At first I thought this
was about accessing a subset of the actual columns (you said your file has
90 columns but you are only interested in ~10).

Your last message sounds more like you're looking for a better way to make
sure the headers parsed from the file match what you are expecting. I guess
this is why getHeaderMap is now public (?!)

What am I missing?

Benedikt



 Gary


  Benedikt
 
 
 
  --
  http://people.apache.org/~britter/
  http://www.systemoutprintln.de/
  http://twitter.com/BenediktRitter
  http://github.com/britter
 







-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 2:38 PM, Benedikt Ritter brit...@apache.org wrote:

 2013/7/31 Gary Gregory garydgreg...@gmail.com

  On Wed, Jul 31, 2013 at 10:42 AM, Benedikt Ritter brit...@apache.org
  wrote:
 
   snip
  
A use case I have now is a CSV file with a lot of columns (~90) but
 I
   only
care about a small subset of the columns (~10). I'd like to be able
 to
   say
withHeader(Set) where the Set may be a subset of the actual column
  names
   in
the header line. This is different from withHeader(String[]) because
  the
names in the Set must match the names in the header record.
  

 What you are talking about sounds more like a view or a projection
 of
   the
 actual content being parsed.
 Do we really need this for 1.0 or can it be postponed?
   
This is a real scenario and a real need, not some imaginary
  complication
   ;)
   
Even if it is not implemented for 1.0, we should talk about how it
should be done such that it fits in and does not cause API problems
later. And if I can get it done by then, then that much the better.
   
  
   Okay, then let's discuss this on a new thread :-)
  
   As I've said, I think we should not push too much into
   withHeader(String...). Maybe this is some sort of view, where you can
   pass a parser and the headers you are interested in and it will return an
   Iterable<CSVRecord> (or CSVParser) that just gives access to the
   specified headers?
  
   Would it be possible to give a code example of what you have to do with
   the current API in your use case and what you want?
  
 
  I am switching to withHeader() with no arg (same as a new String[]{}) and
  let the parser guess the headers and then pray that the names match
 between
  the app and the files. Which is just as unsafe as forcing the headers in
  fixed order on the parser because the column order might have changed.
  Ideally, the column order should not matter, which it does not when you
 do
  a record.get(String), which is nice.
 
  Calling withHeader() with no args is less brittle than calling it with 90
  args. The benefit is that the column order in the file can change without
  affecting the app, which is good. I could use a little more
 bullet-proofing
  by making the column names optionally case-insensitive, but that's a
  different feature.
 
  Ideally, I want to define the column names in the app as a simple Java
  enum, then use an enum as a record key. That does not work for column
 names
  that have spaces in them as mine do, so it's back to classic static final
  Strings as keys. I could create a fancier custom enum but it's not worth
 it
  for now.
 

 Hey Gary,

 I still don't understand what you are suggesting. At first I thought this
 was about accessing a subset of the actual columns (you said your file has
 90 columns but you are only interested in ~10).

 Your last message sounds more like you're looking for a better way to make
 sure the headers parsed from the file match what you are expecting. I guess
 this is why getHeaderMap is now public (?!)


 What am I missing?


Sorry, I seem to keep mixing up the topics. For my many-columned file, I'm
going with withHeader() [no args] and get(String). That's good enough, but I
still need proper header skipping, which is now in.

Yes, I'm looking for what amounts to schema validation, but since
get(String) will fail on the first record, that's fail-fast enough for now
:)
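
For anyone who wants the check up front instead of waiting for get(String) to fail, here is a minimal sketch of that kind of schema validation. It assumes the header map is a name-to-index Map<String, Integer>; HeaderCheck and missingColumns are hypothetical names, not part of the library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class HeaderCheck {
    // Returns the required names that are absent from the header map, in the order given.
    static Set<String> missingColumns(Map<String, Integer> headerMap, String... required) {
        Set<String> missing = new LinkedHashSet<>(Arrays.asList(required));
        missing.removeAll(headerMap.keySet());
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for a header map parsed from the first record of a file.
        Map<String, Integer> headerMap = new LinkedHashMap<>();
        headerMap.put("Id", 0);
        headerMap.put("First Name", 1);
        headerMap.put("Last Name", 2);
        System.out.println(missingColumns(headerMap, "Id", "Email")); // prints [Email]
    }
}
```

An empty result means every column the app relies on is present, regardless of column order in the file.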

getHeaderMap() has been public for a long time, so that's not an issue here.

getHeader() OTOH is now public because I want to be able to build on one
format to get a new one.

Gary




 Benedikt


 
  Gary
 
 
   Benedikt
  
  
  

Re: [CSV] Headers and the first record

2013-07-31 Thread Mark Fortner

Hi Gary,
One other complication I forgot to mention.  Compounds are usually run
multiple times.  So the same compound will appear with the same set of
concentrations.  In practice you would end up with column headers that have
the same text in them, so this issue with using a Set vs String[] for the
column names would complicate things.
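
A minimal sketch of why repeated names are a problem for a Set or a name-to-index map, and one way to keep them. DuplicateHeaders and positionsByName are hypothetical names used only for illustration; nothing here is Commons CSV code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateHeaders {
    // Maps each column name to every position where it occurs, so repeats are kept.
    static Map<String, List<Integer>> positionsByName(String... header) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            positions.computeIfAbsent(header[i], k -> new ArrayList<>()).add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] header = {"Target", "1uM", "10 uM", "1uM", "10 uM"};

        // A plain name-to-index map silently keeps only the last occurrence.
        Map<String, Integer> single = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            single.put(header[i], i);
        }
        System.out.println(single.get("1uM")); // prints 3

        // Keeping a list of positions preserves both runs.
        System.out.println(positionsByName(header).get("1uM")); // prints [1, 3]
    }
}
```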


 CSVFormat implements Serializable, so you can use plain old Java
 serialization, it's not human readable, but it's something.


A human readable configuration would probably be a high priority.



 If we moved to Java 6, we could annotate CSVFormat with JAXB so you can
 have XML IO. Personally, I do not think we should do our own XML IO, so
 JAXB is the best path IMO since it is built-in Java 6.


It would be best if there were a CSVFormat serializer so that the CSVFormat
could be injected.  Using JAXB would be fine as a default implementation,
but I imagine that the configuration format would change.  Or that a user
might decide to store individual configuration items in a database.
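
As a rough illustration of a human-readable alternative, here is a sketch that stores individual settings with java.util.Properties. The key names ("delimiter", "header.0", ...) and the class PropertiesConfigSketch are assumptions made up for this example; a real serializer could use any format or a database.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class PropertiesConfigSketch {
    // Writes the settings as key=value text; the key names are hypothetical.
    static String write(char delimiter, List<String> header) {
        Properties props = new Properties();
        props.setProperty("delimiter", String.valueOf(delimiter));
        for (int i = 0; i < header.size(); i++) {
            props.setProperty("header." + i, header.get(i));
        }
        StringWriter out = new StringWriter();
        try {
            props.store(out, "hypothetical CSV format settings");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Reads the header names back in index order until a key is missing.
    static List<String> readHeader(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> header = new ArrayList<>();
        for (int i = 0; props.getProperty("header." + i) != null; i++) {
            header.add(props.getProperty("header." + i));
        }
        return header;
    }

    public static void main(String[] args) {
        String text = write(',', Arrays.asList("Target", "1uM", "10 uM"));
        System.out.println(text);
        System.out.println(readHeader(text));
    }
}
```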



 What do you currently use to parse your CSV files?


Most biotech companies have their own home grown tools for parsing
instrument files.  There isn't a standard library.



 Would Commons-CSV work for you as well? If not, how so?


As I understand it, the code doesn't support experiment condition-type
parameters, like this:

Date: 12/10/13
Protocol: Selectivity Profile 1	Instrument Name: Gandalf
Scientist: John Smith


 Would you be willing to experiment with the current code?


Sure. If the previous issues were addressed.

I'm curious if other industries have similar issues?  I assume that anyone
that deals with instrument data might have similar needs.

Mark

Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 3:44 PM, Mark Fortner phidia...@gmail.com wrote:

 Hi Gary,
 One other complication I forgot to mention.  Compounds are usually run
 multiple times.  So the same compound will appear with the same set of
 concentrations.  In practice you would end up with column headers that have
 the same text in them, so this issue with using a Set vs String[] for the
 column names would complicate things.


  CSVFormat implements Serializable, so you can use plain old Java
  serialization, it's not human readable, but it's something.
 

 A human readable configuration would probably be a high priority.


 
  If we moved to Java 6, we could annotate CSVFormat with JAXB so you can
  have XML IO. Personally, I do not think we should do our own XML IO, so
  JAXB is the best path IMO since it is built-in Java 6.
 

 It would be best if there were a CSVFormat serializer so that the CSVFormat
 could be injected.  Using JAXB would be fine as a default implementation,
 but I imagine that the configuration format would change.  Or that a user
 might decide to store individual configuration items in a database.


 
  What do you currently use to parse your CSV files?
 

 Most biotech companies have their own home grown tools for parsing
 instrument files.  There isn't a standard library.


 
  Would Commons-CSV work for you as well? If not, how so?
 

 As I understand it, the code doesn't support experiment condition-type
 parameters, like this:

 Date: 12/10/13
 Protocol: Selectivity Profile 1	Instrument Name: Gandalf
 Scientist: John Smith


This does not look like a classic CSV file.

It sounds like your files contain different sections in different formats.

In its current state, commons-csv might not be right for you. What does the
rest of the file look like?

Gary




  Would you be willing to experiment with the current code?
 
 
 Sure. If the previous issues were addressed.

 I'm curious if other industries have similar issues?  I assume that anyone
 that deals with instrument data might have similar needs.

 Mark





Re: [CSV] Headers and the first record

2013-07-31 Thread Mark Fortner

Hi Gary,


 This does not look like a classic CSV file.


I guess it depends on what your definition of classic is. :-)  This is
pretty typical for most drug discovery companies.


 It sounds like your files contain different sections in different formats.


True.



 In its current state, commons-csv might not be right for you. What does the
 rest of the file look like?


The data section looks similar to this.

         Erlotinib - Run 1            Erlotinib - Run 2
Target   1uM   10 uM  100 uM  1nM     1uM   10 uM  100 uM  1nM
BRCA1    0.01  0.001  0.0001  0.1     0.01  0.001  0.0001  0.1
BRCA2    0.2   0.002  0.0002  0.2     0.2   0.002  0.0002  0.2


Regards,

Mark

Re: [CSV] Headers and the first record

2013-07-31 Thread Gary Gregory

On Wed, Jul 31, 2013 at 4:38 PM, Mark Fortner phidia...@gmail.com wrote:

 Hi Gary,


  This does not look like a classic CSV file.


 I guess it depends on what your definition of classic is. :-)  This is
 pretty typical for most drug discovery companies.


  It sounds like your files contain different sections in different
 formats.
 

 True.


 
  In its current state, commons-csv might not be right for you. What does
 the
  rest of the file look like?


 The data section looks similar to this.

          Erlotinib - Run 1            Erlotinib - Run 2
 Target   1uM   10 uM  100 uM  1nM     1uM   10 uM  100 uM  1nM
 BRCA1    0.01  0.001  0.0001  0.1     0.01  0.001  0.0001  0.1
 BRCA2    0.2   0.002  0.0002  0.2     0.2   0.002  0.0002  0.2



Hm... so it looks like you have a couple of rows that each have a different
format.

For some rows, the format has the header and its value on the same line:

Date: 12/10/13
Protocol: Selectivity Profile 1	Instrument Name: Gandalf
Scientist: John Smith

Which is different from the 'usual' column we see. Your format is more like
a spreadsheet than a CSV file.

Nonetheless, we would need to extend our current feature set to accommodate
this format.

I could see the client code looking like this:

// row one is a key: value pair
format.addKeyValueRow(1, ":");

// row two is 2 key: value pairs, separated by a tab
format.addKeyValueRow(2, ":", "\t"); // 2 pairs

The args should also be a format object of some kind, like we have a
CSVFormat object now.
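
To show what such a key: value row would have to do, here is a minimal sketch in plain Java. It is not a proposed implementation of addKeyValueRow; KeyValueRow and parse are hypothetical names, and the separators are passed in as in the calls above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class KeyValueRow {
    // Splits a line into pairs on pairSeparator, then splits each pair at the
    // first occurrence of keyValueSeparator. Keys and values are trimmed.
    static Map<String, String> parse(String line, String keyValueSeparator, String pairSeparator) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String pair : line.split(Pattern.quote(pairSeparator))) {
            int at = pair.indexOf(keyValueSeparator);
            if (at < 0) {
                continue; // not a key/value pair; ignored in this sketch
            }
            String key = pair.substring(0, at).trim();
            String value = pair.substring(at + keyValueSeparator.length()).trim();
            pairs.put(key, value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Row two of the sample: two pairs separated by a tab.
        System.out.println(parse("Protocol: Selectivity Profile 1\tInstrument Name: Gandalf", ":", "\t"));
    }
}
```

Splitting at the first separator only means a value that itself contains a colon, such as a time of day, stays intact.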

This seems out of scope for 1.0 if we are itching to get 1.0 out the door.

Gary

Regards,

 Mark





Re: [CSV] Headers and the first record

2013-07-30 Thread Gary Gregory

Hi All:

I see now, the behavior is different depending on what you pass to
withHeader()! Confusing indeed.

If you call withHeader with Strings, the first line is not read and it is
returned as a record.

If you call withHeader with no arguments, the first line _is_ read and it
is NOT returned as a record.

I think I'll change it so that withHeader causes the first line to be
skipped, always, and add an option skipHeaders with a default of true. So
if you really want to set the headers AND see what they are, you can do
that.

Gary


On Tue, Jul 30, 2013 at 3:44 PM, Gary Gregory garydgreg...@gmail.com wrote:

 Hi All:

 I have Excel files with headers. So I use withHeaders() of course to map
 the headers.

 When I call parser.iterator().next(), the first record is the header
 record, not data.

 I always have to skip this first line since it is not data.

 I wonder if:

 1) We should automatically skip the header line for next() and
 parser.getRecords(), or
 2) Add a skipHeader boolean setting to control the above behavior, where
 the default is...?

 (2) is the most flexible.

 Thoughts?

 Gary

Re: [CSV] Headers and the first record

2013-07-30 Thread Gary Gregory

Actually, if you use withHeader(), no args, you _cannot_ get back the first
record, so that makes skipHeader=false not possible without making the
parser track the first record separately.

In the interest of simplicity: if you use withHeader of any kind, then the
first record is read.

Gary


On Tue, Jul 30, 2013 at 4:15 PM, Gary Gregory garydgreg...@gmail.com wrote:

 Hi All:

 I see now, the behavior is different depending on what you pass to
 withHeader()! Confusing indeed.

 If you call withHeader with Strings, the first line is not read and it is
 returned as a record.

 If you call withHeader with no arguments, the first line _is_ read and it
 is NOT returned as a record.

 I think I'll change it so that withHeader causes the first line to be
 skipped, always, and add an option skipHeaders with a default of true. So
 if you really want to set the headers AND see what they are, you can do
 that.

 Gary


 On Tue, Jul 30, 2013 at 3:44 PM, Gary Gregory garydgreg...@gmail.comwrote:

 Hi All:

 I have Excel files with headers. So I use withHeaders() of course to map
 the headers.

 When I call parser.iterator().next(), the first record is the header
 record, not data.

 I always have to skip this first line since it is not data.

 I wonder if:

 1) We should automatically skip the header line for next() and
 parser.getRecords(), or
 2) Add a skipHeader boolean setting to control the above behavior, where
 the default is...?

 (2) is the most flexible.

 Thoughts?

 Gary

Re: [CSV] Headers and the first record

2013-07-30 Thread Emmanuel Bourg

I haven't checked the current code, but the intended behavior was:

- no args: the first record defines the header and is not returned when
iterating

- args: the header is defined independently of the data, all the records
are returned when iterating

Emmanuel Bourg


Le 30/07/2013 22:23, Gary Gregory a écrit :
 Actually, if you use withHeader(), no args, you _cannot_ get back the first
 record, so that makes skipHeader=false not possible without making the
 parser track the first record separately.
 
 In the interest of simplicity, I am going to make it simple: if you use
 withHeader of any kind, then the first record is read.



Re: [CSV] Headers and the first record

2013-07-30 Thread Gary Gregory

On Tue, Jul 30, 2013 at 5:15 PM, Emmanuel Bourg ebo...@apache.org wrote:

 I haven't checked the current code, but the intended behavior was:

 - no args: the first record defines the header and is not returned when
 iterating

 - args: the header is defined independently of the data, all the records
 are returned when iterating


Yeah, that's too clever IMO. I expected the same behavior WRT record
reading with the only difference being if I let the parser guess or not.

The current code now always reads the header line if you set any non-null
header. If you call withHeader() with no args it is a non-null call with an
empty String[].

The idea being that if I use headers and I ask the parser to guess or give
it the headers, I do not need to have the header line as a record.

I plan on adding a setting that allows the header record to be saved for
callers who care.

Gary




Re: [CSV] Headers and the first record

2013-07-30 Thread Gary Gregory

And another thing: internally, the header should be a Set<String>, not a
String[]. I plan on fixing that later too.

Gary



Re: [CSV] Headers and the first record

2013-07-30 Thread Emmanuel Bourg

On 30/07/2013 23:26, Gary Gregory wrote:
> And another thing: internally, the header should be a Set<String>, not a
> String[]. I plan on fixing that later too.

Why should it be a set? Is there an impact on the performance?


Emmanuel Bourg



Re: [CSV] Headers and the first record

2013-07-30 Thread Emmanuel Bourg

On 30/07/2013 23:24, Gary Gregory wrote:

> Yeah, that's too clever IMO. I expected the same behavior WRT record
> reading with the only difference being if I let the parser guess or not.

Too clever? I didn't feel like I designed a rocket with this feature
though :) That's an important feature to me and I'd like to preserve it.

If the header is defined in the file I don't want to skip the first
record manually, the parser should take care of it. That also means the
user code can remain the same, whether the header is defined in the code
or in the file.


> The current code now always reads the header line if you set any non-null
> header. If you call withHeader() with no args it is a non-null call with an
> empty String[].

I guess a null header or an empty header is just the same and means the
first record must be used as the header.

Emmanuel Bourg



Re: [CSV] Headers and the first record

2013-07-30 Thread Gary Gregory

On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg ebo...@apache.org wrote:

> On 30/07/2013 23:26, Gary Gregory wrote:
> > And another thing: internally, the header should be a Set<String>, not a
> > String[]. I plan on fixing that later too.
>
> Why should it be a set? Is there an impact on the performance?


Well, I did not finish my thought on that one, sorry about that. Please
allow me to walk through my use cases. The issue is about the feature, not
performance.

At first glance, using a set avoids an inherent problem with any non-set
data structure: defining duplicates. What does the following mean?

withHeader("A", "B", "C", "A");

It is a recipe for garbage results: record.get("A") returns what?

Today, I added some CSVFormat validation code that checks for duplicate
column names. If you build a format with withHeader("A", "B", "C", "A");
you will get an IllegalStateException (ISE) when validate() is called.
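
A minimal sketch of such a duplicate check, assuming nothing about how the
real CSVFormat.validate() is written. The class name HeaderValidationSketch
and the message text are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the duplicate check; not the actual CSVFormat code. */
final class HeaderValidationSketch {

    /** Throws IllegalStateException as soon as a column name appears twice. */
    static void validate(String... header) {
        Set<String> seen = new HashSet<>();
        for (String name : header) {
            // Set.add returns false when the element was already present.
            if (!seen.add(name)) {
                throw new IllegalStateException(
                        "The header contains a duplicate column name: '" + name + "'");
            }
        }
    }
}
```

Here validate("A", "B", "C") returns normally, while
validate("A", "B", "C", "A") throws.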

If we had withHeader(Set) and documented it as the 'main' way to specify
column names, then we could say that withHeader(String...) is just a
syntactic convenience and turn the String[] into a Set. But that will not
work.

The problem with a Java Set is that it does not guarantee any order, and the
current implementation relies on the order of the String[]. But why? What the
current implementation says is: ignore what the header line of the file is
and use the given column names at the given positions. A perfectly good user
story. So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1,
and so on. OK, that's one usage.

Taking a step back, I want to talk about why the column name order should
matter when you are calling withHeader(). I would like to be able to tell
the parser that I want to use a Set of column names and have it figure out,
based on the header line, the column indices. This is quite different from
what we have now.

A use case I have now is a CSV file with a lot of columns (~90) but I only
care about a small subset of the columns (~10). I'd like to be able to say
withHeader(Set) where the Set may be a subset of the actual column names in
the header line. This is different from withHeader(String[]) because the
names in the Set must match the names in the header record.

So I think it boils down to ignoring my comment about using a Set
internally and adding a feature where I can tell the parser that I want to
use a set of column names and not worry about the order, because the parser
will match up the column names when it reads the header line.
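
A rough sketch of that matching step, assuming the header line has already
been split into fields. ColumnSubsetSketch and resolve are hypothetical
names, and failing on a missing column is just one possible policy.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: resolves a set of wanted column names against a header record. */
final class ColumnSubsetSketch {

    /** Maps each wanted name to its index in the header record, in file order. */
    static Map<String, Integer> resolve(List<String> headerRecord, Set<String> wanted) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.size(); i++) {
            String name = headerRecord.get(i);
            if (wanted.contains(name)) {
                indices.put(name, i);
            }
        }
        if (indices.size() != wanted.size()) {
            // Assumption of this sketch: a wanted column missing from the file is an error.
            Set<String> missing = new HashSet<>(wanted);
            missing.removeAll(indices.keySet());
            throw new IllegalArgumentException("Columns not found in the header line: " + missing);
        }
        return indices;
    }
}
```

For a header of id, name, email, phone and a wanted set of phone and name,
the result maps name to 1 and phone to 3, regardless of the order in which
the set was built.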

Gary





Re: [CSV] Headers and the first record

2013-07-30 Thread Gary Gregory

On Tue, Jul 30, 2013 at 5:47 PM, Emmanuel Bourg ebo...@apache.org wrote:

> On 30/07/2013 23:24, Gary Gregory wrote:
>
> > Yeah, that's too clever IMO. I expected the same behavior WRT record
> > reading with the only difference being if I let the parser guess or not.
>
> Too clever? I didn't feel like I designed a rocket with this feature
> though :) That's an important feature to me and I'd like to preserve it.
>
> If the header is defined in the file I don't want to skip the first
> record manually, the parser should take care of it. That also means the
> user code can remain the same, whether the header is defined in the code
> or in the file.


Let me reply to this part tomorrow (it's late here ;)




> > The current code now always reads the header line if you set any non-null
> > header. If you call withHeader() with no args it is a non-null call with
> > an empty String[].
>
> I guess a null header or an empty header is just the same and means the
> first record must be used as the header.


It is not the same at all. A null header String[] is different from a
length 0 array. It's been like that for a while.
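
To make the distinction concrete, here is a small sketch. In Java, calling a
varargs method with no arguments passes a zero-length array, not null. The
enum and method names below are hypothetical, and reading null as "no header
handling" is an interpretation of this thread, not a quote from the code.

```java
/** Hypothetical model of the three header states; not the Commons CSV implementation. */
final class HeaderStateSketch {

    enum Mode { NO_HEADER, FROM_FIRST_RECORD, EXPLICIT_NAMES }

    /** Classifies a header array the way this thread describes it. */
    static Mode modeOf(String[] header) {
        if (header == null) {
            return Mode.NO_HEADER;          // nothing was configured
        }
        if (header.length == 0) {
            return Mode.FROM_FIRST_RECORD;  // withHeader() with no args
        }
        return Mode.EXPLICIT_NAMES;         // withHeader("A", "B", ...)
    }

    /** Varargs entry point: a call with no arguments receives an empty array. */
    static Mode withHeader(String... header) {
        return modeOf(header);
    }
}
```

So withHeader() gives FROM_FIRST_RECORD, withHeader("A", "B") gives
EXPLICIT_NAMES, and only an explicit withHeader((String[]) null) gives
NO_HEADER.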

Gary




Re: [csv] Headers

2012-03-15 Thread Benedikt Ritter

On 15 March 2012 01:58, Emmanuel Bourg ebo...@apache.org wrote:
> There is another alternative, we might replace the records returned as a
> String[] by a CSVRecord class able to access the fields by id or by name.
> This would be similar to a JDBC ResultSet (except for the looping logic)

Sounds good. This discussion showed that a record is more than a
String array, so having a specialized class is a good idea.

> This avoids the duplication of the parser, which might still be generified
> later to support custom beans.
>
> The example becomes:
>
>   CSVFormat format = CSVFormat.DEFAULT.withHeader();
>
>   for (CSVRecord record : format.parse(in)) {
>       Person person = new Person();
>       person.setName(record.get("name"));
>       person.setEmail(record.get("email"));
>       person.setPhone(record.get("phone"));
>       persons.add(person);
>   }
>
> The record is not a Map to keep it simple, it only exposes 3 methods:
> get(int), get(String) and size()


I'm not sure I understand the approach completely. The header cannot
be accessed as a CSVRecord, right? CSVRecords know the header
values through get(String). What happens if the format does not
support a header? UnsupportedOperationException?
If I got you right, we could use getHeaders() to find out which header
values are available.

Maybe it would be useful to have the record implement Iterable as well.

Benedikt


Re: [csv] Headers

2012-03-15 Thread Emmanuel Bourg


On 15/03/2012 08:55, Benedikt Ritter wrote:

> I'm not sure if I understand the approach completely. The Header can
> not be accessed as a CSVRecord, right? CSVRecords know the header
> values through get(String). What happens if the format does not
> support a header? UnsupportedOperationException?

Yes, or IllegalStateException.

> If I got you right, we could use getHeaders() to know, which header
> values are available.

The actual header would be returned by parser.getHeader().

> Maybe it would be useful to have the record implement Iterable as well.

Or have a method return the array of values if you want to iterate over it.
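
The two options can sit side by side. The sketch below is hypothetical
(IterableRecordSketch and values() are names chosen for the example, not the
agreed API) and only shows what each choice would look like to a caller.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical record showing both ways of walking over the values. */
final class IterableRecordSketch implements Iterable<String> {

    private final String[] values;

    IterableRecordSketch(String... values) {
        // Defensive copy so later changes to the caller's array do not leak in.
        this.values = values.clone();
    }

    /** Option 1: the record is Iterable, so it works in a for-each loop. */
    @Override
    public Iterator<String> iterator() {
        return Arrays.asList(values).iterator();
    }

    /** Option 2: hand out a copy of the values and let the caller iterate over it. */
    String[] values() {
        return values.clone();
    }
}
```

Either "for (String v : record)" or "for (String v : record.values())" then
visits the fields in column order.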

Emmanuel Bourg




Re: [csv] Headers

2012-03-14 Thread Emmanuel Bourg

There is another alternative: we might replace the records returned as a
String[] by a CSVRecord class able to access the fields by index or by
name. This would be similar to a JDBC ResultSet (except for the looping
logic)


This avoids the duplication of the parser, which might still be 
generified later to support custom beans.


The example becomes:

  CSVFormat format = CSVFormat.DEFAULT.withHeader();

  for (CSVRecord record : format.parse(in)) {
      Person person = new Person();
      person.setName(record.get("name"));
      person.setEmail(record.get("email"));
      person.setPhone(record.get("phone"));
      persons.add(person);
  }

The record is not a Map to keep it simple, it only exposes 3 methods: 
get(int), get(String) and size()
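
A minimal sketch of a record with exactly those three methods. RecordSketch
is a hypothetical stand-in, not the CSVRecord that was eventually written;
the IllegalStateException for a missing header follows a suggestion made
elsewhere in this thread, and the IllegalArgumentException for an unknown
name is an arbitrary choice of this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical three-method record; not the real CSVRecord class. */
final class RecordSketch {

    private final Map<String, Integer> mapping; // column name -> index, or null without a header
    private final String[] values;

    RecordSketch(Map<String, Integer> mapping, String... values) {
        this.mapping = mapping == null ? null : new HashMap<>(mapping);
        this.values = values.clone();
    }

    /** Access by position. */
    String get(int index) {
        return values[index];
    }

    /** Access by column name; needs a header mapping. */
    String get(String name) {
        if (mapping == null) {
            throw new IllegalStateException("No header was defined, cannot look up '" + name + "'");
        }
        Integer index = mapping.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column name: '" + name + "'");
        }
        return values[index];
    }

    /** Number of fields in this record. */
    int size() {
        return values.length;
    }
}
```

With a mapping of name to 0 and email to 1, get("email") and get(1) return
the same field.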


Emmanuel Bourg





Re: [csv] Headers

2012-03-13 Thread Luc Maisonobe

On 13/03/2012 00:56, sebb wrote:
> On 12 March 2012 22:11, Emmanuel Bourg ebo...@apache.org wrote:
> > [csv] is missing some elements to ease the use of headers. I have no clear
> > idea on how to address this, here are my thoughts.
> >
> > Headers are used when the fields are accessed by the column name rather
> > than by the index. This provides some flexibility because the input file
> > can be slightly modified by reordering the columns or by inserting new
> > columns without breaking the existing code.
> >
> > Using the current API here is how one would work with headers:
> >
> >   CSVParser parser = new CSVParser(in);
> >   Iterator<String[]> it = parser.iterator();
> >
> >   // read the header
> >   String[] header = it.next();
> >
> >   // build a name to index mapping
> >   Map<String, Integer> mapping = new HashMap<>();
> >   for (int i = 0; i < header.length; i++) {
> >       mapping.put(header[i], i);
> >   }
> >
> >   // parse the records
> >   for (String[] record : parser) {
> >       Person person = new Person();
> >       person.setName(record[mapping.get("name")]);
> >       person.setEmail(record[mapping.get("email")]);
> >       person.setPhone(record[mapping.get("phone")]);
> >       persons.add(person);
> >   }
> >
> > The user has to take care of the mapping, which is not very friendly. I
> > have several solutions in mind:
> >
> > 1. Do nothing and address it in the next release with the bean mapping.
> > Parsing the file would then look like this:
> >
> >   CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
> >   for (Person person : format.parse(in)) {
> >       persons.add(person);
> >   }
>
> Does this automatically mean that the file has a header?
> Or is there another way to link columns to Person attributes?
>
> I don't think this should be the only way of handling named columns;
> it's not always convenient to create a type.

I agree. Sometimes, the columns are just a part of a class that would
need other parameters not in the columns (but perhaps in a custom
comment of the header, if these parameters are constant throughout the
file). So providing an intermediate-level API (with the mapping already
done, but still with access to individual fields) is a must.

 
> > 2. Add a parser returning a Map instead of a String[]
> >
> >   // declare the header in the format,
> >   // the header line will be parsed automatically
> >   CSVFormat format = CSVFormat.DEFAULT.withHeader();
> >
> >   for (Map<String, String> record : new CSVMapParser(in, format)) {
> >       Person person = new Person();
> >       person.setName(record.get("name"));
> >       person.setEmail(record.get("email"));
> >       person.setPhone(record.get("phone"));
> >       persons.add(person);
> >   }
>
> That seems OK; one can also just use the column values directly.

+1


Luc

 


Re: [csv] Headers

2012-03-13 Thread Jörg Schaible

Emmanuel Bourg wrote:

> On 13/03/2012 00:56, sebb wrote:
>
> > > 1. Do nothing and address it in the next release with the bean mapping.
> > > Parsing the file would then look like this:
> > >
> > >   CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
> > >   for (Person person : format.parse(in)) {
> > >       persons.add(person);
> > >   }
> >
> > Does this automatically mean that the file has a header?
> > Or is there another way to link columns to Person attributes?
>
> If the file doesn't have a header, the fields are matched by index
> (either the natural ordering of the attributes in the class, or
> specified by an annotation).
>
> If the file has a header, the fields are matched by attribute name, and
> an annotation can override the name of the column associated to an
> attribute.

Yeah, but that's not required. Just because you can read the names of the
columns does not mean that you want to address them by name. Why pay the
price for creating the map and accessing the values by name just for
one-time information?

- Jörg



Re: [csv] Headers

2012-03-13 Thread Emmanuel Bourg


On 13/03/2012 09:21, Jörg Schaible wrote:

> > If the file has a header, the fields are matched by attribute name, and
> > an annotation can override the name of the column associated to an
> > attribute.
>
> Yeah, but that's not required. Just because you can read the names of the
> columns does not mean that you want to address them by name. Why pay the
> price for creating the map and accessing the values by name just for a
> one-time information?


Sorry I forgot the end of my message, I meant to access the fields by 
name OR by index when the header is present. That would be configured 
with the annotations.


Emmanuel Bourg




Re: [csv] Headers

2012-03-13 Thread sebb

On 13 March 2012 08:52, Emmanuel Bourg ebo...@apache.org wrote:
> On 13/03/2012 09:21, Jörg Schaible wrote:
>
> > > If the file has a header, the fields are matched by attribute name, and
> > > an annotation can override the name of the column associated to an
> > > attribute.
> >
> > Yeah, but that's not required. Just because you can read the names of the
> > columns does not mean that you want to address them by name. Why pay the
> > price for creating the map and accessing the values by name just for a
> > one-time information?
>
> Sorry I forgot the end of my message, I meant to access the fields by name
> OR by index when the header is present. That would be configured with the
> annotations.

It needs to be possible to access columns by index without having to
use annotations.


Re: [csv] Headers

2012-03-13 Thread Emmanuel Bourg


On 13/03/2012 09:56, sebb wrote:

> It needs to be possible to access columns by index without having to
> use annotations.


That's still possible with the low level API. I'm just exploring the 
features I would expect of a bean mapping.


Emmanuel Bourg




Re: [csv] Headers

2012-03-13 Thread Benedikt Ritter

I think transforming the result of the parse process into instances of
some class is a different concern. That should not be part of a
CSVParser. In Hibernate they use ResultTransformers for this purpose
[1]. I think we should separate these concerns as well.

[1] 
http://docs.jboss.org/hibernate/orm/3.3/api/org/hibernate/transform/ResultTransformer.html
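
A sketch of what that separation could look like, loosely modeled on the
idea behind Hibernate's ResultTransformer. The interface RecordTransformer
and the method transformAll are hypothetical names; nothing here is an
existing [csv] API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: parsing and object creation kept as separate steps. */
final class TransformSketch {

    /** Turns one already-parsed row into a T; knows nothing about CSV syntax. */
    interface RecordTransformer<T> {
        T transform(String[] header, String[] values);
    }

    /** Applies the transformer to rows that some parser has produced earlier. */
    static <T> List<T> transformAll(String[] header, List<String[]> rows,
                                    RecordTransformer<T> transformer) {
        List<T> result = new ArrayList<>();
        for (String[] row : rows) {
            result.add(transformer.transform(header, row));
        }
        return result;
    }
}
```

The parser would only ever produce String[] rows; a caller who wants Person
objects passes a transformer such as
"(header, values) -> new Person(values[0], values[1])".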


Re: [csv] Headers

2012-03-12 Thread sebb

On 12 March 2012 22:11, Emmanuel Bourg ebo...@apache.org wrote:
> [csv] is missing some elements to ease the use of headers. I have no clear
> idea on how to address this, here are my thoughts.
>
> Headers are used when the fields are accessed by the column name rather than
> by the index. This provides some flexibility because the input file can be
> slightly modified by reordering the columns or by inserting new columns
> without breaking the existing code.
>
> Using the current API here is how one would work with headers:
>
>   CSVParser parser = new CSVParser(in);
>   Iterator<String[]> it = parser.iterator();
>
>   // read the header
>   String[] header = it.next();
>
>   // build a name to index mapping
>   Map<String, Integer> mapping = new HashMap<>();
>   for (int i = 0; i < header.length; i++) {
>       mapping.put(header[i], i);
>   }
>
>   // parse the records
>   for (String[] record : parser) {
>       Person person = new Person();
>       person.setName(record[mapping.get("name")]);
>       person.setEmail(record[mapping.get("email")]);
>       person.setPhone(record[mapping.get("phone")]);
>       persons.add(person);
>   }
>
> The user has to take care of the mapping, which is not very friendly. I have
> several solutions in mind:
>
> 1. Do nothing and address it in the next release with the bean mapping.
> Parsing the file would then look like this:
>
>   CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
>   for (Person person : format.parse(in)) {
>       persons.add(person);
>   }


Does this automatically mean that the file has a header?
Or is there another way to link columns to Person attributes?

I don't think this should be the only way of handling named columns;
it's not always convenient to create a type.

> 2. Add a parser returning a Map instead of a String[]
>
>   // declare the header in the format,
>   // the header line will be parsed automatically
>   CSVFormat format = CSVFormat.DEFAULT.withHeader();
>
>   for (Map<String, String> record : new CSVMapParser(in, format)) {
>       Person person = new Person();
>       person.setName(record.get("name"));
>       person.setEmail(record.get("email"));
>       person.setPhone(record.get("phone"));
>       persons.add(person);
>   }

That seems OK; one can also just use the column values directly.


> 2bis. Have the same CSVParser class returning String[] or Map<String,
> String> depending on a generic parameter. Not sure it's possible with type
> erasure.


It's not possible for two methods to differ only by return type, so this
can only be done if the method parameters are different after type
erasure.
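
A small sketch of the constraint. The class and method names are
hypothetical, and the comma split is a deliberately naive stand-in for real
CSV parsing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of why the return type alone cannot select a method. */
final class ErasureSketch {

    // Declaring both of these would not compile: same name, same parameters,
    // only the return type differs.
    //   static List<String[]> parse(List<String> lines) { ... }
    //   static List<Map<String, String>> parse(List<String> lines) { ... }

    /** Distinct method names sidestep the clash. */
    static List<String[]> parseAsArrays(List<String> lines) {
        List<String[]> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.split(",", -1)); // naive split, no quoting support
        }
        return out;
    }

    /** Uses the first line as the header and returns one map per remaining line. */
    static List<Map<String, String>> parseAsMaps(List<String> lines) {
        List<Map<String, String>> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        String[] header = lines.get(0).split(",", -1);
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(",", -1);
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < header.length && i < values.length; i++) {
                record.put(header[i], values[i]);
            }
            out.add(record);
        }
        return out;
    }

    /** At run time both element types collapse to the same class. */
    static boolean sameRuntimeClass() {
        return new ArrayList<String[]>().getClass()
                == new ArrayList<Map<String, String>>().getClass();
    }
}
```

sameRuntimeClass() returns true, which is the erasure in action: the generic
parameter is not available at run time to decide what the parser should
return.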

> 3. Have the parser maintain the name-index mapping. The parser reads the
> first line automatically if the format declares a header, and a
> getColumnIndex() method is exposed.
>
>   CSVFormat format = CSVFormat.DEFAULT.withHeader();
>   CSVParser parser = new CSVParser(in, format);
>
>   // parse the records
>   for (String[] record : parser) {
>       Person person = new Person();
>       person.setName(record[parser.getColumnIndex("name")]);
>       person.setEmail(record[parser.getColumnIndex("email")]);
>       person.setPhone(record[parser.getColumnIndex("phone")]);
>       persons.add(person);
>   }

Quite awkward to use.


> What do you think?
>
> Emmanuel Bourg



Re: [csv] Headers

2012-03-12 Thread Emmanuel Bourg


On 13/03/2012 00:56, sebb wrote:

> > 1. Do nothing and address it in the next release with the bean mapping.
> > Parsing the file would then look like this:
> >
> >   CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
> >   for (Person person : format.parse(in)) {
> >       persons.add(person);
> >   }
>
> Does this automatically mean that the file has a header?
> Or is there another way to link columns to Person attributes?


If the file doesn't have a header, the fields are matched by index 
(either the natural ordering of the attributes in the class, or 
specified by an annotation).


If the file has a header, the fields are matched by attribute name, and 
an annotation can override the name of the column associated to an 
attribute.
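
A rough sketch of the header case only, using reflection. Everything here is
hypothetical: the @Column annotation, the Person fields and the map-based
input are assumptions made for the example. The index-based case is left out
because Class.getDeclaredFields() does not promise any particular order, so
a "natural ordering" would need something more than plain reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;

/** Hypothetical bean mapping by column name; not a Commons CSV feature. */
final class BeanMappingSketch {

    /** Hypothetical annotation overriding the column name bound to a field. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Column {
        String value();
    }

    /** Example bean: "name" matches by field name, "email" by annotation. */
    static final class Person {
        String name;
        @Column("e-mail") String email;
    }

    /** Creates a bean and fills its String fields from a column-name-to-value map. */
    static <T> T map(Class<T> type, Map<String, String> record) {
        try {
            Constructor<T> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            T bean = constructor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())
                        || field.getType() != String.class) {
                    continue; // this sketch only handles plain String attributes
                }
                Column column = field.getAnnotation(Column.class);
                String columnName = column != null ? column.value() : field.getName();
                if (record.containsKey(columnName)) {
                    field.setAccessible(true);
                    field.set(bean, record.get(columnName));
                }
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map record to " + type.getName(), e);
        }
    }
}
```

Given a record with the columns "name" and "e-mail", map(Person.class,
record) fills both fields, the second one through the annotation.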



Emmanuel Bourg


