Re: [CSV] Headers and the first record
2013/7/31 Gary Gregory garydgreg...@gmail.com

> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg ebo...@apache.org wrote:
>
>> On 30/07/2013 23:26, Gary Gregory wrote:
>>
>>> And another thing: internally, the header should be a Set<String>, not a String[]. I plan on fixing that later too.
>>
>> Why should it be a set? Is there an impact on the performance?
>
> Well, I did not finish my thought on that one, sorry about that; please allow me to walk through my use cases.
>
> The issue is about the feature, not performance.
>
> At first glance, using a set avoids an inherent problem with any non-set data structure: defining duplicates. What does the following mean?
>
> withHeader("A", "B", "C", "A");
>
> It is a recipe for garbage results: record.get("A") returns what?
>
> Today, I added some CSVFormat validation code that checks for duplicate column names. If you build a format with withHeader("A", "B", "C", "A"); you will get an ISE when validate() is called.
>
> If we had withHeader(Set) and documented it as the 'main' way to specify column names, then we could say that withHeader(String...) is just a syntactic convenience that turns the String[] into a Set.
>
> But that will not work. The problem with a Java Set is that it is not ordered and the current implementation relies on the order of the String[].
>
> But why? What the current implementation says is: ignore what the header line of the file is and use the given column names at the given positions. A perfectly good user story. So for withHeader("A", "B", "C"), A is column 0, B is column 1, and so on. OK, that's one usage.
>
> Taking a step back, I want to talk about why the column name order should matter when you are calling withHeader(). I would like to be able to tell the parser that I want to use a Set of column names and have it figure out, based on the header line, the column indices. This is quite different from what we have now.
>
> A use case I have now is a CSV file with a lot of columns (~90) but I only care about a small subset of the columns (~10). I'd like to be able to say withHeader(Set) where the Set may be a subset of the actual column names in the header line. This is different from withHeader(String[]) because the names in the Set must match the names in the header record.

I'm not sure if we should try to build all these different cases (guessing headers, using the first record as headers, only using a subset of the available headers) into one implementation. What you are talking about sounds more like a view or a projection of the actual content being parsed. Do we really need this for 1.0 or can it be postponed?

> So I think it boils down to ignoring my comment about using a Set internally and adding a feature where I can tell the parser that I want to use a set of column names and not worry about the order, because the parser will match up the column names when it reads the header line.
>
> Gary
>
>> Emmanuel Bourg
>>
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>
> --
> E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
> Java Persistence with Hibernate, Second Edition http://www.manning.com/bauer3/
> JUnit in Action, Second Edition http://www.manning.com/tahchiev/
> Spring Batch in Action http://www.manning.com/templier/
> Blog: http://garygregory.wordpress.com
> Home: http://garygregory.com/
> Tweet! http://twitter.com/GaryGregory

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
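The duplicate-name check and the ordering concern raised in this message can be sketched in plain Java. This is only an illustration, not Commons CSV code: the class `HeaderNames` and the method `toOrderedSet` are hypothetical names. It relies on the fact that `java.util.LinkedHashSet` iterates in insertion order, an option the thread itself does not discuss, and it throws `IllegalStateException` to mirror the ISE mentioned above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Commons CSV.
final class HeaderNames {

    // Turns a String... header into a Set that keeps the given order
    // and rejects duplicate column names.
    static Set<String> toOrderedSet(String... names) {
        Set<String> ordered = new LinkedHashSet<>();
        for (String name : names) {
            // add() returns false when the name is already present
            if (!ordered.add(name)) {
                throw new IllegalStateException("Duplicate header name: " + name);
            }
        }
        return ordered;
    }
}
```

With this shape, a call such as `HeaderNames.toOrderedSet("A", "B", "C", "A")` fails early, while a duplicate-free header keeps its positional meaning (A is column 0, B is column 1, and so on).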
Re: [CSV] Headers and the first record
On 31 July 2013 08:38, Benedikt Ritter brit...@apache.org wrote:

> 2013/7/31 Gary Gregory garydgreg...@gmail.com
>
> <snip>
>
>> A use case I have now is a CSV file with a lot of columns (~90) but I only care about a small subset of the columns (~10). I'd like to be able to say withHeader(Set) where the Set may be a subset of the actual column names in the header line. This is different from withHeader(String[]) because the names in the Set must match the names in the header record.
>
> I'm not sure if we should try to build all these different cases (guessing headers, using the first record as headers, only using a subset of the available headers) into one implementation. What you are talking about sounds more like a view or a projection of the actual content being parsed. Do we really need this for 1.0 or can it be postponed?

Agreed, this is something that needs more work before it could be included. There will always be some extra item that would be nice to have; this seems non-essential to me.
Re: [CSV] Headers and the first record
On Jul 31, 2013, at 3:38, Benedikt Ritter brit...@apache.org wrote:

> 2013/7/31 Gary Gregory garydgreg...@gmail.com
>
> <snip>
>
>> A use case I have now is a CSV file with a lot of columns (~90) but I only care about a small subset of the columns (~10). I'd like to be able to say withHeader(Set) where the Set may be a subset of the actual column names in the header line. This is different from withHeader(String[]) because the names in the Set must match the names in the header record.
>
> I'm not sure if we should try to build all these different cases (guessing headers, using the first record as headers, only using a subset of the available headers) into one implementation. What you are talking about sounds more like a view or a projection of the actual content being parsed. Do we really need this for 1.0 or can it be postponed?

This is a real scenario and a real need, not some imaginary complication ;)

Even if it is not implemented for 1.0, we should talk about how it should be done so that it fits in and does not cause API problems later. And if I can get it done by then, so much the better.

Gary
Re: [CSV] Headers and the first record
On Tue, Jul 30, 2013 at 5:47 PM, Emmanuel Bourg ebo...@apache.org wrote:

> On 30/07/2013 23:24, Gary Gregory wrote:
>
>> Yeah, that's too clever IMO. I expected the same behavior WRT record reading with the only difference being if I let the parser guess or not.
>
> Too clever? I didn't feel like I designed a rocket with this feature though :)
>
> That's an important feature to me and I'd like to preserve it. If the header is defined in the file I don't want to skip the first record manually, the parser should take care of it.

But that is exactly what _was_ happening! ;)

If I called withHeader("A", "B", "C") the header was not skipped.
If I called withHeader(new String[]{}) the header was skipped.
If I called withHeader() the header was skipped (same as line above).

In both cases, I am telling the parser that there is a header, but it was skipped in only one of them. That's the inconsistency I fixed.

What I am asking is: should we have a saveHeader setting such that IF you ask for headers, then we save that record in the parser? It is currently lost or, more precisely, transformed into the header map.

Gary

> That also means the user code can remain the same, whether the header is defined in the code or in the file.
>
>> The current code now always reads the header line if you set any non-null header. If you call withHeader() with no args it is a non-null call with an empty String[].
>
> I guess a null header or an empty header is just the same and means the first record must be used as the header.
>
> Emmanuel Bourg
Re: [CSV] Headers and the first record
On 31/07/2013 15:08, Gary Gregory wrote:

> But that is exactly what _was_ happening! ;)
>
> If I called withHeader("A", "B", "C") the header was not skipped.

Sounds good. The header is defined in the code, we don't expect to see the header in the file so nothing is skipped.

> If I called withHeader(new String[]{}) the header was skipped.

Correct. The header is not defined in the code, the parser uses the first record as header and doesn't return it when iterating.

> If I called withHeader() the header was skipped (same as line above).

Sounds good too. What was the issue again? ;)

> What I am asking is: should we have a saveHeader setting such that IF you ask for headers, then we save that record in the parser, it is currently lost, or, actually transformed into the header map.

Keeping the header around might be useful, I wouldn't create a format parameter for this though. It could be made available at the record level, much like ResultSet.getMetaData().

Emmanuel Bourg
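The idea of exposing the header at the record level, in the spirit of ResultSet.getMetaData(), might look roughly like the following. This is a sketch under stated assumptions, not the Commons CSV record class: `SketchRecord`, `headerMapOf`, and `headerNames` are hypothetical names, and the header map is assumed to be built once per parse and shared by every record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type, not the Commons CSV CSVRecord.
final class SketchRecord {
    private final Map<String, Integer> headerMap; // shared by all records of one parse
    private final String[] values;

    SketchRecord(Map<String, Integer> headerMap, String... values) {
        this.headerMap = headerMap;
        this.values = values;
    }

    // Builds the name-to-index map once, in header order.
    static Map<String, Integer> headerMapOf(String... names) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            map.put(names[i], i);
        }
        return map;
    }

    // Looks a value up by column name; unknown names fail fast.
    String get(String name) {
        Integer index = headerMap.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return values[index];
    }

    // The "metadata" accessor: the header stays reachable from every record.
    List<String> headerNames() {
        return new ArrayList<>(headerMap.keySet());
    }
}
```

Because every record only holds a reference to the same map, keeping the header around this way costs no extra format parameter and very little memory.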
Re: [CSV] Headers and the first record
On Wed, Jul 31, 2013 at 8:58 AM, Gary Gregory garydgreg...@gmail.com wrote:

> On Jul 31, 2013, at 3:38, Benedikt Ritter brit...@apache.org wrote:
>
> <snip>
>
>> What you are talking about sounds more like a view or a projection of the actual content being parsed. Do we really need this for 1.0 or can it be postponed?
>
> This is a real scenario and a real need, not some imaginary complication ;)

But I could work with the current framework and use withHeader(new String[]{}) and let the parser find the headers. Then I can just do record.get("A") with the columns I care about. It just feels a little more mysterious.

I think the only wrinkle left for me is that I want validation that the columns I care about are there. Right now get(String) throws IllegalArgumentException if you give it an unknown column, which will fail fast enough on the first record.

So I'll go down that road until the next speed bump...

Gary

> Even if it is not implemented for 1.0, we should talk about how it should be done so that it fits in and does not cause API problems later. And if I can get it done by then, so much the better.
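The validation wish in this message (check up front that the columns the application cares about are present, instead of waiting for get(String) to throw on the first record) can be sketched as follows. The names `RequiredColumns`, `indexHeader`, `missing`, and `requireAll` are hypothetical and are not Commons CSV API; the sketch assumes the header record has already been read into a String[].

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Commons CSV.
final class RequiredColumns {

    // Maps each column name in the header record to its index.
    static Map<String, Integer> indexHeader(String... headerRecord) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            map.put(headerRecord[i], i);
        }
        return map;
    }

    // Returns the required names that the header does not contain.
    static Set<String> missing(Map<String, Integer> headerMap, Collection<String> required) {
        Set<String> absent = new LinkedHashSet<>(required);
        absent.removeAll(headerMap.keySet());
        return absent;
    }

    // Fails before any data record is read if a required column is absent.
    static void requireAll(Map<String, Integer> headerMap, Collection<String> required) {
        Set<String> absent = missing(headerMap, required);
        if (!absent.isEmpty()) {
            throw new IllegalArgumentException("Missing columns: " + absent);
        }
    }
}
```

The required names can be listed in any order and can be a small subset of a wide header, which matches the ~10-of-~90 use case described earlier in the thread.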
[CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)
<snip>

>>> A use case I have now is a CSV file with a lot of columns (~90) but I only care about a small subset of the columns (~10). I'd like to be able to say withHeader(Set) where the Set may be a subset of the actual column names in the header line. This is different from withHeader(String[]) because the names in the Set must match the names in the header record.
>>
>> What you are talking about sounds more like a view or a projection of the actual content being parsed. Do we really need this for 1.0 or can it be postponed?
>
> This is a real scenario and a real need, not some imaginary complication ;)
>
> Even if it is not implemented for 1.0, we should talk about how it should be done such that it fits in and does not cause API problems later. And if I can get it done by then, then that much the better.

Okay, then let's discuss this on a new thread :-)

As I've said, I think we should not push too much into withHeader(String...). Maybe this is some sort of view, where you can pass a parser and the headers you are interested in and it will return an Iterable<CSVRecord> (or CSVParser) that just gives access to the specified headers you are interested in?

Would it be possible to give a code example of what you have to do with the current API in your use case and what you want?

Benedikt
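One way to picture the "view" idea is a small projection object that resolves the wanted column names against the header record once and then exposes only those columns for each row. This is a sketch of the concept only: `ColumnProjection` and `apply` are hypothetical names, rows are plain String[] arrays rather than CSVRecord instances, and nothing here reflects how Commons CSV would actually implement such a feature.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical projection over parsed rows, not part of Commons CSV.
final class ColumnProjection {
    private final String[] names;
    private final int[] indices;

    // Resolves each wanted name to its position in the header record.
    ColumnProjection(String[] headerRecord, String... wanted) {
        List<String> header = Arrays.asList(headerRecord);
        this.names = wanted.clone();
        this.indices = new int[wanted.length];
        for (int i = 0; i < wanted.length; i++) {
            int index = header.indexOf(wanted[i]);
            if (index < 0) {
                throw new IllegalArgumentException("Column not in header: " + wanted[i]);
            }
            indices[i] = index;
        }
    }

    // Returns only the wanted columns of one row, keyed by column name.
    Map<String, String> apply(String[] row) {
        Map<String, String> view = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            view.put(names[i], row[indices[i]]);
        }
        return view;
    }
}
```

Because the indices are resolved from the header record, the order in which the wanted names are passed does not have to match the column order in the file.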
Re: [CSV] Headers and the first record
On Wed, Jul 31, 2013 at 9:34 AM, Emmanuel Bourg ebo...@apache.org wrote:

> On 31/07/2013 15:08, Gary Gregory wrote:
>
>> But that is exactly what _was_ happening! ;)
>>
>> If I called withHeader("A", "B", "C") the header was not skipped.
>
> Sounds good. The header is defined in the code, we don't expect to see the header in the file so nothing is skipped.

NOT good! ;) This is where we disagree. The parser used to behave differently depending on the contents of the String[].

- From an API design standpoint, it's smelly to me.
- The feature is hard to understand.

If we want that, we need two APIs for two behaviors.

Using the withHeader API, I can tell the parser one of two things:

- Ignore the fact that there is a header record, I am overriding it with my own names.
- There is no header record, so I am telling you what the header names are.

These two features clash because in one case the file has a header line and in the other the file does not. This is why we need settings with different names. That, or a setting that says "skip the first record, it's the header, I do not want to see it as a data record".

I see three scenarios:

1) I set the headers (the file does not have one), do not skip the first record
2) I override the existing header record, skip the first record
3) The parser guesses the headers based on reading the first record, which skips the first record as a data record

This can be accommodated with a skipHeaderRecord boolean setting. I do not care what the default behavior is as long as I can say "this file has headers, guess them please, and skip record 0" and "this file has a header record, but I'm telling you to call them A, B, and C, so skip record 0".

1) withHeader("A", "B", "C").skipHeaderRecord(false);
2) withHeader("A", "B", "C").skipHeaderRecord(true);
3) withHeader()

Is there a better name for skipHeaderRecord? Maybe:

1b) withHeader("A", "B", "C").firstRecordIsHeader(false);
2b) withHeader("A", "B", "C").firstRecordIsHeader(true);

Here the difference is that the API does not describe behavior; instead it describes the data, and behavior is implied.

There is also:

1c) withHeader("A", "B", "C")
2c) withHeaderOverride("A", "B", "C")

Thoughts?

Gary
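The three scenarios listed above can be modelled with a few lines of plain Java to make the proposed semantics concrete. This is a toy model and not the parser's implementation: `HeaderScenarios`, `Result`, and `read` are hypothetical names, rows are assumed to be already split into String[] values, and an empty header array stands for "guess the names from the first record", as described in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the three header scenarios, not Commons CSV code.
final class HeaderScenarios {

    static final class Result {
        final List<String> names;   // column names in effect
        final List<String[]> data;  // records returned as data

        Result(List<String> names, List<String[]> data) {
            this.names = names;
            this.data = data;
        }
    }

    // header.length == 0 means: take the names from the first record (scenario 3).
    // Otherwise the given names are used and skipHeaderRecord decides whether
    // record 0 is dropped (scenario 2) or kept as data (scenario 1).
    static Result read(List<String[]> rows, boolean skipHeaderRecord, String... header) {
        boolean guess = header.length == 0;
        List<String> names;
        if (!guess) {
            names = Arrays.asList(header);
        } else if (rows.isEmpty()) {
            names = Collections.emptyList();
        } else {
            names = Arrays.asList(rows.get(0));
        }
        boolean dropFirst = (guess || skipHeaderRecord) && !rows.isEmpty();
        int firstData = dropFirst ? 1 : 0;
        return new Result(names, new ArrayList<>(rows.subList(firstData, rows.size())));
    }
}
```

In this model `read(rows, false, "A", "B")` keeps every record as data, `read(rows, true, "A", "B")` drops record 0 but still uses the supplied names, and `read(rows, false)` both takes its names from record 0 and drops it.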
Re: [CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)
On Wed, Jul 31, 2013 at 10:42 AM, Benedikt Ritter brit...@apache.org wrote:

> <snip>
>
> As I've said, I think we should not push too much into withHeader(String...). Maybe this is some sort of view, where you can pass a parser and the headers you are interested in and it will return an Iterable<CSVRecord> (or CSVParser) that just gives access to the specified headers you are interested in?
>
> Would it be possible to give a code example of what you have to do with the current API in your use case and what you want?

I am switching to withHeader() with no args (same as new String[]{}), letting the parser guess the headers, and then praying that the names match between the app and the files. That is just as unsafe as forcing the headers in a fixed order on the parser, because the column order might have changed.

Ideally, the column order should not matter, which it does not when you do a record.get(String), which is nice.

Calling withHeader() with no args is less brittle than calling it with 90 args. The benefit is that the column order in the file can change without affecting the app, which is good.

I could use a little more bullet-proofing by making the column names optionally case-insensitive, but that's a different feature.

Ideally, I want to define the column names in the app as a simple Java enum, then use an enum as a record key. That does not work for column names that have spaces in them, as mine do, so it's back to classic static final Strings as keys. I could create a fancier custom enum but it's not worth it for now.

Gary
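The "fancier custom enum" mentioned above could be as small as an enum whose constants carry the real column name, spaces included. The sketch below also shows one possible take on case-insensitive names, using a `TreeMap` ordered by `String.CASE_INSENSITIVE_ORDER`. Everything here is an assumption for illustration: the `Column` constants and their header texts are made up, and `EnumKeys`, `indexHeader`, and `get` are hypothetical names, not Commons CSV API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical column keys; the header texts are invented examples.
enum Column {
    FIRST_NAME("First Name"),
    LAST_NAME("Last Name"),
    ZIP_CODE("Zip Code");

    private final String header;

    Column(String header) {
        this.header = header;
    }

    // The column name as it appears in the file, spaces and all.
    String header() {
        return header;
    }
}

// Hypothetical helper, not part of Commons CSV.
final class EnumKeys {

    // Name-to-index map that ignores case when looking names up.
    static Map<String, Integer> indexHeader(String... headerRecord) {
        Map<String, Integer> map = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (int i = 0; i < headerRecord.length; i++) {
            map.put(headerRecord[i], i);
        }
        return map;
    }

    // Reads one value from a record using an enum constant as the key.
    static String get(Map<String, Integer> headerMap, String[] record, Column column) {
        Integer index = headerMap.get(column.header());
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + column.header());
        }
        return record[index];
    }
}
```

The enum keeps typos out of the application code, while the file remains free to reorder its columns or change their letter case.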
Re: [CSV] Headers and the first record
I took a brief look at the API for CSV, and thought I would share a typical use case from the biotech industry. We deal with a lot of instruments that produce a multiline header. The header usually contains experiment conditions. You can think of this as metadata for the columnar data. The experiment conditions usually contain things like the name of the scientist using the instrument, the time of day the experiment was run, and some instrument configuration settings. Usually when we parse CSV files, we have to parse the header first, extract all relevant data, and then parse the rows of data.

In addition to the experiment conditions header, there are also column headers. The column headers can be multi-lined as well. For example, you might have a column header whose first line contains chemical compound IDs or names, and the second line of the column header contains the concentrations for those compounds. The data values represent the percent inhibition at those concentrations. Like this:

    Erlotinib
    1uM     10 uM   100 uM  1nM
    0.01    0.001   0.0001  0.1
    ...

Since the position and types of header and body data vary, we typically use parse configuration files that describe what data can be found where. The parse configuration varies not only per instrument but also per experimental protocol. So there are usually numerous configuration files in your typical lab. The configuration files can also be stored in a database. This is usually part of a file-watching web app. It allows scientists to add support for new experiments or instruments without having to get a developer to write more code.

In the API I saw support for hard-coded configurations via the CSVFormat object, but I didn't see any support for creating and using persistable configurations. You may want to consider that as you move forward.
Hope this helps,
Mark

On Wed, Jul 31, 2013 at 6:36 AM, Gary Gregory garydgreg...@gmail.com wrote:

> <snip>
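The instrument files described in this message (a block of experiment metadata, then column headers stacked over several lines, then data) could be read along the following lines. This is a deliberately naive sketch under several assumptions: the two line counts play the role of the "parse configuration", metadata lines are assumed to be simple key,value pairs, fields are split on plain commas with no quoting support, and `MultiLineHeader` and `parse` are hypothetical names unrelated to Commons CSV.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Naive sketch of a multi-line header reader, not part of Commons CSV.
final class MultiLineHeader {
    final Map<String, String> metadata = new LinkedHashMap<>();
    final List<String> columnNames = new ArrayList<>();
    final List<String[]> data = new ArrayList<>();

    // metadataLines and headerLines stand in for a per-instrument configuration.
    static MultiLineHeader parse(List<String> lines, int metadataLines, int headerLines) {
        MultiLineHeader out = new MultiLineHeader();
        int i = 0;

        // 1. Experiment conditions: one "key,value" pair per line (assumed layout).
        for (; i < metadataLines; i++) {
            String[] kv = lines.get(i).split(",", 2);
            out.metadata.put(kv[0].trim(), kv.length > 1 ? kv[1].trim() : "");
        }

        // 2. Stacked column headers: join the parts of each column top to bottom.
        String[][] headerRows = new String[headerLines][];
        for (int h = 0; h < headerLines; h++) {
            headerRows[h] = lines.get(i++).split(",", -1);
        }
        int width = headerLines == 0 ? 0 : headerRows[0].length;
        for (int c = 0; c < width; c++) {
            StringBuilder name = new StringBuilder();
            for (String[] row : headerRows) {
                String part = c < row.length ? row[c].trim() : "";
                if (!part.isEmpty()) {
                    if (name.length() > 0) {
                        name.append(' ');
                    }
                    name.append(part);
                }
            }
            out.columnNames.add(name.toString());
        }

        // 3. Everything that remains is data.
        for (; i < lines.size(); i++) {
            out.data.add(lines.get(i).split(",", -1));
        }
        return out;
    }
}
```

Since the layout is described by two integers, it could be kept in a configuration file or database row per instrument and protocol, which is the kind of persistable configuration the message asks about.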
Re: [CSV] Headers and the first record
On Wed, Jul 31, 2013 at 11:14 AM, Mark Fortner phidia...@gmail.com wrote:

I took a brief look at the API for CSV, and thought I would share a typical use case from the biotech industry. We deal with a lot of instruments that produce a multiline header. The header usually contains experiment conditions. You can think of this as metadata for the columnar data. The experiment conditions usually contain things like the name of the scientist using the instrument, the time of day the experiment was run, and some instrument configuration settings. Usually when we parse CSV files, we have to parse the header first, extract all relevant data, and then parse the rows of data.

In addition to the experiment conditions header, there are also column headers. The column headers can be multi-lined as well. For example, you might have a column header whose first line contains chemical compound IDs or names, and the second line of the column header contains the concentrations for those compounds. The data values represent the percent inhibition at those concentrations. Like this:

Erlotinib
1uM   10 uM  100 uM  1nM
0.01  0.001  0.0001  0.1
...

Since the position and types of header and body data vary, we typically use parse configuration files that describe what data can be found where. The parse configuration varies not only per instrument but also per experimental protocol. So there are usually numerous configuration files in your typical lab. The configuration files can also be stored in a database. This is usually part of a file-watching web app. It allows scientists to add support for new experiments or instruments without having to get a developer to write more code.

In the API I saw support for hard-coded configurations via the CSVFormat object, but I didn't see any support for creating and using persistable configurations. You may want to consider that as you move forward.

Thank you for taking the time to offer your point of view here.

CSVFormat implements Serializable, so you can use plain old Java serialization, it's not human readable, but it's something. If we moved to Java 6, we could annotate CSVFormat with JAXB so you can have XML IO. Personally, I do not think we should do our own XML IO, so JAXB is the best path IMO since it is built-in Java 6.

What do you currently use to parse your CSV files? Would Commons-CSV work for you as well? If not, how so? Would you be willing to experiment with the current code?

Thank you,
Gary

Hope this helps,
Mark
Re: [CSV] Headers and the first record
On Wed, Jul 31, 2013 at 10:48 AM, Gary Gregory garydgreg...@gmail.com wrote:

On Wed, Jul 31, 2013 at 9:34 AM, Emmanuel Bourg ebo...@apache.org wrote:

On 31/07/2013 15:08, Gary Gregory wrote:

But that is exactly what _was_ happening! ;) If I called withHeader(A, B, C) the header was not skipped.

Sounds good. The header is defined in the code, we don't expect to see the header in the file so nothing is skipped.

NOT good! ;) This is where we disagree. The parser used to behave differently depending on the contents of the String[].

- From an API design standpoint, it's smelly to me.
- The feature is hard to understand. If we want that, we need two APIs for two behaviors.

Using the withHeader API, I can tell the parser to:

- Ignore the fact that there is a header record, I am overriding it with my own names
- There is no header record, so I am telling you what the header names are.

These two features clash because in one case the file has a header line and in the other the file does not. This is why we need settings with different names. That or a setting that says 'skip the first record, it's the header, I do not want to see it as a data record'.

I see three scenarios:

1) I set the headers (the file does not have one), do not skip the first record
2) I override the existing header record, skip the first record
3) The parser guesses the headers based on reading the first record, which skips the first record as a data record

This can be accommodated with a skipHeaderRecord boolean setting. I do not care what the default behavior is as long as I can say 'this file has headers, guess them please, and skip record 0' and 'this file has a header record, but I'm telling you to call them A, B, and C, so skip record 0'.

1) withHeader(A, B, C).skipHeaderRecord(false);
2) withHeader(A, B, C).skipHeaderRecord(true);
3) withHeader()

Is there a better name for skipHeaderRecord? Maybe:

1b) withHeader(A, B, C).firstRecordIsHeader(false);
2b) withHeader(A, B, C).firstRecordIsHeader(true);

Here the difference is that the API does not describe behavior, instead it describes the data, and behavior is implied. There is also:

1c) withHeader(A, B, C)
2c) withHeaderOverride(A, B, C)

Thoughts?

I reverted back to NOT skipping a record when withHeader is called with a non-empty array; and added a skipHeaderRecord setting to CSVFormat to use when headers are initialized.

Gary

Gary

If I called withHeader(new String[]{}) the header was skipped.

Correct. The header is not defined in the code, the parser uses the first record as header and doesn't return it when iterating.

If I called withHeader() the header was skipped (same as line above).

Sounds good too. What was the issue again ? ;)

What I am asking is: should we have a saveHeader setting such that IF you ask for headers, then we save that record in the parser, it is currently lost, or, actually transformed into the header map.

Keeping the header around might be useful, I wouldn't create a format parameter for this though. It could be made available at the record level, much like ResultSet.getMetaData().

Emmanuel Bourg

To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

--
E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
Java Persistence with Hibernate, Second Edition http://www.manning.com/bauer3/
JUnit in Action, Second Edition http://www.manning.com/tahchiev/
Spring Batch in Action http://www.manning.com/templier/
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory
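The three scenarios discussed in this message can be sketched without the real parser. What follows is a minimal illustration in plain Java, not the Commons CSV implementation: the class HeaderScenarioSketch, its Result holder and the resolve method are hypothetical names, and records are assumed to be already split into String arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not how Commons CSV is implemented. */
final class HeaderScenarioSketch {

    /** The column names in effect plus the records that are handed back as data. */
    static final class Result {
        final List<String> header;
        final List<String[]> data;

        Result(List<String> header, List<String[]> data) {
            this.header = header;
            this.data = data;
        }
    }

    /**
     * @param records          every record of the file, in order
     * @param header           empty array: take the names from record 0;
     *                         non-empty array: use exactly these names
     * @param skipHeaderRecord only consulted when explicit names are given
     */
    static Result resolve(List<String[]> records, String[] header, boolean skipHeaderRecord) {
        if (header.length == 0) {
            // Scenario 3: record 0 supplies the names and is not returned as data.
            if (records.isEmpty()) {
                return new Result(new ArrayList<String>(), records);
            }
            return new Result(Arrays.asList(records.get(0)), records.subList(1, records.size()));
        }
        // Scenario 1 (flag false) and scenario 2 (flag true): the caller supplies the
        // names, and record 0 is dropped only when the caller asks for it.
        int from = (skipHeaderRecord && !records.isEmpty()) ? 1 : 0;
        return new Result(Arrays.asList(header), records.subList(from, records.size()));
    }

    public static void main(String[] args) {
        List<String[]> file = new ArrayList<String[]>();
        file.add(new String[] {"id", "name"});
        file.add(new String[] {"1", "Ann"});

        System.out.println(resolve(file, new String[] {"A", "B"}, false).data.size()); // prints 2
        System.out.println(resolve(file, new String[] {"A", "B"}, true).data.size());  // prints 1
        System.out.println(resolve(file, new String[0], false).header);                // prints [id, name]
    }
}
```

The point of the sketch is that a single boolean is enough to separate scenarios 1 and 2, while scenario 3 needs no flag because the header record is consumed in order to learn the names.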
Re: [CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)
2013/7/31 Gary Gregory garydgreg...@gmail.com

On Wed, Jul 31, 2013 at 10:42 AM, Benedikt Ritter brit...@apache.org wrote:

snip

A use case I have now is a CSV file with a lot of columns (~90) but I only care about a small subset of the columns (~10). I'd like to be able to say withHeader(Set) where the Set may be a subset of the actual column names in the header line. This is different from withHeader(String[]) because the names in the Set must match the names in the header record.

What you are talking about sounds more like a view or a projection of the actual content being parsed. Do we really need this for 1.0 or can it be postponed?

This is a real scenario and a real need, not some imaginary complication ;) Even if it is not implemented for 1.0, we should talk about how it should be done such that it fits in and does not cause API problems later. And if I can get it done by then, so much the better.

Okay, then let's discuss this on a new thread :-) As I've said, I think we should not push too much into withHeaders(String...). Maybe this is some sort of view, where you can pass a parser and the headers you are interested in and it will return an Iterable<CSVRecord> (or CSVParser) that just gives access to the specified headers you are interested in? Would it be possible to give a code example of what you have to do with the current API in your use case and what you want?

I am switching to withHeader() with no arg (same as a new String[]{}) and let the parser guess the headers and then pray that the names match between the app and the files. Which is just as unsafe as forcing the headers in fixed order on the parser because the column order might have changed. Ideally, the column order should not matter, which it does not when you do a record.get(String), which is nice. Calling withHeader() with no args is less brittle than calling it with 90 args. The benefit is that the column order in the file can change without affecting the app, which is good.

I could use a little more bullet-proofing by making the column names optionally case-insensitive, but that's a different feature. Ideally, I want to define the column names in the app as a simple Java enum, then use an enum as a record key. That does not work for column names that have spaces in them as mine do, so it's back to classic static final Strings as keys. I could create a fancier custom enum but it's not worth it for now.

Hey Gary, I still don't understand what you are suggesting. At first I thought this was about accessing a subset of the actual columns (you said your file has 90 columns but you are only interested in ~10). Your last message sounds more like you're looking for a better way to make sure the headers parsed from the file match what you are expecting. I guess this is why getHeaderMap is now public (?!) What am I missing?

Benedikt

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
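The "view" idea floated in this message, where the caller names the columns of interest and only those are exposed, could look roughly like the sketch below. It is written in plain Java and does not use Commons CSV; ColumnSubsetView and project are hypothetical names, records are assumed to be already split into String arrays, and every data record is assumed to have at least as many fields as the header record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a column-subset view; not part of Commons CSV. */
final class ColumnSubsetView {

    /** Projects each data record onto the wanted columns, located by name in the header record. */
    static List<Map<String, String>> project(String[] headerRecord,
                                             List<String[]> dataRecords,
                                             Set<String> wanted) {
        // Find the position of every wanted name. If the header record repeats
        // a name, the last position wins in this sketch.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            if (wanted.contains(headerRecord[i])) {
                index.put(headerRecord[i], i);
            }
        }
        // The names in the set must match names in the header record.
        if (!index.keySet().containsAll(wanted)) {
            throw new IllegalArgumentException("Header record does not contain all of " + wanted);
        }
        List<Map<String, String>> view = new ArrayList<>();
        for (String[] record : dataRecords) {
            Map<String, String> row = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> entry : index.entrySet()) {
                row.put(entry.getKey(), record[entry.getValue()]);
            }
            view.add(row);
        }
        return view;
    }
}
```

Because the positions are computed from the header record, the column order in the file can change without the calling code noticing, which is the property asked for in the thread.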
Re: [CSV] Accessing a subset of the available headers (Was: Re: [CSV] Headers and the first record)
On Wed, Jul 31, 2013 at 2:38 PM, Benedikt Ritter brit...@apache.org wrote:

Hey Gary, I still don't understand what you are suggesting. At first I thought this was about accessing a subset of the actual columns (you said your file has 90 columns but you are only interested in ~10). Your last message sounds more like you're looking for a better way to make sure the headers parsed from the file match what you are expecting. I guess this is why getHeaderMap is now public (?!) What am I missing?

Sorry, it seems I keep mixing up the topics. For my many-columned file, I'm going with withHeaders() [no args] and get(String). That's good enough but I still need to have the proper header skipping, which is now in.

Yes, I'm looking for what amounts to schema validation, but since get(String) will fail on the first record, that's fail-fast enough for now :)

getHeaderMap() has been public for a long time, so that's not an issue here. getHeader() OTOH is now public because I want to be able to build on one format to get a new one.

Gary
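The "schema validation" mentioned in this reply can also be done up front, before the first get(String) call, by comparing the names the application expects against the names found in the header record. The sketch below is plain Java with hypothetical names (HeaderCheck, missingColumns); the ignoreCase flag illustrates the optional case-insensitivity wished for earlier in the thread and is an assumption, not an existing setting.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical up-front header check; not part of Commons CSV. */
final class HeaderCheck {

    /** Returns the expected names that do not occur among the actual header names. */
    static Set<String> missingColumns(Collection<String> expected,
                                      Collection<String> actual,
                                      boolean ignoreCase) {
        Set<String> present = new HashSet<>();
        for (String name : actual) {
            present.add(ignoreCase ? name.toLowerCase(Locale.ROOT) : name);
        }
        // A sorted set keeps the error report stable from run to run.
        Set<String> missing = new TreeSet<>();
        for (String name : expected) {
            String key = ignoreCase ? name.toLowerCase(Locale.ROOT) : name;
            if (!present.contains(key)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

A caller would fail fast when the returned set is not empty, for example by throwing an exception that lists the missing names. Plain strings are used as keys here because, as noted above, column names containing spaces do not map onto simple enum constants.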
Re: [CSV] Headers and the first record
Hi Gary,

One other complication I forgot to mention. Compounds are usually run multiple times. So the same compound will appear with the same set of concentrations. In practice you would end up with column headers that have the same text in them, so this issue with using a Set vs String[] for the column names would complicate things.

CSVFormat implements Serializable, so you can use plain old Java serialization, it's not human readable, but it's something.

A human readable configuration would probably be a high priority.

If we moved to Java 6, we could annotate CSVFormat with JAXB so you can have XML IO. Personally, I do not think we should do our own XML IO, so JAXB is the best path IMO since it is built-in Java 6.

It would be best if there were a CSVFormat serializer so that the CSVFormat could be injected. Using JAXB would be fine as a default implementation, but I imagine that the configuration format would change. Or that a user might decide to store individual configuration items in a database.

What do you currently use to parse your CSV files?

Most biotech companies have their own home grown tools for parsing instrument files. There isn't a standard library.

Would Commons-CSV work for you as well? If not, how so?

As I understand it, the code doesn't support experiment condition-type parameters, like this:

Date: 12/10/13 Protocol: Selectivity Profile 1 Instrument Name: Gandalf Scientist: John Smith

Would you be willing to experiment with the current code?

Sure. If the previous issues were addressed.

I'm curious if other industries have similar issues? I assume that anyone that deals with instrument data might have similar needs.

Mark
Re: [CSV] Headers and the first record
On Wed, Jul 31, 2013 at 3:44 PM, Mark Fortner phidia...@gmail.com wrote:

As I understand it, the code doesn't support experiment condition-type parameters, like this:

Date: 12/10/13 Protocol: Selectivity Profile 1 Instrument Name: Gandalf Scientist: John Smith

This does not look like a classic CSV file. It sounds like your files contain different sections in different formats. In its current state, commons-csv might not be right for you. What does the rest of the file look like?

Gary
Re: [CSV] Headers and the first record
Hi Gary,

This does not look like a classic CSV file.

I guess it depends on what your definition of classic is. :-) This is pretty typical for most drug discovery companies.

It sounds like your files contain different sections in different formats.

True.

In its current state, commons-csv might not be right for you. What does the rest of the file look like?

The data section looks similar to this.

        Erlotinib - Run 1            Erlotinib - Run 2
Target  1uM   10 uM  100 uM  1nM     1uM   10 uM  100 uM  1nM
BRCA1   0.01  0.001  0.0001  0.1     0.01  0.001  0.0001  0.1
BRCA2   0.2   0.002  0.0002  0.2     0.2   0.002  0.0002  0.2

Regards,
Mark
Re: [CSV] Headers and the first record
On Wed, Jul 31, 2013 at 4:38 PM, Mark Fortner phidia...@gmail.com wrote:

The data section looks similar to this.

        Erlotinib - Run 1            Erlotinib - Run 2
Target  1uM   10 uM  100 uM  1nM     1uM   10 uM  100 uM  1nM
BRCA1   0.01  0.001  0.0001  0.1     0.01  0.001  0.0001  0.1
BRCA2   0.2   0.002  0.0002  0.2     0.2   0.002  0.0002  0.2

Hm... so it looks like you have a couple of rows that each have a different format. For some rows, the format has the header and its value on the same line:

Date: 12/10/13 Protocol: Selectivity Profile 1 Instrument Name: Gandalf Scientist: John Smith

Which is different from the 'usual' column we see. Your format is more like a spreadsheet than a CSV file. Nonetheless, we would need to extend our current feature set to accommodate this format. I could see the client code looking like this:

// row one is a key: value pair
format.addKeyValueRow(1, ":");
// row two is 2 key: value pairs, separated by a tab
format.addKeyValueRow(2, ":", "\t"); // 2 pairs

The args should also be a format object of some kind, like we have a CSVFormat object now.

This seems out of scope for 1.0 if we are itching to get 1.0 out the door.

Gary

Regards,
Mark
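To make the key: value row idea a little more concrete, here is a minimal sketch of how one such row could be split once the two separators are known. It is plain Java and independent of Commons CSV; KeyValueRowSketch and parse are hypothetical names, and the use of a tab between pairs follows the example in this message rather than any knowledge of the actual instrument files.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical parser for a single "key: value" metadata row; not part of Commons CSV. */
final class KeyValueRowSketch {

    /**
     * Splits a line into pairs on pairSeparator, then splits every pair at the first
     * occurrence of keyValueSeparator. Keys and values are trimmed, and the order of
     * the pairs in the line is preserved.
     */
    static Map<String, String> parse(String line, String keyValueSeparator, String pairSeparator) {
        Map<String, String> pairs = new LinkedHashMap<>();
        // Pattern.quote makes sure the separator is taken literally, not as a regex.
        for (String pair : line.split(Pattern.quote(pairSeparator))) {
            int at = pair.indexOf(keyValueSeparator);
            if (at < 0) {
                throw new IllegalArgumentException("No '" + keyValueSeparator + "' in: " + pair);
            }
            String key = pair.substring(0, at).trim();
            String value = pair.substring(at + keyValueSeparator.length()).trim();
            pairs.put(key, value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("Date: 12/10/13", ":", "\t"));
        System.out.println(parse("Instrument Name: Gandalf\tScientist: John Smith", ":", "\t"));
    }
}
```

Splitting at the first separator only means that a value may itself contain the separator, as in a time of day such as 10:30.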
Re: [CSV] Headers and the first record
Hi All:

I see now, the behavior is different depending on what you pass to withHeader()! Confusing indeed.

If you call withHeader with Strings, the first line is not read and it is returned as a record.

If you call withHeader with no arguments, the first line _is_ read and it is NOT returned as a record.

I think I'll change it so that withHeader causes the first line to be skipped, always, and add an option skipHeaders with a default of true. So if you really want to set the headers AND see what they are, you can do that.

Gary

On Tue, Jul 30, 2013 at 3:44 PM, Gary Gregory garydgreg...@gmail.com wrote:

Hi All:

I have Excel files with headers. So I use withHeaders() of course to map the headers. When I call parser.iterator().next(), the first record is the header record, not data. I always have to skip this first line since it is not data.

I wonder if:

1) We should automatically skip the header line for next() and parser.getRecords(), or
2) Add a skipHeader boolean setting to control the above behavior, where the default is...?

(2) is the most flexible.

Thoughts?

Gary
Re: [CSV] Headers and the first record
Actually, if you use withHeader(), no args, you _cannot_ get back the first record, so that makes skipHeader=false not possible without making the parser track the first record separately.

In the interest of simplicity, I am going to make it simple: if you use withHeader of any kind, then the first record is read.

Gary
Re: [CSV] Headers and the first record
I haven't checked the current code, but the intended behavior was:

- no args: the first record defines the header and is not returned when iterating
- args: the header is defined independently of the data, all the records are returned when iterating

Emmanuel Bourg
Re: [CSV] Headers and the first record
On Tue, Jul 30, 2013 at 5:15 PM, Emmanuel Bourg ebo...@apache.org wrote:

I haven't checked the current code, but the intended behavior was:

- no args: the first record defines the header and is not returned when iterating
- args: the header is defined independently of the data, all the records are returned when iterating

Yeah, that's too clever IMO. I expected the same behavior WRT record reading with the only difference being if I let the parser guess or not.

The current code now always reads the header line if you set any non-null header. If you call withHeader() with no args it is a non-null call with an empty String[].

The idea being that if I use headers and I ask the parser to guess or give it the headers, I do not need to have the header line as a record.

I plan on adding a setting that allows the header record to be saved for callers who care.

Gary
Re: [CSV] Headers and the first record
And another thing: internally, the header should be a Set<String>, not a String[]. I plan on fixing that later too.

Gary
Re: [CSV] Headers and the first record
On 30/07/2013 23:26, Gary Gregory wrote:

And another thing: internally, the header should be a Set<String>, not a String[]. I plan on fixing that later too.

Why should it be a set? Is there an impact on the performance?

Emmanuel Bourg
Re: [CSV] Headers and the first record
On 30/07/2013 23:24, Gary Gregory wrote:

Yeah, that's too clever IMO. I expected the same behavior WRT record reading with the only difference being if I let the parser guess or not.

Too clever? I didn't feel like I designed a rocket with this feature though :) That's an important feature to me and I'd like to preserve it. If the header is defined in the file I don't want to skip the first record manually, the parser should take care of it. That also means the user code can remain the same, whether the header is defined in the code or in the file.

The current code now always reads the header line if you set any non-null header. If you call withHeader() with no args it is a non-null call with an empty String[].

I guess a null header or an empty header is just the same and means the first record must be used as the header.

Emmanuel Bourg
Re: [CSV] Headers and the first record
On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg ebo...@apache.org wrote:

> On 30/07/2013 23:26, Gary Gregory wrote:
>
>> And another thing: internally, the header should be a Set<String>, not a String[]. I plan on fixing that later too.
>
> Why should it be a set? Is there an impact on the performance?

Well, I did not finish my thought on that one, sorry about that; please allow me to walk through my use cases.

The issue is about the feature, not performance. At first glance, using a set avoids an inherent problem with any non-set data structure: defining duplicates. What does the following mean?

    withHeader("A", "B", "C", "A");

It is a recipe for garbage results: record.get("A") returns what?

Today, I added some CSVFormat validation code that checks for duplicate column names. If you build a format with withHeader("A", "B", "C", "A"), you will get an ISE (IllegalStateException) when validate() is called.

If we had withHeader(Set) and documented it as the 'main' way to specify column names, then we could say that withHeader(String...) is just a syntactic convenience and turn the String[] into a Set.

But that will not work. The problem with a Java Set is that it is not ordered and the current implementation relies on the order of the String[]. But why?

What the current implementation says is: ignore what the header line of the file is and use the given column names at the given positions. A perfectly good user story. So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and so on. OK, that's one usage.

Taking a step back, I want to talk about why the column name order should matter when you are calling withHeader(). I would like to be able to tell the parser that I want to use a Set of column names and have it figure out, based on the header line, the column indices. This is quite different from what we have now.

A use case I have now is a CSV file with a lot of columns (~90) but I only care about a small subset of the columns (~10). I'd like to be able to say withHeader(Set) where the Set may be a subset of the actual column names in the header line. This is different from withHeader(String[]) because the names in the Set must match the names in the header record.

So I think it boils down to ignoring my comment about using a Set internally and adding a feature where I can tell the parser that I want to use a set of column names and not worry about the order, because the parser will match up the column names when it reads the header line.

Gary

> Emmanuel Bourg
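The proposed feature (give the parser an unordered set of names and let it find their positions in the header line) could be sketched roughly as follows. This is only an illustration in plain Java: `HeaderSubset` and `resolve` are hypothetical names, not part of Commons CSV, and failing on a missing or duplicated column is an assumption about the desired behaviour.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class HeaderSubset {

    /** Maps each wanted column name to its index in the header record read from the file. */
    static Map<String, Integer> resolve(String[] headerRecord, Set<String> wanted) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            String name = headerRecord[i];
            if (wanted.contains(name) && indices.put(name, i) != null) {
                throw new IllegalStateException("Duplicate column in header: " + name);
            }
        }
        // Every requested name must exist in the file's header line.
        if (!indices.keySet().containsAll(wanted)) {
            Set<String> missing = new HashSet<>(wanted);
            missing.removeAll(indices.keySet());
            throw new IllegalArgumentException("Columns not found in header: " + missing);
        }
        return indices;
    }

    public static void main(String[] args) {
        String[] headerRecord = {"id", "name", "email", "phone"};
        Set<String> wanted = new HashSet<>(Arrays.asList("phone", "name"));
        System.out.println(resolve(headerRecord, wanted));
    }
}
```

Because the indices come from the file rather than from the caller, the order in which the names are supplied no longer matters, which is why a Set is a natural argument type for this variant.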
Re: [CSV] Headers and the first record
On Tue, Jul 30, 2013 at 5:47 PM, Emmanuel Bourg ebo...@apache.org wrote:

> On 30/07/2013 23:24, Gary Gregory wrote:
>
>> Yeah, that's too clever IMO. I expected the same behavior WRT record reading with the only difference being if I let the parser guess or not.
>
> Too clever? I didn't feel like I designed a rocket with this feature though :)
>
> That's an important feature to me and I'd like to preserve it. If the header is defined in the file I don't want to skip the first record manually, the parser should take care of it. That also means the user code can remain the same, whether the header is defined in the code or in the file.

Let me reply to this part tomorrow (it's late here ;)

>> The current code now always reads the header line if you set any non-null header. If you call withHeader() with no args it is a non-null call with an empty String[].
>
> I guess a null header or an empty header is just the same and means the first record must be used as the header.

It is not the same at all. A null header String[] is different from a length 0 array. It's been like that for a while.

Gary

> Emmanuel Bourg
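The null versus empty distinction follows from how Java varargs work: calling a `String...` method with no arguments passes a zero-length array, never null. A minimal sketch (the class `VarargsHeader`, the method `describe`, and the meaning attached to each case are illustrative assumptions, not the Commons CSV code):

```java
final class VarargsHeader {

    /** Reports which of the three cases a varargs header argument falls into. */
    static String describe(String... header) {
        if (header == null) {
            return "null";      // only reachable with an explicit (String[]) null
        }
        if (header.length == 0) {
            return "empty";     // what a no-args call produces
        }
        return "explicit:" + header.length;
    }

    public static void main(String[] args) {
        System.out.println(describe());                 // no args
        System.out.println(describe((String[]) null));  // explicit null array
        System.out.println(describe("A", "B", "C"));    // explicit names
    }
}
```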
Re: [csv] Headers
On 15 March 2012 01:58, Emmanuel Bourg ebo...@apache.org wrote:

> There is another alternative, we might replace the records returned as a String[] by a CSVRecord class able to access the fields by id or by name. This would be similar to a JDBC resultset (except for the looping logic)

Sounds good. This discussion showed that a record is more than a String array. So having a specialized class is a good idea.

> This avoids the duplication of the parser, which might still be generified later to support custom beans. The example becomes:
>
>     CSVFormat format = CSVFormat.DEFAULT.withHeader();
>
>     for (CSVRecord record : format.parse(in)) {
>         Person person = new Person();
>         person.setName(record.get("name"));
>         person.setEmail(record.get("email"));
>         person.setPhone(record.get("phone"));
>         persons.add(person);
>     }
>
> The record is not a Map to keep it simple, it only exposes 3 methods: get(int), get(String) and size()

I'm not sure if I understand the approach completely. The header cannot be accessed as a CSVRecord, right? CSVRecords know the header values through get(String). What happens if the format does not support a header? UnsupportedOperationException?

If I got you right, we could use getHeaders() to know which header values are available.

Maybe it would be useful to have the record implement Iterable as well.

Benedikt

> Emmanuel Bourg
Re: [csv] Headers
On 15/03/2012 08:55, Benedikt Ritter wrote:

> I'm not sure if I understand the approach completely. The Header can not be accessed as a CSVRecord, right? CSVRecords know the header values through get(string). What happens if the format does not support a header? UnsupportedOperationException?

Yes, or IllegalStateException.

> If I got you right, we could use getHeaders() to know, which header values are available.

The actual header would be returned by parser.getHeader().

> Maybe it would be useful to have the record implement iterable as well.

Or have a method return the array of values if you want to iterate over it.

Emmanuel Bourg
Re: [csv] Headers
There is another alternative, we might replace the records returned as a String[] by a CSVRecord class able to access the fields by id or by name. This would be similar to a JDBC resultset (except for the looping logic)

This avoids the duplication of the parser, which might still be generified later to support custom beans. The example becomes:

    CSVFormat format = CSVFormat.DEFAULT.withHeader();

    for (CSVRecord record : format.parse(in)) {
        Person person = new Person();
        person.setName(record.get("name"));
        person.setEmail(record.get("email"));
        person.setPhone(record.get("phone"));
        persons.add(person);
    }

The record is not a Map to keep it simple, it only exposes 3 methods: get(int), get(String) and size()

Emmanuel Bourg
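A record type exposing only get(int), get(String) and size() could look roughly like the sketch below. It is named `SimpleRecord` to make clear it is not the proposed CSVRecord itself; throwing IllegalStateException when no header mapping exists is one of the options raised in the thread, and returning null for an unknown column name is purely an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class SimpleRecord {

    private final String[] values;
    private final Map<String, Integer> mapping; // column name -> index, or null if no header

    SimpleRecord(String[] values, Map<String, Integer> mapping) {
        this.values = values;
        this.mapping = mapping;
    }

    /** Access by position. */
    String get(int index) {
        return values[index];
    }

    /** Access by column name; requires a header mapping. */
    String get(String name) {
        if (mapping == null) {
            throw new IllegalStateException("No header mapping was specified");
        }
        Integer index = mapping.get(name);
        return index != null ? values[index] : null;
    }

    int size() {
        return values.length;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapping = new HashMap<>();
        mapping.put("name", 0);
        mapping.put("email", 1);
        SimpleRecord record = new SimpleRecord(new String[] {"Ann", "ann@example.org"}, mapping);
        System.out.println(record.get("email") + " / " + record.get(0) + " / " + record.size());
    }
}
```

Sharing one mapping instance across all records keeps the per-record cost to a single array reference, which addresses the concern about paying for a Map per row.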
Re: [csv] Headers
On 13/03/2012 00:56, sebb wrote:

> On 12 March 2012 22:11, Emmanuel Bourg ebo...@apache.org wrote:
>
>> [csv] is missing some elements to ease the use of headers. I have no clear idea on how to address this, here are my thoughts.
>>
>> Headers are used when the fields are accessed by the column name rather than by the index. This provides some flexibility because the input file can be slightly modified by reordering the columns or by inserting new columns without breaking the existing code.
>>
>> Using the current API here is how one would work with headers:
>>
>>     CSVParser parser = new CSVParser(in);
>>     Iterator<String[]> it = parser.iterator();
>>
>>     // read the header
>>     String[] header = it.next();
>>
>>     // build a name to index mapping
>>     Map<String, Integer> mapping = new HashMap<String, Integer>();
>>     for (int i = 0; i < header.length; i++) {
>>         mapping.put(header[i], i);
>>     }
>>
>>     // parse the records
>>     for (String[] record : parser) {
>>         Person person = new Person();
>>         person.setName(record[mapping.get("name")]);
>>         person.setEmail(record[mapping.get("email")]);
>>         person.setPhone(record[mapping.get("phone")]);
>>         persons.add(person);
>>     }
>>
>> The user has to take care of the mapping, which is not very friendly. I have several solutions in mind:
>>
>> 1. Do nothing and address it in the next release with the bean mapping. Parsing the file would then look like this:
>>
>>     CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
>>
>>     for (Person person : format.parse(in)) {
>>         persons.add(person);
>>     }
>
> Does this automatically mean that the file has a header? Or is there another way to link columns to Person attributes?
>
> I don't think this should be the only way of handling named columns; it's not always convenient to create a type.

I agree. Sometimes, the columns are just a part of a class that would need other parameters not in the columns (but perhaps in a custom comment of the header, if these parameters are constant throughout the file). So providing an intermediate-level API (with mapping already done, but still access to individual fields) is a must.

>> 2. Add a parser returning a Map instead of a String[]
>>
>>     // declare the header in the format,
>>     // the header line will be parsed automatically
>>     CSVFormat format = CSVFormat.DEFAULT.withHeader();
>>
>>     for (Map<String, String> record : new CSVMapParser(in, format)) {
>>         Person person = new Person();
>>         person.setName(record.get("name"));
>>         person.setEmail(record.get("email"));
>>         person.setPhone(record.get("phone"));
>>         persons.add(person);
>>     }
>
> That seems OK; one can also just use the column values directly.

+1

Luc

>> 2bis. Have the same CSVParser class returning String[] or Map<String, String> depending on a generic parameter. Not sure it's possible with type erasure.
>
> It's not possible for two methods to differ only by return parameter type, so this can only be done if the method parameters are different after type erasure.
>
>> 3. Have the parser maintain the name-index mapping. The parser reads the first line automatically if the format declares a header, and a getColumnIndex() method is exposed.
>>
>>     CSVFormat format = CSVFormat.DEFAULT.withHeader();
>>     CSVParser parser = new CSVParser(in, format);
>>
>>     // parse the records
>>     for (String[] record : parser) {
>>         Person person = new Person();
>>         person.setName(record[parser.getColumnIndex("name")]);
>>         person.setEmail(record[parser.getColumnIndex("email")]);
>>         person.setPhone(record[parser.getColumnIndex("phone")]);
>>         persons.add(person);
>>     }
>>
>> Quite awkward to use.
>>
>> What do you think?
>>
>> Emmanuel Bourg
Re: [csv] Headers
Emmanuel Bourg wrote:

> On 13/03/2012 00:56, sebb wrote:
>
>>> 1. Do nothing and address it in the next release with the bean mapping. Parsing the file would then look like this:
>>>
>>>     CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
>>>
>>>     for (Person person : format.parse(in)) {
>>>         persons.add(person);
>>>     }
>>
>> Does this automatically mean that the file has a header? Or is there another way to link columns to Person attributes?
>
> If the file doesn't have a header, the fields are matched by index (either the natural ordering of the attributes in the class, or specified by an annotation).
>
> If the file has a header, the fields are matched by attribute name, and an annotation can override the name of the column associated to an attribute.

Yeah, but that's not required. Just because you can read the names of the columns does not mean that you want to address them by name. Why pay the price for creating the map and accessing the values by name just for a one-time information?

- Jörg
Re: [csv] Headers
On 13/03/2012 09:21, Jörg Schaible wrote:

>> If the file has a header, the fields are matched by attribute name, and an annotation can override the name of the column associated to an attribute.
>
> Yeah, but that's not required. Just because you can read the names of the columns does not mean that you want to address them by name. Why pay the price for creating the map and accessing the values by name just for a one-time information?

Sorry I forgot the end of my message, I meant to access the fields by name OR by index when the header is present. That would be configured with the annotations.

Emmanuel Bourg
Re: [csv] Headers
On 13 March 2012 08:52, Emmanuel Bourg ebo...@apache.org wrote:

> On 13/03/2012 09:21, Jörg Schaible wrote:
>
>>> If the file has a header, the fields are matched by attribute name, and an annotation can override the name of the column associated to an attribute.
>>
>> Yeah, but that's not required. Just because you can read the names of the columns does not mean that you want to address them by name. Why pay the price for creating the map and accessing the values by name just for a one-time information?
>
> Sorry I forgot the end of my message, I meant to access the fields by name OR by index when the header is present. That would be configured with the annotations.

It needs to be possible to access columns by index without having to use annotations.

> Emmanuel Bourg
Re: [csv] Headers
On 13/03/2012 09:56, sebb wrote:

> It needs to be possible to access columns by index without having to use annotations.

That's still possible with the low level API. I'm just exploring the features I would expect of a bean mapping.

Emmanuel Bourg
Re: [csv] Headers
I think transforming the result of the parse process into instances of some class is a different concern. That should not be part of a CSVParser. In Hibernate they use ResultTransformers for this purpose [1]. I think we should separate these concerns as well.

[1] http://docs.jboss.org/hibernate/orm/3.3/api/org/hibernate/transform/ResultTransformer.html

On 13 March 2012 10:03, Emmanuel Bourg ebo...@apache.org wrote:

> On 13/03/2012 09:56, sebb wrote:
>
>> It needs to be possible to access columns by index without having to use annotations.
>
> That's still possible with the low level API. I'm just exploring the features I would expect of a bean mapping.
>
> Emmanuel Bourg
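One way to picture the separation suggested here is a small transformer hook applied to rows that a parser has already produced. The sketch below is an assumption-laden illustration: `RecordTransformer`, `TransformDemo` and `Person` are hypothetical names, and the interface is only loosely inspired by the idea of a result transformer, not copied from Hibernate or Commons CSV.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hook that turns one parsed row into an object; knows nothing about parsing. */
interface RecordTransformer<T> {
    T transform(String[] header, String[] values);
}

final class TransformDemo {

    static final class Person {
        final String name;
        final String email;

        Person(String name, String email) {
            this.name = name;
            this.email = email;
        }
    }

    /** Applies the transformer to every row; the rows come from whatever parser is in use. */
    static <T> List<T> transformAll(String[] header, List<String[]> rows,
                                    RecordTransformer<T> transformer) {
        List<T> result = new ArrayList<>();
        for (String[] row : rows) {
            result.add(transformer.transform(header, row));
        }
        return result;
    }

    /** Linear lookup of a column name in the header. */
    static int indexOf(String[] header, String name) {
        for (int i = 0; i < header.length; i++) {
            if (header[i].equals(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unknown column: " + name);
    }

    public static void main(String[] args) {
        String[] header = {"name", "email"};
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Ann", "ann@example.org"});
        List<Person> persons = transformAll(header, rows,
                (h, v) -> new Person(v[indexOf(h, "name")], v[indexOf(h, "email")]));
        System.out.println(persons.get(0).name + " " + persons.get(0).email);
    }
}
```

With this split, the parser keeps returning plain rows, and bean creation lives entirely in caller-supplied code.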
Re: [csv] Headers
On 12 March 2012 22:11, Emmanuel Bourg ebo...@apache.org wrote:

> [csv] is missing some elements to ease the use of headers. I have no clear idea on how to address this, here are my thoughts.
>
> Headers are used when the fields are accessed by the column name rather than by the index. This provides some flexibility because the input file can be slightly modified by reordering the columns or by inserting new columns without breaking the existing code.
>
> Using the current API here is how one would work with headers:
>
>     CSVParser parser = new CSVParser(in);
>     Iterator<String[]> it = parser.iterator();
>
>     // read the header
>     String[] header = it.next();
>
>     // build a name to index mapping
>     Map<String, Integer> mapping = new HashMap<String, Integer>();
>     for (int i = 0; i < header.length; i++) {
>         mapping.put(header[i], i);
>     }
>
>     // parse the records
>     for (String[] record : parser) {
>         Person person = new Person();
>         person.setName(record[mapping.get("name")]);
>         person.setEmail(record[mapping.get("email")]);
>         person.setPhone(record[mapping.get("phone")]);
>         persons.add(person);
>     }
>
> The user has to take care of the mapping, which is not very friendly. I have several solutions in mind:
>
> 1. Do nothing and address it in the next release with the bean mapping. Parsing the file would then look like this:
>
>     CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
>
>     for (Person person : format.parse(in)) {
>         persons.add(person);
>     }

Does this automatically mean that the file has a header? Or is there another way to link columns to Person attributes?

I don't think this should be the only way of handling named columns; it's not always convenient to create a type.

> 2. Add a parser returning a Map instead of a String[]
>
>     // declare the header in the format,
>     // the header line will be parsed automatically
>     CSVFormat format = CSVFormat.DEFAULT.withHeader();
>
>     for (Map<String, String> record : new CSVMapParser(in, format)) {
>         Person person = new Person();
>         person.setName(record.get("name"));
>         person.setEmail(record.get("email"));
>         person.setPhone(record.get("phone"));
>         persons.add(person);
>     }

That seems OK; one can also just use the column values directly.

> 2bis. Have the same CSVParser class returning String[] or Map<String, String> depending on a generic parameter. Not sure it's possible with type erasure.

It's not possible for two methods to differ only by return parameter type, so this can only be done if the method parameters are different after type erasure.

> 3. Have the parser maintain the name-index mapping. The parser reads the first line automatically if the format declares a header, and a getColumnIndex() method is exposed.
>
>     CSVFormat format = CSVFormat.DEFAULT.withHeader();
>     CSVParser parser = new CSVParser(in, format);
>
>     // parse the records
>     for (String[] record : parser) {
>         Person person = new Person();
>         person.setName(record[parser.getColumnIndex("name")]);
>         person.setEmail(record[parser.getColumnIndex("email")]);
>         person.setPhone(record[parser.getColumnIndex("phone")]);
>         persons.add(person);
>     }
>
> Quite awkward to use.
>
> What do you think?
>
> Emmanuel Bourg
Re: [csv] Headers
On 13/03/2012 00:56, sebb wrote:

>> 1. Do nothing and address it in the next release with the bean mapping. Parsing the file would then look like this:
>>
>>     CSVFormat<Person> format = CSVFormat.DEFAULT.withType(Person.class);
>>
>>     for (Person person : format.parse(in)) {
>>         persons.add(person);
>>     }
>
> Does this automatically mean that the file has a header? Or is there another way to link columns to Person attributes?

If the file doesn't have a header, the fields are matched by index (either the natural ordering of the attributes in the class, or specified by an annotation).

If the file has a header, the fields are matched by attribute name, and an annotation can override the name of the column associated to an attribute.

Emmanuel Bourg
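The header-based half of this bean mapping idea (match by attribute name, with an annotation overriding the column name) might be sketched with reflection as below. Everything here is hypothetical: the `Column` annotation, `Contact` and `BeanMapperSketch` are invented for illustration, only String fields are handled, and the no-header, index-based case is left out because Java reflection does not guarantee the order of declared fields.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

/** Hypothetical annotation overriding the column name bound to a field. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Column {
    String value();
}

/** Example bean: "name" binds by field name, "email" binds to the column "e-mail". */
class Contact {
    String name;
    @Column("e-mail")
    String email;
}

final class BeanMapperSketch {

    /** Creates a bean and fills its String fields from the values under matching header names. */
    static <T> T map(Class<T> type, String[] header, String[] values) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (field.isSynthetic() || field.getType() != String.class) {
                    continue; // this sketch only binds String fields
                }
                Column column = field.getAnnotation(Column.class);
                String columnName = column != null ? column.value() : field.getName();
                for (int i = 0; i < header.length && i < values.length; i++) {
                    if (header[i].equals(columnName)) {
                        field.setAccessible(true);
                        field.set(bean, values[i]);
                    }
                }
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map record to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Contact contact = map(Contact.class,
                new String[] {"e-mail", "name"},
                new String[] {"ann@example.org", "Ann"});
        System.out.println(contact.name + " " + contact.email);
    }
}
```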