[jira] [Updated] (CSV-42) Lots of possible changes

Emmanuel Bourg (Updated) (JIRA) Mon, 26 Mar 2012 16:58:56 -0700

     [ 
https://issues.apache.org/jira/browse/CSV-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Emmanuel Bourg updated CSV-42:
------------------------------

    Description: 
I made a lot of changes to pretty much all of the classes in the csv package.  
I thought it would be better to put all of the the changes here in one issue, 
but feel free to only take the parts you like (if any).  Hopefully if nothing 
else the test cases will be useful to you.

I'll try to list most of the changes here, but I'm sure I'm forgetting some. 
This should include all of the big changes at least. I focused mostly on the 
parser, but I also made a few changes to the printer classes (although I don't 
think I added any new test cases there).

h3. General Changes

- Changed all class names with "CSV" in them to use "Csv". This is how it 
appears in the commons-lang "escapeCsv" methods and I think it's easier to read 
the class name when acronyms are not in all upper case. (x) {color:red}3 letter 
acronyms are usually kept in uppercase (for example URLConnection or SAXParser 
in the JDK, but there are some exceptions){color}

- Formatted the code. I used Eclipse with a version of the Java formatting 
style that uses spaces instead of tabs and with a few other small changes to 
try to make it more similar to the style of this code. The formatting was 
inconsistent before (sometimes 2 space indent, sometimes 4) which made it 
really hard to work on. (/) {color:green}DONE{color}

- Removed all deprecated methods/constructors (/) {color:green}DONE{color}

- Made all public classes final. If there is ever a need to create subclasses 
of them then this could be changed, but I think it would be better to at least 
start them as final (since once they are released as non-final it's hard to go 
back).

- A few bug fixes (and test cases for them)

h3. CsvParser

- There were a few bugs for special cases, so I made as small of changes as I 
could to the parser code to fix these.

- Added a lot of test cases. I created a test case for all bugs that I found, 
so even if you don't use my changes to this class you should be able to use the 
test cases to find all of the same bugs.

- Added a close method. (x) {color:red} The try-with-resources statement in 
Java 7 makes resources management much easier, there is no need to add a 
close() method to the parser.{color}

- Renamed the nextValue method to getValue (so it is more consistent with the 
getAll and getLine method names). I think I would prefer to use a different 
method name prefix for all three of these (like "readAll") since I wouldn't 
normally expect a "get" method to have side effects, but I didn't want to just 
change the names of the most used methods. (x) {color:red}This method has been 
removed, the parser now works line by line.{color}

- Changed the getLineNumber method to return the correct line number when there 
are multi-line values. (x) {color:red}The suggested code counts the number of 
records instead of the number of lines. For debugging it's better to return the 
actual line number.{color}

- Moved all of the lexer methods into an inner CsvLexer class that is 
completely independent of the CsvParser class. The methods were already 
separated out, so it wasn't a very big change. I also moved the lexer test 
cases into a new CsvLexerTest class. (/) {color:green}DONE{color}

- Got rid of the interpreting unicode escape options. This doesn't really have 
anything to do with parsing a CSV file so I think it should be left up to the 
user of the class to implement this if needed. As an example, I made a 
CsvParserUnicodeEscapeTest class that uses the code from the lexer in a Reader 
subclass. One nice thing is that with this implementation, the interpreted 
values can be used as the delimiter, encapsulator, etc. (/) {color:green}DONE - 
The unicode unescaping is now handled by a class implementing java.io.Reader 
(to be contributed to Commons IO).{color}

- Got rid of the "escape" option for the same reason as the unicode escape 
option. I replaced it with an encapsulator escape option that is only used as 
an escape operator on the encapsulator character.

h3. ExtendedBufferedReader

- Greatly simplified this class. I removed all the methods that weren't being 
used (including keeping track of the line number) and changed the lookahead 
option to use the BufferedReader mark and reset methods. (/) {color:green}DONE 
- ExtendedBufferedReader is still counting the lines, but the mark/reset 
lookahead improved the performance by 30%.{color}


h3. CsvStrategy

- I split this class into three classes: an abstract base class (CsvStrategy), 
a parser-specific version (CsvParseStrategy) and a printer-specific version 
(CsvPrintStrategy). I didn't like that the strategy was used for both parsing 
and printing even though some of the values only applied to parsing (and there 
could be values that apply only to printing as well). (x) {color:red}There 
aren't enough properties in CSVFormat to split the class{color}

- Made this class immutable (as described in SANDBOX-279) (/) 
{color:green}DONE{color}

- Changed the whitespace options to not ignore whitespace by default. This is 
what the document at http://www.rfc-editor.org/rfc/rfc4180.txt recommends for 
the CSV format, so I think it should be like that by default. I added an 
"IGNORE_WHITESPACE_STRATEGY" field that works the same as the old defaults. (/) 
{color:green}DONE{color}

- Removed the interpretUnicodeEscape option and replaced the escape field with 
an encapsulatorEscape field (as described in the CsvParser change details). 
{color:green}DONE - Unicode unescaping has been removed from the format 
parameters{color}

- Added an ignoreEncapsulationTrailingCharacters field. This is used to either 
ignore or append characters that are after an encapsulated value. Previously an 
IOException was being thrown here, which I don't think is ever a good idea.

- Added some restrictions to prevent the values from being things that would 
break the parser. This includes using a line break for anything or having equal 
two values (other than the encapsulator and encapsulator escape). (/) 
{color:green}DONE{color}


h3. CsvPrinter

- I changed this to use a modified version of the commons-lang escapeCsv method 
(I hope it is ok to copy a small amount of code from one commons project to 
another?). The escaping is a little different (and simpler) that the old 
version, but I think the commons-lang version seems to be the best way to do 
it. (x) {color:red}Good point on the code reuse, but escapeCsv doesn't handle 
properly a format using escaped characters instead of quoted values, such as 
the default MySQL format.{color}

- I added an option to the constructor to allow disabling auto-flushing of the 
output stream (similar to what is in the PrintStream class). I also reduced the 
number of times the output is flushed when using the print method that take 
array input. (x) {color:red}CSVPrinter now outputs to an Appendable instead of 
a Writer, which means you can use a PrintStream and control exactly the 
behavior of the flushing.{color}


h3. CharBuffer

- I didn't really make any changes other than to make it a non-public class. 
(/) {color:green}DONE{color}



  was:
I made a lot of changes to pretty much all of the classes in the csv package.  
I thought it would be better to put all of the the changes here in one issue, 
but feel free to only take the parts you like (if any).  Hopefully if nothing 
else the test cases will be useful to you.

I'll try to list most of the changes here, but I'm sure I'm forgetting some. 
This should include all of the big changes at least. I focused mostly on the 
parser, but I also made a few changes to the printer classes (although I don't 
think I added any new test cases there).

h3. General Changes

- Changed all class names with "CSV" in them to use "Csv". This is how it 
appears in the commons-lang "escapeCsv" methods and I think it's easier to read 
the class name when acronyms are not in all upper case. (x) {color:red}3 letter 
acronyms are usually kept in uppercase (for example URLConnection or SAXParser 
in the JDK, but there are some exceptions){color}

- Formatted the code. I used Eclipse with a version of the Java formatting 
style that uses spaces instead of tabs and with a few other small changes to 
try to make it more similar to the style of this code. The formatting was 
inconsistent before (sometimes 2 space indent, sometimes 4) which made it 
really hard to work on. (/) {color:green}DONE{color}

- Removed all deprecated methods/constructors (/) {color:green}DONE{color}

- Made all public classes final. If there is ever a need to create subclasses 
of them then this could be changed, but I think it would be better to at least 
start them as final (since once they are released as non-final it's hard to go 
back).

- A few bug fixes (and test cases for them)

h3. CsvParser

- There were a few bugs for special cases, so I made as small of changes as I 
could to the parser code to fix these.

- Added a lot of test cases. I created a test case for all bugs that I found, 
so even if you don't use my changes to this class you should be able to use the 
test cases to find all of the same bugs.

- Added a close method. (x) {color:red} The try-with-resources statement in 
Java 7 makes resources management much easier, there is no need to add a 
close() method to the parser.{color}

- Renamed the nextValue method to getValue (so it is more consistent with the 
getAll and getLine method names). I think I would prefer to use a different 
method name prefix for all three of these (like "readAll") since I wouldn't 
normally expect a "get" method to have side effects, but I didn't want to just 
change the names of the most used methods. (x) {color:red}This method has been 
removed, the parser now works line by line.{color}

- Changed the getLineNumber method to return the correct line number when there 
are multi-line values. (x) {color:red}The suggested code counts the number of 
records instead of the number of lines. For debugging it's better to return the 
actual line number.{color}

- Moved all of the lexer methods into an inner CsvLexer class that is 
completely independent of the CsvParser class. The methods were already 
separated out, so it wasn't a very big change. I also moved the lexer test 
cases into a new CsvLexerTest class. (/) {color:green}DONE{color}

- Got rid of the interpreting unicode escape options. This doesn't really have 
anything to do with parsing a CSV file so I think it should be left up to the 
user of the class to implement this if needed. As an example, I made a 
CsvParserUnicodeEscapeTest class that uses the code from the lexer in a Reader 
subclass. One nice thing is that with this implementation, the interpreted 
values can be used as the delimiter, encapsulator, etc. (/) {color:green}DONE - 
The unicode unescaping is now handled by a class implementing java.io.Reader 
(to be contributed to Commons IO).{color}

- Got rid of the "escape" option for the same reason as the unicode escape 
option. I replaced it with an encapsulator escape option that is only used as 
an escape operator on the encapsulator character.

h3. ExtendedBufferedReader

- Greatly simplified this class. I removed all the methods that weren't being 
used (including keeping track of the line number) and changed the lookahead 
option to use the BufferedReader mark and reset methods. (/) {color:green}DONE 
- ExtendedBufferedReader is still counting the lines, but the mark/reset 
lookahead improved the performance by 30%.{color}


h3. CsvStrategy

- I split this class into three classes: an abstract base class (CsvStrategy), 
a parser-specific version (CsvParseStrategy) and a printer-specific version 
(CsvPrintStrategy). I didn't like that the strategy was used for both parsing 
and printing even though some of the values only applied to parsing (and there 
could be values that apply only to printing as well). (x) {color:red}There 
aren't enough properties in CSVFormat to split the class{color}

- Made this class immutable (as described in SANDBOX-279) (/) 
{color:green}DONE{color}

- Changed the whitespace options to not ignore whitespace by default. This is 
what the document at http://www.rfc-editor.org/rfc/rfc4180.txt recommends for 
the CSV format, so I think it should be like that by default. I added an 
"IGNORE_WHITESPACE_STRATEGY" field that works the same as the old defaults. (/) 
{color:green}DONE{color}

- Removed the interpretUnicodeEscape option and replaced the escape field with 
an encapsuatorEscape field (as described in the CsvParser change details).

- Added an ignoreEncapsulationTrailingCharacters field. This is used to either 
ignore or append characters that are after an encapsulated value. Previously an 
IOException was being thrown here, which I don't think is ever a good idea.

- Added some restrictions to prevent the values from being things that would 
break the parser. This includes using a line break for anything or having equal 
two values (other than the encapsulator and encapsulator escape). (/) 
{color:green}DONE{color}


h3. CsvPrinter

- I changed this to use a modified version of the commons-lang escapeCsv method 
(I hope it is ok to copy a small amount of code from one commons project to 
another?). The escaping is a little different (and simpler) that the old 
version, but I think the commons-lang version seems to be the best way to do 
it. (x) {color:red}Good point on the code reuse, but escapeCsv doesn't handle 
properly a format using escaped characters instead of quoted values, such as 
the default MySQL format.{color}

- I added an option to the constructor to allow disabling auto-flushing of the 
output stream (similar to what is in the PrintStream class). I also reduced the 
number of times the output is flushed when using the print method that take 
array input. (x) {color:red}CSVPrinter now outputs to an Appendable instead of 
a Writer, which means you can use a PrintStream and control exactly the 
behavior of the flushing.{color}


h3. CharBuffer

- I didn't really make any changes other than to make it a non-public class. 
(/) {color:green}DONE{color}



    
> Lots of possible changes
> ------------------------
>
>                 Key: CSV-42
>                 URL: https://issues.apache.org/jira/browse/CSV-42
>             Project: Commons CSV
>          Issue Type: Improvement
>            Reporter: Bob Smith
>            Priority: Minor
>         Attachments: src.zip
>
>
> I made a lot of changes to pretty much all of the classes in the csv package. 
>  I thought it would be better to put all of the the changes here in one 
> issue, but feel free to only take the parts you like (if any).  Hopefully if 
> nothing else the test cases will be useful to you.
> I'll try to list most of the changes here, but I'm sure I'm forgetting some. 
> This should include all of the big changes at least. I focused mostly on the 
> parser, but I also made a few changes to the printer classes (although I 
> don't think I added any new test cases there).
> h3. General Changes
> - Changed all class names with "CSV" in them to use "Csv". This is how it 
> appears in the commons-lang "escapeCsv" methods and I think it's easier to 
> read the class name when acronyms are not in all upper case. (x) {color:red}3 
> letter acronyms are usually kept in uppercase (for example URLConnection or 
> SAXParser in the JDK, but there are some exceptions){color}
> - Formatted the code. I used Eclipse with a version of the Java formatting 
> style that uses spaces instead of tabs and with a few other small changes to 
> try to make it more similar to the style of this code. The formatting was 
> inconsistent before (sometimes 2 space indent, sometimes 4) which made it 
> really hard to work on. (/) {color:green}DONE{color}
> - Removed all deprecated methods/constructors (/) {color:green}DONE{color}
> - Made all public classes final. If there is ever a need to create subclasses 
> of them then this could be changed, but I think it would be better to at 
> least start them as final (since once they are released as non-final it's 
> hard to go back).
> - A few bug fixes (and test cases for them)
> h3. CsvParser
> - There were a few bugs for special cases, so I made as small of changes as I 
> could to the parser code to fix these.
> - Added a lot of test cases. I created a test case for all bugs that I found, 
> so even if you don't use my changes to this class you should be able to use 
> the test cases to find all of the same bugs.
> - Added a close method. (x) {color:red} The try-with-resources statement in 
> Java 7 makes resources management much easier, there is no need to add a 
> close() method to the parser.{color}
> - Renamed the nextValue method to getValue (so it is more consistent with the 
> getAll and getLine method names). I think I would prefer to use a different 
> method name prefix for all three of these (like "readAll") since I wouldn't 
> normally expect a "get" method to have side effects, but I didn't want to 
> just change the names of the most used methods. (x) {color:red}This method 
> has been removed, the parser now works line by line.{color}
> - Changed the getLineNumber method to return the correct line number when 
> there are multi-line values. (x) {color:red}The suggested code counts the 
> number of records instead of the number of lines. For debugging it's better 
> to return the actual line number.{color}
> - Moved all of the lexer methods into an inner CsvLexer class that is 
> completely independent of the CsvParser class. The methods were already 
> separated out, so it wasn't a very big change. I also moved the lexer test 
> cases into a new CsvLexerTest class. (/) {color:green}DONE{color}
> - Got rid of the interpreting unicode escape options. This doesn't really 
> have anything to do with parsing a CSV file so I think it should be left up 
> to the user of the class to implement this if needed. As an example, I made a 
> CsvParserUnicodeEscapeTest class that uses the code from the lexer in a 
> Reader subclass. One nice thing is that with this implementation, the 
> interpreted values can be used as the delimiter, encapsulator, etc. (/) 
> {color:green}DONE - The unicode unescaping is now handled by a class 
> implementing java.io.Reader (to be contributed to Commons IO).{color}
> - Got rid of the "escape" option for the same reason as the unicode escape 
> option. I replaced it with an encapsulator escape option that is only used as 
> an escape operator on the encapsulator character.
> h3. ExtendedBufferedReader
> - Greatly simplified this class. I removed all the methods that weren't being 
> used (including keeping track of the line number) and changed the lookahead 
> option to use the BufferedReader mark and reset methods. (/) 
> {color:green}DONE - ExtendedBufferedReader is still counting the lines, but 
> the mark/reset lookahead improved the performance by 30%.{color}
> h3. CsvStrategy
> - I split this class into three classes: an abstract base class 
> (CsvStrategy), a parser-specific version (CsvParseStrategy) and a 
> printer-specific version (CsvPrintStrategy). I didn't like that the strategy 
> was used for both parsing and printing even though some of the values only 
> applied to parsing (and there could be values that apply only to printing as 
> well). (x) {color:red}There aren't enough properties in CSVFormat to split 
> the class{color}
> - Made this class immutable (as described in SANDBOX-279) (/) 
> {color:green}DONE{color}
> - Changed the whitespace options to not ignore whitespace by default. This is 
> what the document at http://www.rfc-editor.org/rfc/rfc4180.txt recommends for 
> the CSV format, so I think it should be like that by default. I added an 
> "IGNORE_WHITESPACE_STRATEGY" field that works the same as the old defaults. 
> (/) {color:green}DONE{color}
> - Removed the interpretUnicodeEscape option and replaced the escape field 
> with an encapsulatorEscape field (as described in the CsvParser change 
> details). {color:green}DONE - Unicode unescaping has been removed from the 
> format parameters{color}
> - Added an ignoreEncapsulationTrailingCharacters field. This is used to 
> either ignore or append characters that are after an encapsulated value. 
> Previously an IOException was being thrown here, which I don't think is ever 
> a good idea.
> - Added some restrictions to prevent the values from being things that would 
> break the parser. This includes using a line break for anything or having 
> equal two values (other than the encapsulator and encapsulator escape). (/) 
> {color:green}DONE{color}
> h3. CsvPrinter
> - I changed this to use a modified version of the commons-lang escapeCsv 
> method (I hope it is ok to copy a small amount of code from one commons 
> project to another?). The escaping is a little different (and simpler) that 
> the old version, but I think the commons-lang version seems to be the best 
> way to do it. (x) {color:red}Good point on the code reuse, but escapeCsv 
> doesn't handle properly a format using escaped characters instead of quoted 
> values, such as the default MySQL format.{color}
> - I added an option to the constructor to allow disabling auto-flushing of 
> the output stream (similar to what is in the PrintStream class). I also 
> reduced the number of times the output is flushed when using the print method 
> that take array input. (x) {color:red}CSVPrinter now outputs to an Appendable 
> instead of a Writer, which means you can use a PrintStream and control 
> exactly the behavior of the flushing.{color}
> h3. CharBuffer
> - I didn't really make any changes other than to make it a non-public class. 
> (/) {color:green}DONE{color}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CSV-42) Lots of possible changes

Reply via email to