Hi,

While looking for potential performance optimizations, I came across
CSVLexer.isEndOfLine(int c). Here is the source:

    private boolean isEndOfLine(int c) throws IOException {
        // check if we have \r\n...
        if (c == '\r' && in.lookAhead() == '\n') {
            // note: does not change c outside of this method !!
            c = in.read();
        }
        return (c == '\n' || c == '\r');
    }

This method assumes that a line separator will always be "\n", "\r" or
"\r\n". This is true for the pre-configured CSVFormats EXCEL, TDF and
MYSQL. I'm not a pro when it comes to file encodings, but isn't it
possible that new encodings will have different line separators?
If that is the case, isEndOfLine() should somehow use
format.getLineSeparator().
For example, the lookAhead only has to be done if
lineSeparator.length() > 1. This may have a positive impact on the
performance of parsing files with an encoding whose line separator is
only one char long.
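
A rough sketch of the idea, not the actual Commons CSV code: the class
SeparatorAwareLexer, its String-backed read()/lookAhead() and the
lineSeparator field are all made up for illustration. I am only assuming
that the separator would come from format.getLineSeparator().

```java
/** Hypothetical sketch; not the Commons CSV implementation. */
final class SeparatorAwareLexer {
    private final String input;          // stands in for the lexer's buffered reader
    private final String lineSeparator;  // assumed to come from format.getLineSeparator()
    private int pos = 0;

    SeparatorAwareLexer(String input, String lineSeparator) {
        this.input = input;
        this.lineSeparator = lineSeparator;
    }

    /** Reads the next char, or returns -1 at the end of the input. */
    int read() {
        return pos < input.length() ? input.charAt(pos++) : -1;
    }

    /** Peeks at the next char without consuming it, or returns -1 at the end. */
    int lookAhead() {
        return pos < input.length() ? input.charAt(pos) : -1;
    }

    boolean isEndOfLine(int c) {
        if (lineSeparator.length() == 1) {
            // Single-char separator: one comparison, no look-ahead needed.
            return c == lineSeparator.charAt(0);
        }
        // Multi-char separator such as "\r\n": same logic as the current method.
        if (c == '\r' && lookAhead() == '\n') {
            c = read(); // consume the '\n' that belongs to this line ending
        }
        return c == '\n' || c == '\r';
    }

    public static void main(String[] args) {
        SeparatorAwareLexer unix = new SeparatorAwareLexer("\nb", "\n");
        System.out.println(unix.isEndOfLine(unix.read()));    // true

        SeparatorAwareLexer windows = new SeparatorAwareLexer("\r\nb", "\r\n");
        System.out.println(windows.isEndOfLine(windows.read())); // true
        System.out.println((char) windows.read());               // b
    }
}
```

Note that this changes behaviour: with a one-char separator of "\n", a
lone "\r" would no longer count as end of line. Whether that is
acceptable would have to be decided before doing something like this.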

Benedikt
