On 2 May 2013 18:26, Benedikt Ritter <brit...@apache.org> wrote: > 2013/5/2 sebb <seb...@gmail.com> > > > On 2 May 2013 14:56, sebb <seb...@gmail.com> wrote: > > > > > On 2 May 2013 09:48, Benedikt Ritter <brit...@apache.org> wrote: > > > > > >> 2013/5/1 sebb <seb...@gmail.com> > > >> > > >> > On 1 May 2013 08:53, Benedikt Ritter <brit...@apache.org> wrote: > > >> > > > >> > > Hi, > > >> > > > > >> > > I have tried to solve CSV-58 - Escape handling needs rethinking > [1] > > a > > >> few > > >> > > times over the past weeks now. But I can not see a way to get this > > >> > working. > > >> > > There are two problems with our current implementation: > > >> > > > > >> > > 1. a sequence of " test \a test" will be parsed to "test a test", > > >> > although > > >> > > there is no reason to escape 'a'. The expected behavior is to let > > the > > >> > > sequence \a unchanged. > > >> > > > > >> > > > >> > An alternative would be to throw an Exception. > > >> > > > >> > > >> Haven't thought about that alternative so far. It is was the java > > compiler > > >> would do if it finds "\a" in a String. > > >> If we decide to through an exception, the message should state very > > >> clearly > > >> what has gone wrong, including the position of the misplaced escape > > char. > > >> I'd like to hear more opinions on this. > > >> > > >> > > > Throwing an exception stops the parsing, and obviously only catches the > > > first error. > > > Leaving the sequences as is means potentially having to scan a large > > > output file to find them all. > > > > > > There's another option: invoke a callback when any invalid sequences > are > > > detected. > > > > > > That would allow for various possibilities, depending on what callback > > was > > > provided. > > > - throw Exception with details > > > - ignore (with optional log) > > > - replace with some other sequence (with optional log) > > > > > > I don't think CSV should do any logging of its own, but the caller can > do > > > so in the callback > > > > > > However we would need to decide what the default callback should do. > > > If we throw an Exception, then it is obvious that something is wrong. > > > If we ignore the sequence (leave as is) then there is no easy way for > the > > > caller to know if there have been any errors. > > > > > > > > >> > > > >> > > > >> > > 2. If the escape char is different from backslash (for example > > '!'), a > > >> > > sequence of "test !r test" will be parsed to "test CR test" > > >> > > > > >> > > > > >> > Is that how other CSV parsers behave? i.e. do they always use \ for > > >> > escaping EOL? > > >> > > > >> > > >> Haven't tested this. I'll try to have a look this weekend. > > >> > > >> The problem with the current impl is a backslash as part of an escape > > >> sequence. > > >> This is what readEscape() currently does: It checks if the lexer has > > >> reached a character combination of (for example) \r and replaces this > > with > > >> CR. Otherwise it swallows the escape character and returns the next > > >> character. > > >> > > >> There are two problems with this: > > >> 1. it does so if it recognizes the escape character (which doesn't > have > > to > > >> be a backslash). This is why !r gets replaced by CR. > > >> 2. it swallows the escape and returns the next character (default > branch > > >> in > > >> the switch statement). This should only happen if the next character > is > > >> one > > >> of the special characters from the format (e.g. the delimiter). This > is > > >> why > > >> '\a' gets replaced by 'a' > > >> > > > > > > I think that's wrong, because one cannot distinguish "\a" from a". > > > > > > The readEscape() method currently returns an int; this cannot be used > to > > > distinguish between a valid escape and some other character. > > > If we only want to support throwing an Exception for invalid escape > > > sequence, then it's trivial to fix this - just throw the Exception. > > > > > > However, to support "as is" passthrough of invalid sequences, the > caller > > > needs to know when to restore the escape character to the output. > > > > > > One way to do this would be to change the return type to a char[] > array. > > > This would either be a single char or the escape followed by a single > > char. > > > The return array could just be appended to the buffer. This should work > > > quite well with a call-back, as the callback could also return a char > > array. > > > > > > The only downside I can see is that append(char[]) may be slightly > slower > > > than append(char). > > > > > > > > Found another way to do it: return -1 (EOF) which cannot otherwise be > > returned by the method. > > The last character is then accessible via in.getLastChar(). > > i.e. if -1 is returned, output escape followed by lastChar. > > > > > > Needs some tweaks to readEscape, because \LF etc. stand for themselves, > > i.e. both \r and \CR => CR. > > > > What needs to be decided is whether both !r and !CR => CR if the escape > > character is ! rather than \, and if so, whether \CR still stands for CR. > > > > But I think the parsing can be tweaked without too much rewriting. > > > > I think there are four cases to cover (I'm using ! as escape in this > example): > 1. !r => nothing to escape, append '!r' >
Not sure I agree with that; currently \r => CR if \ is the escape. > 2. !\r => the CR character in its character representation. Replace '\r' by > CR and escape it. continue parsing of the token > Not sure I agree here either. I think that's an invalid sequence unless \ is one of the meta-characters. Should CSV unconditionally recognise \r \n etc as CR LF? If so, then they don't need a ! escape beforehand. > 3. !CR => esacpe CR and contine parsing of the token > !CR should be replaced by CR, i.e. *un*escaped 4. !\CR => escape '\' if it is one of the characters used in the format, in > other words: remove the '!' and append the '\'. Then EOL => next record, > but only if CR is the record separator (?). > > I've tried several times and failed. Do you have the time to implement your > proposal? I'll probably have some time this weekend to try once again :-) > > Yes, I think I can make a start on amending readEscape. However that won't get us anywhere until the rules are clear; I don't think they are currently. It should be possible to round-trip escaping then unescaping. It may be easier to consider all possible escaped output to decide how to unescape, and what is not a valid sequence. > > > > > > > > An alternative is to throw an exception. > > >> > > >> I think it may be necessary to separate backslash handling from escape > > >> handling. > > >> > > >> > > >> > > > Possibly. However that may make parsing harder. > > > > > > > > >> > > > >> > > > >> > > I have tried different approaches to handle this issue but there > are > > >> > always > > >> > > corner cases that I'm unable to cover. I'm getting the feeling > that > > >> I'm > > >> > > just to dump to solve this, so any comments or suggestions would > be > > >> > really > > >> > > appreciated! > > >> > > > > >> > > > > >> > Perhaps it would help to commit the unit tests you are working to > > solve. > > >> > The tests can be disabled. > > >> > > > >> > Or perhaps better use a class whose name is not one of the ones > picked > > >> up > > >> > by Surefire. > > >> > This allows the test to be run independently by Eclipse (and Maven) > > but > > >> > won't mess with CI builds. > > >> > > > >> > > >> I have already committed three tests to CSVLexerTest and set them to > be > > >> ignored. Maybe you can review those and tell me whether my assumptions > > are > > >> correct. > > >> > > >> > > >> > > > >> > Benedikt > > >> > > [1] https://issues.apache.org/jira/browse/CSV-58 > > >> > > > > >> > > -- > > >> > > http://people.apache.org/~britter/ > > >> > > http://www.systemoutprintln.de/ > > >> > > http://twitter.com/BenediktRitter > > >> > > http://github.com/britter > > >> > > > > >> > > > >> > > >> > > >> > > >> -- > > >> http://people.apache.org/~britter/ > > >> http://www.systemoutprintln.de/ > > >> http://twitter.com/BenediktRitter > > >> http://github.com/britter > > >> > > > > > > > > > > > > -- > http://people.apache.org/~britter/ > http://www.systemoutprintln.de/ > http://twitter.com/BenediktRitter > http://github.com/britter >