In talking with a user off list, we realized there might be some use for expanded character classes, and a potential issue with existing ones.
The potential issue is that delimiter scanning fails if you have a %NL; character class after a %WSP*; or %WSP+; character class. The reason is that WSP*/WSP+ greedily consumes all whitespace, including newlines, leaving nothing for a following %NL; character class to consume.

For example, say we had the following in the data:

  0x20 0x20 0x20 0x0A

So three spaces followed by a line feed. Suppose we want to consume zero or more spaces followed by a newline (i.e. gobble up an empty line). The obvious delimiter is

  %WSP*;%NL;

but that doesn't work for the reason mentioned above: WSP* will consume all the spaces and the newline, leaving nothing for the %NL; to consume. We could instead use just "%WSP+;" as the delimiter, but that does not require that the match end in a newline.

In order to support this type of delimiter, I think we would need to add either forward lookahead or backtracking to our delimiter scanner, which adds extra complexity that I'm not sure would be worth it.

As an alternative, I think it might be helpful to add a new character class that matches everything that WSP matches EXCEPT for newline characters. We would want to have + and * variations of this as well. I'm not sure what to call it (WSPX?) but that's a minor issue. Then the above could be matched via

  %WSPX*;%NL;

Related, there are cases where a user might want to match one or more newlines. Right now, if you need to support an arbitrary number of newlines, you have to do something like this:

  dfdl:separator="%NL; %NL;%NL; %NL;%NL;%NL; %NL;%NL;%NL;%NL;"

which can only match one to four newlines. So it is both messy and limited. I think having %NL*; and %NL+; variations might also be beneficial.
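To make the failure concrete, here is a rough Python sketch of a greedy scanner with no lookahead or backtracking. It is not how our delimiter scanner is actually implemented. The character sets are simplified stand-ins for the real classes, CRLF is not treated as a single newline, WSPX is the hypothetical class suggested above, and greedy_match is a made-up helper name.

```python
# Simplified stand-ins for the character classes (assumption: the real
# WSP and NL sets are larger, and %NL; treats CRLF as one newline).
NL_CHARS = frozenset({"\n", "\r", "\u0085", "\u2028"})
WSP_CHARS = frozenset({" ", "\t", "\x0b", "\x0c"}) | NL_CHARS
WSPX_CHARS = WSP_CHARS - NL_CHARS  # hypothetical: WSP minus newlines

# Each element is (allowed characters, minimum count, maximum count or None).
WSP_STAR = (WSP_CHARS, 0, None)
WSPX_STAR = (WSPX_CHARS, 0, None)
NL_ONE = (NL_CHARS, 1, 1)
NL_PLUS = (NL_CHARS, 1, None)


def greedy_match(elements, data):
    """Match elements in order, each consuming as much as it can.

    Returns the number of characters consumed, or None if some element
    cannot reach its minimum count. There is no backtracking.
    """
    pos = 0
    for chars, lo, hi in elements:
        count = 0
        while (pos < len(data) and data[pos] in chars
               and (hi is None or count < hi)):
            pos += 1
            count += 1
        if count < lo:
            return None
    return pos


DATA = "   \n"  # 0x20 0x20 0x20 0x0A

print(greedy_match([WSP_STAR, NL_ONE], DATA))   # None: WSP* ate the newline
print(greedy_match([WSPX_STAR, NL_ONE], DATA))  # 4: spaces, then the newline
```

In the same sketch, greedy_match([NL_PLUS], ...) consumes any run of newlines, which is what the separator workaround above can only approximate up to four.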
So, I propose we add three new character classes for matching all whitespace characters except for those matched by %NL;:

  %WSPX;
  %WSPX*;
  %WSPX+;

And two new newline character classes, to match zero or more and one or more newlines respectively:

  %NL*;
  %NL+;

I also propose that it is an SDE if a character class that contains * or + is immediately followed by a character inside that character class. The reason is that such sequences will always require either lookahead or backtracking, which adds extra complexity to our delimiter scanning without significant gain.

Thoughts?

- Steve
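P.S. A rough sketch of what the proposed SDE check could look like, in Python for brevity. This is only an illustration under assumptions: the character sets are simplified stand-ins for the real classes, and find_ambiguity and the tuple layout are made-up names, not anything that exists in the code base.

```python
# Simplified stand-ins for the character classes (assumption: the real
# sets are larger). WSPX is the proposed class, WSP minus newlines.
NL = frozenset({"\n", "\r", "\u0085", "\u2028"})
WSP = frozenset({" ", "\t", "\x0b", "\x0c"}) | NL
WSPX = WSP - NL


def find_ambiguity(elements):
    """Return the first (name, next_name) pair that would be an SDE.

    elements is a list of (name, chars, repeating) tuples, where
    repeating is True for the * and + variants. A pair is flagged when
    a repeating class is immediately followed by an element that can
    start with a character the repeating class also matches.
    """
    for current, following in zip(elements, elements[1:]):
        name, chars, repeating = current
        next_name, next_chars, _ = following
        if repeating and chars & next_chars:
            return (name, next_name)
    return None


bad = [("%WSP*;", WSP, True), ("%NL;", NL, False)]
good = [("%WSPX*;", WSPX, True), ("%NL;", NL, False)]

print(find_ambiguity(bad))   # ('%WSP*;', '%NL;')
print(find_ambiguity(good))  # None
```

A literal character in the delimiter can be modeled the same way, as a one-element set with repeating set to False.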
