In talking with a user off list, we realized there might be some use for
expanded character classes, and a potential issue with existing ones.

The potential issue is that delimiter scanning fails if you have a %NL;
character class after a %WSP*; or %WSP+; character class. The reason is
that WSP*/WSP+ greedily consumes all whitespace, including newlines,
leaving nothing for a following %NL; character class to consume. For
example, say we had the following in the data:
0x20 0x20 0x20 0x0A
So three spaces followed by a line feed. Say we want to consume zero or
more spaces followed by a newline (i.e. gobble up an empty line). The
obvious delimiter is %WSP*;%NL; but that doesn't work for the reason
mentioned above: WSP* consumes all the spaces and the newline, leaving
nothing for the %NL; to consume. We could instead use just "%WSP+;"
as the delimiter, but that does not require that the match end in a
newline.
In order to support this type of delimiter, I think we would need to add
either lookahead or backtracking to our delimiter scanner, which adds
extra complexity that I'm not sure would be worth it.

As an alternative, I think it might be helpful to add a new character
class that matches everything that WSP matches EXCEPT for newline
characters. We would want to have + and * variations of this as well.
I'm not sure what to call it (WSPX?) but that's a minor issue. Then the
above could be matched via %WSPX*;%NL;
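
To make the failure and the proposed fix concrete, here is a rough Python
sketch of a greedy matcher with no lookahead or backtracking. It is only a
model for discussion, not our actual delimiter scanner: the greedy_match
function, the token tuples, and the character sets (a small subset of what
WSP and NL really match, with no CRLF handling) are all assumptions made
for illustration, and WSPX is just the tentative name from above.

```python
# Simplified, hypothetical character sets (subsets of the real classes).
WSP = set(" \t\n\r")
NL = set("\n\r")          # single-character newlines only, no CRLF handling
WSPX = WSP - NL           # tentative name: whitespace except newlines

CLASSES = {"WSP": WSP, "NL": NL, "WSPX": WSPX}


def greedy_match(tokens, data):
    """Match tokens against the start of data, greedily, never backing up.

    tokens is a list of (class_name, quantifier) pairs, where quantifier
    is "", "*", or "+". Returns the number of characters consumed, or
    None if the match fails.
    """
    pos = 0
    for name, quant in tokens:
        chars = CLASSES[name]
        start = pos
        if quant == "":
            # Exactly one character from the class is required.
            if pos < len(data) and data[pos] in chars:
                pos += 1
            else:
                return None
        else:
            # Greedily consume as many characters as possible.
            while pos < len(data) and data[pos] in chars:
                pos += 1
            if quant == "+" and pos == start:
                return None
    return pos


data = "   \n"  # 0x20 0x20 0x20 0x0A

# %WSP*;%NL; fails: WSP* eats the newline too, so NL has nothing left.
print(greedy_match([("WSP", "*"), ("NL", "")], data))   # None

# %WSPX*;%NL; succeeds: WSPX* stops at the newline.
print(greedy_match([("WSPX", "*"), ("NL", "")], data))  # 4

# %WSP+; matches, but also matches data that has no newline at all.
print(greedy_match([("WSP", "+")], "   "))              # 3
```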

Related, there are cases where a user might want to match one or more
newlines. Right now, you cannot match an arbitrary number of newlines
and have to do something like this:
dfdl:separator="%NL; %NL;%NL; %NL;%NL;%NL; %NL;%NL;%NL;%NL;"
This is limited to matching one to four newlines, so it is both messy
and limited. I think having %NL*; and %NL+; variations might also be
beneficial.
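
As a rough illustration of why the enumerated workaround scales poorly,
the Python sketch below builds the separator list for a given maximum and
compares it with what a %NL+; class would match. The function names are
made up for this example, only LF is treated as a newline to keep it
short, and it assumes the longest alternative in the list wins.

```python
def enumerated_separator(max_newlines):
    """Build the space-separated alternatives for 1..max_newlines newlines."""
    return " ".join("%NL;" * n for n in range(1, max_newlines + 1))


def longest_enumerated_match(data, max_newlines):
    """Newlines consumed by the enumerated list (capped at max_newlines)."""
    count = 0
    while count < min(len(data), max_newlines) and data[count] == "\n":
        count += 1
    return count if count > 0 else None


def nl_plus_match(data):
    """Newlines consumed by a hypothetical %NL+; class (no upper bound)."""
    count = 0
    while count < len(data) and data[count] == "\n":
        count += 1
    return count if count > 0 else None


print(enumerated_separator(4))
# %NL; %NL;%NL; %NL;%NL;%NL; %NL;%NL;%NL;%NL;

six_newlines = "\n" * 6
print(longest_enumerated_match(six_newlines, 4))  # 4 (two newlines left over)
print(nl_plus_match(six_newlines))                # 6
```

The enumerated list needs n(n+1)/2 %NL; entities to cover up to n
newlines, and it still stops matching at n.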

So, I propose we add three new character classes for matching all
whitespace characters except for those matched by %NL; (tentatively
%WSPX;, %WSPX*;, and %WSPX+;), and two new newline character classes,
%NL*; and %NL+;, to match zero or more and one or more newlines.

I also propose that it is an SDE if a character class that contains * or
+ is immediately followed by a character inside that character class.
The reason is that such sequences will always require either lookahead
or backtracking, which adds extra complexity to our delimiter scanning
without significant gain.
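
A check along these lines could look roughly like the Python sketch below.
It is only meant to show the shape of the rule: the tokenizer, the
needs_backtracking name, and the simplified character sets are assumptions
for illustration, and it treats a following character class as a conflict
whenever it overlaps the starred/plus class, which is my reading of the
rule rather than settled behavior. It also ignores escapes such as %% and
any character class entities not listed in CLASSES.

```python
import re

# Simplified, hypothetical character sets (subsets of the real classes).
WSP = set(" \t\n\r")
NL = set("\n\r")
CLASSES = {"WSP": WSP, "NL": NL, "WSPX": WSP - NL}

# Either a character class entity like %WSP*; or a single literal character.
TOKEN = re.compile(r"%([A-Z]+)([*+]?);|(.)", re.DOTALL)


def tokenize(delim):
    """Split a delimiter into (set_of_chars, quantifier) pairs."""
    tokens = []
    for m in TOKEN.finditer(delim):
        if m.group(1):
            tokens.append((CLASSES[m.group(1)], m.group(2)))
        else:
            tokens.append(({m.group(3)}, ""))
    return tokens


def needs_backtracking(delim):
    """True if a * or + class is immediately followed by something it can match."""
    tokens = tokenize(delim)
    for (chars, quant), (next_chars, _) in zip(tokens, tokens[1:]):
        if quant and chars & next_chars:
            return True
    return False


print(needs_backtracking("%WSP*;%NL;"))   # True, would be an SDE
print(needs_backtracking("%WSPX*;%NL;"))  # False
print(needs_backtracking("%NL+;%NL;"))    # True, would be an SDE
print(needs_backtracking("%WSP*;,"))      # False
```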