New DFDL Character Classes

Steve Lawrence Tue, 13 Mar 2018 05:10:58 -0700

In talking with a user off list, we realized there might be some use for
expanded character classes, and a potential issue with existing ones.


The potential issue is that delimiter scanning fails if a %NL; character
class follows a %WSP*; or %WSP+; character class. The reason is that
%WSP*; and %WSP+; greedily consume all whitespace, including newlines,
leaving nothing for the following %NL; character class to consume. For
example, say we had the following in the data:

 0x20 0x20 0x20 0x0A

So three spaces followed by a line feed. Say we want to consume zero or
more spaces followed by a newline (i.e. gobble up an empty line). The
obvious delimiter is %WSP*;%NL; but that doesn't work for the reason
mentioned above: %WSP*; consumes all the spaces and the newline, leaving
nothing for the %NL; to consume. We could instead use just "%WSP+;" as
the delimiter, but that does not require that the match end in a
newline.

In order to support this type of delimiter, I think we would need to add
either lookahead or backtracking to our delimiter scanner, which adds
extra complexity that I'm not sure would be worth it.
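
To make the failure concrete, here is a minimal Python sketch of a
greedy, left-to-right matcher with no lookahead or backtracking. It is
not the actual delimiter scanner; the function names are hypothetical,
the character sets are simplified stand-ins for the real WSP and NL
definitions, and %NL; is treated as a single character.

```python
# Simplified stand-ins for the DFDL classes (assumption: the real sets
# contain more characters, and %NL; can also match CRLF).
NL_CHARS = {"\n", "\r"}
WSP_CHARS = {" ", "\t"} | NL_CHARS


def match_greedy(data, pos, chars, minimum):
    """Consume as many characters from `chars` as possible, never giving any back."""
    end = pos
    while end < len(data) and data[end] in chars:
        end += 1
    return end if end - pos >= minimum else None


def match_wsp_star_then_nl(data):
    """Match %WSP*;%NL; at position 0 without lookahead or backtracking."""
    pos = match_greedy(data, 0, WSP_CHARS, 0)      # %WSP*; (always succeeds)
    if pos < len(data) and data[pos] in NL_CHARS:  # %NL;
        return pos + 1
    return None


data = "\x20\x20\x20\x0A"  # three spaces followed by a line feed
print(match_wsp_star_then_nl(data))  # prints None
```

Under these assumptions the match can never succeed: whatever character
follows the greedy %WSP*; run is by definition not whitespace, so it
cannot be a newline either.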

As an alternative, I think it might be helpful to add a new character
class that matches everything that WSP matches EXCEPT for newline
characters. We would want to have + and * variations of this as well.
I'm not sure what to call it (WSPX?) but that's a minor issue. Then the
above could be matched via %WSPX*;%NL;
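
As a rough illustration of why this works, the following Python sketch
matches %WSPX*;%NL; greedily with no lookahead or backtracking. The
function name is hypothetical, the character sets are simplified
stand-ins for the real class definitions, and %NL; is treated as a
single character.

```python
# Simplified stand-ins for the DFDL classes (assumption: the real sets
# contain more characters, and %NL; can also match CRLF).
NL_CHARS = {"\n", "\r"}
WSP_CHARS = {" ", "\t"} | NL_CHARS
WSPX_CHARS = WSP_CHARS - NL_CHARS  # proposed class: WSP minus newlines


def match_wspx_star_then_nl(data):
    """Match %WSPX*;%NL; at position 0; return the end offset or None."""
    pos = 0
    while pos < len(data) and data[pos] in WSPX_CHARS:  # %WSPX*;
        pos += 1
    if pos < len(data) and data[pos] in NL_CHARS:       # %NL;
        return pos + 1
    return None


print(match_wspx_star_then_nl("\x20\x20\x20\x0A"))  # prints 4
print(match_wspx_star_then_nl("\x20\x20\x20"))      # prints None
```

Because WSPX and NL share no characters, the greedy run stops exactly
where the newline starts, so nothing ever needs to be given back.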

Related, there are cases where a user might want to match one or more
newlines. Right now, to support an arbitrary number of newlines you need
to do something like this:

  dfdl:separator="%NL; %NL;%NL; %NL;%NL;%NL; %NL;%NL;%NL;%NL;"

This only matches one to four newlines, so it is both messy and limited.
I think having %NL*; and %NL+; variations might also be beneficial.
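
For comparison, here is a small Python sketch of what a %NL+; match
could look like. It is only an illustration: the function names are
hypothetical, and the newline set is limited to LF, CR, and CRLF, which
is an assumption that leaves out any other characters %NL; matches.

```python
def match_nl(data, pos):
    """Match a single newline (CRLF, LF, or CR) at pos; return end offset or None."""
    if data.startswith("\r\n", pos):
        return pos + 2
    if pos < len(data) and data[pos] in ("\n", "\r"):
        return pos + 1
    return None


def match_nl_plus(data, pos=0):
    """Match one or more consecutive newlines; return end offset or None."""
    end = match_nl(data, pos)
    if end is None:
        return None  # %NL+; requires at least one newline
    while True:
        nxt = match_nl(data, end)
        if nxt is None:
            return end
        end = nxt


print(match_nl_plus("\n\n\n\n\n\nabc"))  # prints 6
print(match_nl_plus("abc"))              # prints None
```

Unlike the enumerated separator, this has no upper bound on the number
of newlines it accepts.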

So, I propose we add three new character classes for matching all
whitespace characters except for those matched by %NL;:

  %WSPX;
  %WSPX*;
  %WSPX+;

And two new newline character classes to match zero or more, or one or
more, newlines:

  %NL*;
  %NL+;

I also propose that it is an SDE (schema definition error) if a
character class that contains * or + is immediately followed by a
character inside that character class. The reason is that such sequences
will always require either lookahead or backtracking, which adds extra
complexity to our delimiter scanning without significant gain.
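
A check along these lines could be sketched in Python as follows. This
is only a rough illustration under stated assumptions: the names
(CLASSES, tokenize, needs_backtracking) are hypothetical, the character
sets are simplified, only the class names listed in CLASSES are
recognized, and the input is a single delimiter rather than a
space-separated list of alternatives.

```python
import re

# Simplified stand-ins for the character classes (assumption).
NL_CHARS = {"\n", "\r"}
CLASSES = {
    "NL": NL_CHARS,
    "WSP": {" ", "\t"} | NL_CHARS,
    "WSPX": {" ", "\t"},
}

# Either a class reference such as %WSP*; or a single literal character.
TOKEN = re.compile(r"%([A-Z]+)([*+]?);|(.)", re.DOTALL)


def tokenize(delimiter):
    """Split one delimiter into (char_set, is_repeating) pairs."""
    tokens = []
    for name, repeat, literal in TOKEN.findall(delimiter):
        if name:
            tokens.append((CLASSES[name], repeat != ""))
        else:
            tokens.append(({literal}, False))
    return tokens


def needs_backtracking(delimiter):
    """True if a * or + class is immediately followed by something it can match."""
    tokens = tokenize(delimiter)
    for (chars, repeating), (next_chars, _) in zip(tokens, tokens[1:]):
        if repeating and chars & next_chars:
            return True
    return False


print(needs_backtracking("%WSP*;%NL;"))   # prints True
print(needs_backtracking("%WSPX*;%NL;"))  # prints False
```

A schema compiler could report the SDE whenever such a check returns
True for a delimiter.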

Thoughts?
- Steve
