On Thu, Dec 15, 2005 at 09:56:09PM +0000, Luke Palmer wrote:
> On 12/15/05, Brad Bowman <[EMAIL PROTECTED]> wrote:
> > Why does the longest input sequence win?
> >    Is it for some consistency that I'm not seeing? Some exceedingly
> > common use case?  The rule seems unnecessarily restrictive.
> 
> Hmm.  Good point.  You see, the longest token wins because that's an
> exceedingly common rule in lexers, and you can't sort regular
> expressions the way you can sort strings, so there needs to be special
> machinery in there.
> 
> There are two rather weak arguments to keep the longest token rule:
> 
>     * We could compile the transliteration into a DFA and make it
> fast.  Premature optimization.
>     * We could generalize transliteration to work on rules as well.
> 
> In fact, I think the first Perl module I ever wrote was
> Regexp::Subst::Parallel, which did precisely the second of these. 
> That's one of the easy things that was hard in Perl (but I guess
> that's what CPAN is for).  Hmm.. none of these is really a compelling
> argument either way.
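
For constant string candidates, at least, the special machinery is
cheap to fake by hand: sort the candidates longest-first before
building the match.  A rough Perl 5 sketch (not the Perl 6
transliteration API, and nothing to do with Regexp::Subst::Parallel):

    # Candidate tokens, deliberately listed shortest-first.
    my @tokens = ('=', '==', '=>');

    # Longest-first ordering makes the alternation try '==' and '=>'
    # before the bare '=' at each position.
    my $alt = join '|',
              map  { quotemeta }
              sort { length($b) <=> length($a) } @tokens;

    my $src = 'a == b => c = d';
    (my $out = $src) =~ s/($alt)/[$1]/g;
    # $out is now 'a [==] b [=>] c [=] d'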

If a shorter rule is allowed to match first, then the longer rule
might as well be removed from the match set, at least for constant
string matches.  If, for example, '=' can match without '==' being
tried first, then '==' will never match without syntactic help to
force a backtracking retry.
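
Plain Perl 5 alternation shows the failure mode (the leftmost
alternative wins; there's no longest-token rule there):

    my $src = 'x == y';

    # Shorter alternative listed first: '==' can never win.
    (my $short_first = $src) =~ s/(?:=|==)/[EQ]/g;
    # $short_first is 'x [EQ][EQ] y'

    # Longer alternative listed first behaves like longest-token.
    (my $long_first = $src) =~ s/(?:==|=)/[EQ]/g;
    # $long_first is 'x [EQ] y'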
