Hi Steffen,

in fact, I considered B) directly, but not to the point of building C) and
D) on top of it (with an obvious A, though that one is a given in Pharo's
implementation of strings).

Thanks for the explanation, then.

Regards,

Thierry

2017-11-08 14:21 GMT+01:00 Steffen Märcker <merk...@web.de>:

> I see. How about the following (sketched) solution to avoid looping over
> all characters? It may very well be the case that you have already
> considered (and dismissed) that path.
>
> A) Assumption
> In order to allow any meaningful matching, the input to the scanner is
> normalized according to the Unicode spec.
>
> B) Abstraction
> Treat each character and character group of a regex as a set of intervals
> over the Unicode code points. Let's call them "character tests" and lift
> the common set operations union, intersection, and difference to them.
>
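> A minimal sketch of that abstraction in Pharo (class and selector names
> are mine, purely illustrative, not an existing API; accessors and the
> setter are omitted):
>
> Object subclass: #CharacterTest
>     instanceVariableNames: 'intervals'
>     classVariableNames: ''
>     package: 'Regex-Sketch'
>
> CharacterTest class >> onIntervals: aCollection
>     "aCollection: sorted, disjoint (low to: high) code point ranges"
>     ^ self new setIntervals: aCollection asOrderedCollection
>
> CharacterTest >> matches: aCodePoint
>     ^ intervals anySatisfy: [:each | each includes: aCodePoint]
>
> CharacterTest >> intersection: aCharacterTest
>     "Pairwise-intersect the interval lists; keep non-empty results.
>      Union and difference are analogous merges of the sorted lists."
>     | result |
>     result := OrderedCollection new.
>     intervals do: [:mine |
>         aCharacterTest intervals do: [:other | | low high |
>             low := mine first max: other first.
>             high := mine last min: other last.
>             low <= high ifTrue: [result add: (low to: high)]]].
>     ^ self class onIntervals: result
>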
> C) Construct NFA
> The NFA has potentially overlapping character tests at the transitions of
> each state.
>
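> For instance, with the sketch above, the regex [a-m]|[k-z] gives the NFA
> start state two tests whose intervals overlap on [k-m] (code points shown
> for illustration):
>
> t1 := CharacterTest onIntervals: { 97 to: 109 }.    "[a-m]"
> t2 := CharacterTest onIntervals: { 107 to: 122 }.   "[k-z]"
> t1 intersection: t2.    "a test for (107 to: 109), i.e. [k-m]"
>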
> D) Construct DFA
> Given a product state s in the DFA and two transitions t1, t2 from the
> original NFA, add three new transitions to the DFA:
> - a transition labeled with the character test of t1 minus the character
> test of t2
> - a transition labeled with the intersection of the character tests of t1
> and t2
> - a transition labeled with the character test of t2 minus the character
> test of t1
>
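> A sketch of that step (selector names are illustrative; difference: and
> intersection: are the lifted set operations from B, and isEmpty tests for
> an empty interval set). The transition labeled with the intersection goes
> to the product of both NFA targets; an empty test yields no transition:
>
> addTransitionsFor: t1 and: t2 toDfaState: s
>     | only1 both only2 |
>     only1 := t1 test difference: t2 test.
>     both := t1 test intersection: t2 test.
>     only2 := t2 test difference: t1 test.
>     {only1 -> {t1 target}.
>      both -> {t1 target. t2 target}.
>      only2 -> {t2 target}}
>         do: [:assoc |
>             assoc key isEmpty ifFalse: [
>                 s addTransitionOn: assoc key toNfaStates: assoc value]]
>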
> E) Extension
> Instead of sets of Unicode intervals, we could also use test functions,
> e.g. blocks. Then, in step D), the set operations translate to boolean
> operations:
> - the difference t1 - t2 becomes: t1 && not t2
> - the intersection of t1 and t2 becomes: t1 && t2
> This would allow using optimized test functions, e.g. BDDs, instead of
> relying on character tests only.
>
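> With plain blocks, that translation is direct (illustrative):
>
> t1 := [:c | c between: 97 and: 122].    "[a-z]"
> t2 := [:c | c between: 48 and: 57].     "[0-9]"
> difference := [:c | (t1 value: c) and: [(t2 value: c) not]].
> intersection := [:c | (t1 value: c) and: [t2 value: c]].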
>
> Cheers,
> Steffen
>
> On 07.11.2017 at 23:16, Thierry Goubier <thierry.goub...@gmail.com> wrote:
>
> On 07/11/2017 at 23:00, Steffen Märcker wrote:
>>
>>> I am not familiar with the literature on scanners. May I ask you about
>>> some details on the "for all characters" algorithms you are referring to?
>>>
>>
>> The two main ones available, from the typical Aho/Ullman textbook, are:
>>
>> - NFA to DFA conversion (i.e. build an NFA with your regular expressions,
>> then convert it to a DFA)
>>
>> - Direct regular expression to DFA construction
>>
>> Both of them have a loop of the type:
>>
>> for (each input symbol a) {
>> ...
>>
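>> In Smalltalk terms, the costly core of the subset construction looks
>> roughly like this (selector names are illustrative, not SmaCC's API;
>> marking and enqueueing of newly discovered DFA states is omitted):
>>
>> [worklist isEmpty] whileFalse: [ | current |
>>     current := worklist removeFirst.
>>     alphabet do: [:a | | target |
>>         "with full Unicode, 2^21 iterations per DFA state"
>>         target := (self move: current nfaStates on: a) epsilonClosure.
>>         target ifNotEmpty: [:t |
>>             dfa addTransitionFrom: current on: a to: t]]]
>>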
>>> Building (or connecting to) a BDD library would be fun, indeed. But
>>> within that time frame it does not seem realistic. Anyway, after
>>> finishing my thesis, I'd like to come back to that idea.
>>>
>>
>> It would certainly be interesting. Please contact us again when you
>> have time :)
>>
>> Regards,
>>
>> Thierry
>>
>>> Ciao, Steffen
>>>   On 7 November 2017, 16:33:03 CET, Andrew Glynn <aglyn...@gmail.com>
>>> wrote:
>>>      A possible way to accomplish it would be to use an object graph with
>>>     an incremental query engine, such as EMF/CDO with Viatra or
>>>     something similar.  You could then put different character sets in
>>>     connected objects and query only as far as you need to.
>>>      Andrew Glynn
>>>     From: Thierry Goubier <thierry.goub...@gmail.com>
>>>     Sent: Tuesday, November 7, 2017 7:17 AM
>>>     To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>>>     Subject: Re: [Pharo-users] Binary Decision Diagram Package in Smalltalk
>>>      Hi Andrew, Steffen,
>>>      2017-11-07 13:10 GMT+01:00 Prof. Andrew P. Black <bl...@cs.pdx.edu>:
>>>            > On 28 Oct 2017, at 17:37, Steffen Märcker <merk...@web.de> wrote:
>>>          >
>>>          > Does that mean the sets/BDDs would be constructed mainly at
>>>         compile time? Anyway, Andrew, feel free to contact me; I might
>>>         help you with this.
>>>          >
>>>          Thanks for the offer, Steffen!  The problem is that I need to
>>>         use SmaCC for my current project, and really do not have a month
>>>         to take off and re-design the way that it builds its scanner.
>>>      I’ve talked to Thierry Goubier about it, and he doesn’t have time
>>>         either!  It would be a fun project, though, and it ought to be
>>>         fairly separate from other parts of SmaCC.  I’ve spent a fair
>>>         bit of time thinking about how to do it, but don’t think that I
>>>         will be able to actually focus on it.
>>>      Yes, this is the essence of the issue. There are a few alternatives
>>>     about it, but none we have the time to pursue.
>>>          An alternative approach, which Thierry has suggested, is to make
>>>         SmaCC work on the UTF-8 representation of Unicode.  Then we
>>>         could represent character sets as prefix trees.  But the core
>>>         problem would still exist: you can’t run an algorithm that
>>>         repeatedly executes
>>>                           for all characters in the alphabet do:
>>>          when there are 2^21 characters in the alphabet!
>>>      The main issue is that `for all characters`... All the literature on
>>>     scanner building uses 'for all characters do'.
>>>      Thierry
>>>                    Andrew
>>>
>>>
>>
>
