Hi Steffen, in fact, I considered B) directly, but not up to the point of building C) and D) on top of it (with an obvious A, but that one is a given in the Pharo implementation of strings).
Thanks for the explanation, then.

Regards,
Thierry

2017-11-08 14:21 GMT+01:00 Steffen Märcker <merk...@web.de>:
> I see. How about the following (sketched) solution to avoid looping over
> all characters? It might very well be the case that you already considered
> (and dismissed) that path.
>
> A) Assumption
> In order to allow any meaningful matching, the input to the scanner is
> normalized according to the Unicode spec.
>
> B) Abstraction
> Treat each character and character group of a regex as a set of intervals
> in the Unicode code points. Let's call them "character tests" and lift the
> common set operations union, intersection and difference to them.
>
> C) Construct NFA
> The NFA has potentially overlapping character tests at the transitions of
> each state.
>
> D) Construct DFA
> Given a product state s in the DFA and two transitions t1, t2 from the
> original NFA, add three new transitions to the DFA:
> - a transition labeled with the character test of t1 minus the character
>   test of t2
> - a transition labeled with the intersection of the character tests of t1
>   and t2
> - a transition labeled with the character test of t2 minus the character
>   test of t1
>
> E) Extension
> Instead of sets of Unicode intervals we could also use test functions,
> e.g. blocks. Then, in step D), the set operations are translated to
> boolean operations:
> - difference t1 - t2 becomes: t1 && not t2
> - intersection of t1 and t2 becomes: t1 && t2
> This would allow using optimized test functions, e.g. BDDs, instead of
> relying on character tests only.
>
> Cheers,
> Steffen
>
> Am .11.2017, 23:16 Uhr, schrieb Thierry Goubier <thierry.goub...@gmail.com>:
>
>> Le 07/11/2017 à 23:00, Steffen Märcker a écrit :
>>
>>> I am not familiar with the literature on scanners. May I ask you about
>>> some details on the "for all characters" algorithms you are referring to?
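[Editor's note: Steffen's steps B) and D) can be made concrete. The following is a minimal Python sketch (hypothetical names, not SmaCC or Pharo code) that represents a character test as a sorted list of half-open code-point intervals [lo, hi), and uses interval intersection/difference to split two overlapping NFA transitions into the three disjoint DFA transitions described in step D).]

```python
# Hypothetical sketch of steps B) and D): a "character test" is a sorted
# list of disjoint half-open intervals (lo, hi) of Unicode code points.

def intersect(a, b):
    """Intersection of two interval lists (step D, middle transition)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def subtract(a, b):
    """Difference a minus b of two interval lists (step D, outer transitions).
    In step E this would instead be the block test: t1 && not t2."""
    out = []
    for lo, hi in a:
        cur = lo
        for blo, bhi in b:
            if bhi <= cur or blo >= hi:
                continue  # no overlap with the remaining piece
            if blo > cur:
                out.append((cur, blo))
            cur = max(cur, bhi)
            if cur >= hi:
                break
        if cur < hi:
            out.append((cur, hi))
    return out

def split_transitions(t1, t2):
    """Step D): replace two possibly overlapping character tests by
    three pairwise-disjoint ones."""
    return subtract(t1, t2), intersect(t1, t2), subtract(t2, t1)
```

For example, splitting t1 = A–Z (code points 65–90) against t2 = P–z (80–122) yields the three disjoint tests 65–79, 80–90, and 91–122; no per-character loop is needed, only interval arithmetic.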
>>
>> The two main ones available, from the typical Aho/Ullman textbook, are:
>>
>> - NFA to DFA conversion (i.e. build an NFA with your regular expressions,
>>   then convert it to a DFA)
>>
>> - Direct regular expression to DFA construction
>>
>> Both of them have a loop of the type:
>>
>>     for (each input symbol a) {
>>         ...
>>
>>> Building (or connecting to) a BDD library would be fun, indeed. But
>>> within that time frame it seems not realistic. Anyway, after finishing
>>> my thesis, I'd like to come back to that idea.
>>
>> It would certainly be interesting. Please contact us again when you have
>> time :)
>>
>> Regards,
>>
>> Thierry
>>
>>> Ciao, Steffen
>>>
>>> Am 7. November 2017 16:33:03 MEZ schrieb Andrew Glynn <aglyn...@gmail.com>:
>>> A possible way to accomplish it would be to use an object graph with
>>> an incremental query engine, such as EMF/CDO with Viatra or
>>> something similar. You could then put different character sets in
>>> connected objects and query only as far as you need to.
>>>
>>> Andrew Glynn
>>>
>>> *From:* Thierry Goubier <mailto:thierry.goub...@gmail.com>
>>> *Sent:* Tuesday, November 7, 2017 7:17 AM
>>> *To:* Any question about pharo is welcome <mailto:pharo-users@lists.pharo.org>
>>> *Subject:* Re: [Pharo-users] Binary Decision Diagram Package in Smalltalk
>>>
>>> Hi Andrew, Steffen,
>>>
>>> 2017-11-07 13:10 GMT+01:00 Prof. Andrew P. Black <bl...@cs.pdx.edu>:
>>> > On 28 Oct 2017, at 17:37, Steffen Märcker <merk...@web.de> wrote:
>>> >
>>> > Does that mean the sets/bdd would be constructed mainly at
>>> > compile time? Anyway, Andrew, feel free to contact me, I might
>>> > help you with this.
>>>
>>> Thanks for the offer, Steffen!
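[Editor's note: the textbook loop Thierry refers to can be illustrated as follows. This is a Python sketch of the classic subset construction (not SmaCC code; epsilon transitions are omitted for brevity). The marked inner loop is the bottleneck under discussion: with full Unicode, `alphabet` would hold roughly 2^21 symbols and the loop would run for each DFA state.]

```python
# Illustrative sketch of the textbook NFA-to-DFA subset construction,
# assuming an epsilon-free NFA represented as a dict mapping
# (state, symbol) -> set of successor states.

def move(states, symbol, nfa):
    """Set of NFA states reachable from `states` on `symbol`."""
    out = set()
    for s in states:
        out |= nfa.get((s, symbol), set())
    return frozenset(out)

def subset_construction(start, nfa, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    todo = [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue  # already expanded
        dfa[current] = {}
        for a in alphabet:  # <-- the 'for each input symbol a' loop
            nxt = move(current, a, nfa)
            if nxt:
                dfa[current][a] = nxt
                if nxt not in dfa:
                    todo.append(nxt)
    return dfa
```

For instance, an NFA for `(a|b)*ab` with transitions `{(0,'a'): {0,1}, (0,'b'): {0}, (1,'b'): {2}}` yields a three-state DFA over the alphabet `['a', 'b']`. The interval-based character tests Steffen sketches would replace this per-symbol loop with a loop over the (few) disjoint tests leaving the current state set.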
>>> The problem is that I need to use SmaCC for my current project, and
>>> really do not have a month to take off and re-design the way that it
>>> builds its scanner. I've talked to Thierry Goubier about it, and he
>>> doesn't have time either! It would be a fun project, though, and it
>>> ought to be fairly separate from other parts of SmaCC. I've spent a
>>> fair bit of time thinking about how to do it, but don't think that I
>>> will be able to actually focus on it.
>>>
>>> Yes, this is the essence of the issue. There are a few alternatives
>>> about it, but none we have the time to pursue.
>>>
>>> An alternative approach, which Thierry has suggested, is to make
>>> SmaCC work on the UTF-8 representation of Unicode. Then we could
>>> represent character sets as prefix trees. But the core problem would
>>> still exist: you can't run an algorithm that repeatedly executes
>>>
>>>     for all characters in the alphabet do:
>>>
>>> when there are 2^21 characters in the alphabet!
>>>
>>> The main issue is that `for all characters`... All the literature on
>>> scanner building uses 'for all characters do'.
>>>
>>> Thierry
>>>
>>> Andrew
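[Editor's note: the UTF-8 prefix-tree idea mentioned above can be sketched as follows. This is a hedged Python illustration, not the proposed SmaCC implementation: a character set becomes a trie over the UTF-8 byte sequences of its members, so the scanner's effective alphabet shrinks from ~2^21 code points to at most 256 byte values per trie level.]

```python
# Sketch: represent a character set as a prefix tree (trie) over the
# UTF-8 encodings of its members. Nested dicts keyed by byte values;
# the key None marks the end of a complete character.

def add_char(trie, ch):
    """Insert one character's UTF-8 byte sequence into the trie."""
    node = trie
    for b in ch.encode('utf-8'):
        node = node.setdefault(b, {})
    node[None] = True  # end-of-character marker

def accepts(trie, ch):
    """Membership test: walk the character's bytes through the trie."""
    node = trie
    for b in ch.encode('utf-8'):
        if b not in node:
            return False
        node = node[b]
    return None in node

# A small set mixing 1-, 2-, and 3-byte UTF-8 sequences.
charset = {}
for c in ('a', 'é', '€'):
    add_char(charset, c)
```

A byte-driven scanner can then branch on one byte at a time, which keeps the "for each input symbol" loops bounded by 256, at the cost of multi-state paths for multi-byte characters.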