Paddy wrote: > Kay Schluehr wrote: > > I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a > > regular expression sx from it, such that sx.match(s) yields a SRE_Match > > object when s starts with an s_i for one i in [0,...,n]. There might > > be relations between those strings: s_k.startswith(s_1) -> True or > > s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa', > > ...,'aaaa...ab']. For this reason SRE_Match should provide the longest > > possible match. > > > > Is there a Python module able to create an optimized regex rx from ls > > for the given constraints? > > > > Regards, > > Kay > > A start would be: > regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")" > But the above does not work if you have special characters in your > strings.
For special characters there might be a workaround using escapes. This is indeed important, but I do think one should split the problem into separate parts. > You say you want something that is optimised. What have have you tried? Sorting the list and checking for successor inclusions. Say you have ls = ['x','a', 'aa', 'aab' ,'ab'] This can be mapped to: 'x|a(?:(?:ab)?|b?|a?)' or to: '^(x|ab|aab|aa|a)' with reverse sorting as in your proposal.The naive solution is easy to generate but I'm sceptical about its cost effectiveness. On the other hand I do not want to investigate this matter if somebody else already did it thoroughly. Regards, Kay -- http://mail.python.org/mailman/listinfo/python-list