Same. One day, Python will have a decent parsing library.
On Friday, March 31, 2017 at 4:21:51 AM UTC-4, Stephan Houben wrote:
>
> Hi all,
>
> FWIW, I also strongly prefer the Verbal Expression style and consider
> "normal" regular expressions to become quickly unreadable and
> unmaintainable.
>
> Verbal Expressions are also much more composable.
>
> Stephan
>
> 2017-03-31 9:23 GMT+02:00 Stephen J. Turnbull <turnbull....@u.tsukuba.ac.jp>:
> > Abe Dillon writes:
> >
> > > Note that the entire documentation is 250 words while just the syntax
> > > portion of Python docs for the re module is over 3000 words.
> >
> > Since Verbal Expressions (below, VEs, indicating notation) "compile"
> > to regular expressions (spelling out indicates the internal matching
> > implementation), the documentation of VEs presumably ignores
> > everything except the limited language it's useful for. To actually
> > understand VEs, you need to refer to the RE docs. Not a win, IMO.
> >
> > > > You think that example is more readable than the proposed translation
> > > >
> > > >     ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
> > > >
> > > > which is better written
> > > >
> > > >     ^https?://(www\.)?[^ ]*$
> > > >
> > > > or even
> > > >
> > > >     ^https?://[^ ]*$
> > >
> > > Yes. I find it *far* more readable. It's not a soup of symbols like Perl
> > > code. I can only surmise that you're fluent in regex because it seems
> > > difficult for you to see how the above could be less readable than
> > > English words.
> >
> > Yes, I'm fairly fluent in regular expression notation (below, REs).
> > I've maintained a compiler for one dialect.
> >
> > I'm not interested in the difference between words and punctuation,
> > though. The reason I find the middle RE most readable is that it
> > "looks like" what it's supposed to match, in a contiguous string, as
> > the object it will match will be contiguous. If I need to parse it to
> > figure out *exactly* what it matches, yes, that takes more effort.
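For what it's worth, the three patterns quoted above really do agree on ordinary inputs; here is a quick stdlib-only sanity check (the sample URLs are made up for illustration):

```python
import re

# The three equivalent patterns quoted above, from most to least explicit.
verbose = re.compile(r"^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$")
cleaner = re.compile(r"^https?://(www\.)?[^ ]*$")
minimal = re.compile(r"^https?://[^ ]*$")

for url in ("https://www.python.org", "http://example.com/path"):
    assert verbose.match(url) and cleaner.match(url) and minimal.match(url)

# The ^...$ anchors make each pattern reject strings that are not
# *only* a URL, which is the behavior the VE example advertises.
assert minimal.match("see https://python.org for details") is None
```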
> > But to understand a VE's semantics correctly, I'd have to look it up
> > as often as you have to look up REs, because many words chosen to notate
> > VEs have English meanings that are (a) ambiguous, as in all natural
> > language, and (b) only approximate matches to RE semantics.
> >
> > > I could tell it only matches URLs that are the only thing inside
> > > the string because it clearly says: start_of_line() and
> > > end_of_line().
> >
> > That's not the problem. The problem is the semantics of the method
> > "find". "then" would indeed read better, although it doesn't exactly
> > match the semantics of concatenation in REs.
> >
> > > I would have had to refer to a reference to know that "^" doesn't
> > > always mean "not", it sometimes means "start of string" and
> > > probably other things. I would also have to check a reference to
> > > know that "$" can mean "end of string" (and probably other things).
> >
> > And you'll still have to do that when reading other people's REs.
> >
> > > > Are those groups capturing in Verbal Expressions? The use of
> > > > "find" (~ "search") rather than "match" is disconcerting to the
> > > > experienced user.
> > >
> > > You can alternately use the word "then". The source code is just
> > > one Python file. It's very easy to read. I actually like "then"
> > > over "find" for the example:
> >
> > You're missing the point. The reader does not get to choose the
> > notation; the author does. I do understand what several varieties of
> > RE mean, but the variations are of two kinds: basic versus extended
> > (i.e., which tokens need to be escaped to be taken literally, and which
> > ones have special meaning if escaped), and extensions (which can be
> > ignored). Modern RE facilities are essentially all of the extended
> > variety. Once you've learned that, you're in good shape for almost
> > any RE that should be written outside of an obfuscated code contest.
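The "^" point above is real but narrow: in Python's re, "^" is an anchor outside a character class and negation only inside one, so a short check settles it:

```python
import re

# Outside a character class, "^" anchors at the start of the string.
assert re.search(r"^abc", "abcdef")
assert re.search(r"^abc", "xabcdef") is None

# Inside a character class, "^" negates: [^0-9] means "not a digit".
assert re.search(r"[^0-9]+", "abc123").group() == "abc"
```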
> > This is a fundamental principle of Python design: don't make readers
> > of code learn new things. That includes using notation developed
> > elsewhere in many cases.
> >
> > > > What does alternation look like?
> > >
> > > .OR(option1).OR(option2).OR(option3)...
> > >
> > > > How about alternation of non-trivial regular expressions?
> > >
> > > .OR(other_verbal_expression)
> >
> > Real examples, rather than pseudo code, would be nice. I think you,
> > too, will find that examples of even fairly simple nested alternations
> > containing other constructs become quite hard to read, as they fall
> > off the bottom of the screen.
> >
> > For example, the VE equivalent of
> >
> >     scheme = "(https?|ftp|file):"
> >
> > would be (AFAICT):
> >
> >     scheme = VerEx().then(VerEx().then("http")
> >                                   .maybe("s")
> >                                   .OR("ftp")
> >                                   .OR("file"))
> >                     .then(":")
> >
> > which is pretty hideous, I think. And the colon is captured by a
> > group. If perversely I wanted to extract that group from a match,
> > what would its index be?
> >
> > I guess you could keep the linear arrangement with
> >
> >     scheme = (VerEx().add("(")
> >                      .then("http")
> >                      .maybe("s")
> >                      .OR("ftp")
> >                      .OR("file")
> >                      .add(")")
> >                      .then(":"))
> >
> > but is that really an improvement over
> >
> >     scheme = VerEx().add("(https?|ftp|file):")
> >
> > ;-)
> >
> > > > As far as I can see, Verbal Expressions are basically a way of
> > > > making it so painful to write regular expressions that people
> > > > will restrict themselves to regular expressions
> > >
> > > What's so painful to write about them?
> >
> > One thing that's painful is that VEs "look like" context-free
> > grammars, but clumsy and without the powerful semantics. You can get
> > the readability you want with greater power using grammars, which is
> > why I would prefer we work on getting a parser module into the stdlib.
> >
> > But if one doesn't know about grammars, it's still not great.
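For comparison, the plain-RE version of the scheme example is one line, and the capturing-group question has an obvious answer (group 1 is the scheme name; the sample URLs below are made up):

```python
import re

# The scheme pattern from the quote above; group 1 captures the scheme.
scheme = re.compile(r"(https?|ftp|file):")

m = scheme.match("ftp://host/file.txt")
assert m is not None and m.group(1) == "ftp"

m = scheme.match("https://python.org")
assert m is not None and m.group(1) == "https"
```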
> > The main pains about writing VEs for me are (1) reading what I just
> > wrote, (2) accessing capturing groups, and (3) verbosity. Even a VE to
> > accurately match what is normally a fairly short string, such as the
> > scheme, credentials, authority, and port portions of a "standard" URL,
> > is going to be hundreds of characters long and likely dozens of lines
> > if folded as in the examples.
> >
> > Another issue is that we already have a perfectly good poor man's
> > matching library: glob. The URL example becomes
> >
> >     http{,s}://{,www.}*
> >
> > Granted, you lose the anchors, but how often does that matter? You
> > apparently don't use them often enough to remember them.
> >
> > > Does your IDE not have autocompletion?
> >
> > I don't want an IDE. I have Emacs.
> >
> > > I find REs so painful to write that I usually just use string
> > > methods if at all feasible.
> >
> > Guess what? That's the right thing to do anyway. They're a lot more
> > readable and efficient when partitioning a string into two or three
> > parts, or recognizing a short list of affixes. But chaining many
> > methods, as VEs do, is not a very Pythonic way to write a program.
> >
> > > > I don't think that this failure to respect the developer's taste
> > > > is restricted to this particular implementation, either.
> > >
> > > I generally find it distasteful to write a pseudolanguage in
> > > strings inside of other languages (this applies to SQL as well).
> >
> > You mean like arithmetic operators? (Lisp does this right, right?
> > Only one kind of expression, the function call!) It's a matter of
> > what you're used to. I understand that people new to text-processing,
> > or who don't do so much of it, don't find REs easy to read. So how is
> > this a huge loss? They don't use regular expressions very often! In
> > fact, they're far more likely to encounter, and possibly need to
> > understand, REs written by others!
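The "just use string methods" advice above is easy to demonstrate: pulling the scheme off a URL needs no RE at all, just str.partition (the sample URL is made up):

```python
# Simple splits read better with string methods than with an RE.
url = "https://www.python.org/about"
scheme, sep, rest = url.partition("://")

assert sep == "://"   # an empty sep would mean "://" was not found
assert scheme == "https"
assert rest == "www.python.org/about"
```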
> > > Especially when the design principles of that pseudolanguage are
> > > *diametrically opposed* to the design principles of the host
> > > language. A key principle of Python's design is: "you read code a
> > > lot more often than you write code, so emphasize
> > > readability". Regex seems to be based on: "Do the most with the
> > > fewest keystrokes.
> >
> > So is all of mathematics. There's nothing wrong with concise
> > expression for use in special cases.
> >
> > > Readability be damned!". It makes a lot more sense to wrap the
> > > pseudolanguage in constructs that bring it in line with the host
> > > language than to take on the mental burden of trying to comprehend
> > > two different languages at the same time.
> > >
> > > If you disagree, nothing's stopping you from continuing to write
> > > REs the old-fashioned way.
> >
> > I don't think that RE and SQL are "pseudo" languages, no. And I, and
> > most developers, will continue to write regular expressions using the
> > much more compact and expressive RE notation. (In fact, with the
> > exception of the "word" method, in VEs you still need to use RE notation
> > to express most of the Python extensions.) So what you're saying is
> > that you don't read much code, except maybe your own. Isn't that your
> > problem? Those of us who cooperate widely on applications using
> > regular expressions will continue to communicate using REs. If that
> > leaves you out, that's not good. But adding VEs to the stdlib (and
> > thus encouraging their use) will split the community into RE users and
> > VE users, if VEs are at all useful. That's bad. I don't see
> > the potential usefulness of VEs to infrequent users of regular
> > expressions outweighing the downsides of "many ways to do it" in the
> > stdlib.
> >
> > > Can we at least agree that baking special re syntax directly into
> > > the language is a bad idea?
> >
> > I agree that there's no particular need for RE literals.
> > If one wants to mark an RE as some special kind of object,
> > re.compile() does that very well, both by converting to a different
> > type internally and as a marker syntactically.
> >
> > > On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan <ncog...@gmail.com> wrote:
> > >
> > > > We don't really want to ease the use of regexps in Python - while
> > > > they're an incredibly useful tool in a programmer's toolkit,
> > > > they're so cryptic that they're almost inevitably a
> > > > maintainability nightmare.
> >
> > I agree with Nick. Regular expressions, whatever the notation, are a
> > useful tool (no suspension of disbelief necessary for me, though!).
> > But they are cryptic, and it's not just the notation. People (even
> > experienced RE users) are often surprised by what fairly simple
> > regular expressions match in a given text, because people want to read
> > a regexp as instructions to a one-pass greedy parser, and it isn't.
> >
> > For example, above I wrote
> >
> >     scheme = "(https?|ftp|file):"
> >
> > rather than
> >
> >     scheme = "(\w+):"
> >
> > because it's not unlikely that I would want to treat those differently
> > from other schemes such as mailto, news, and doi. In many
> > applications of regular expressions (such as tokenization for a
> > parser) you need many expressions. Compactness really is a virtue in
> > REs.
> >
> > Steve
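The difference between the two scheme patterns above is easy to see in practice: the explicit alternation rejects schemes like mailto, while `(\w+):` accepts any word (the sample URI is made up):

```python
import re

# Explicit alternation only accepts the three named schemes;
# (\w+): accepts any word followed by a colon.
explicit = re.compile(r"(https?|ftp|file):")
anything = re.compile(r"(\w+):")

assert explicit.match("mailto:guido@example.com") is None
assert anything.match("mailto:guido@example.com").group(1) == "mailto"
```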
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/