[Python-ideas] Re: Regex timeouts

Chris Angelico Tue, 15 Feb 2022 16:28:48 -0800

On Wed, 16 Feb 2022 at 10:15, Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
>
> > scanf just isn't powerful enough.  For example, consider parsing user
> > input dates: scanf("%d/%d/%d", &year, &month, &day).  This is nice and
> > simple, but handling "2022-02-15" as well requires a bit of thinking
> > and several extra statements in C.  In Python, I guess it would
> > probably look something like
> >
> >     year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
> >     if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
> >         raise DateFormatUnacceptableError
> >     # range checks for month and day go here
>
> Assuming that scanf raises if there is no match, I would probably go
> with:


Having scanf raise is one option; another option would be to have it
return a partial result, which would raise ValueError when unpacked in
this simple way. (Partial results are FAR easier to debug than a
simple "didn't match", plus they can be extremely useful in some
situations.)
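
A rough sketch of the partial-result idea, assuming a hypothetical
scanf_partial() helper (nothing by that name exists in the stdlib) that
only understands %d and literal text, and that takes the input string
as an explicit second argument:

```python
import re

_INT = re.compile(r"[-+]?\d+")

def scanf_partial(fmt, text):
    """Hypothetical helper: return the ints matched before the first mismatch."""
    results = []
    pos = 0
    # Splitting on a capturing group keeps the %d markers in the piece list.
    for piece in re.split(r"(%d)", fmt):
        if piece == "%d":
            m = _INT.match(text, pos)
            if m is None:
                break  # no integer here: stop and keep what we have so far
            results.append(int(m.group()))
            pos = m.end()
        elif text.startswith(piece, pos):
            pos += len(piece)  # literal text (possibly empty) matched
        else:
            break  # literal text mismatched: stop with a partial result
    return tuple(results)

try:
    year, month, day = scanf_partial("%d/%d/%d", "2022-02-15")
except ValueError:
    # Unpacking a one-element tuple into three names raises ValueError,
    # and the partial tuple shows how far the match got.
    print("partial result:", scanf_partial("%d/%d/%d", "2022-02-15"))
```

The caller who wants the simple behaviour just unpacks and gets the
exception; the caller who wants to debug can look at the tuple.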

>     try:
>         # Who writes ISO-8601 dates using slashes?
>         day, month, year = scanf("%d/%d/%d")
>         if ALLOW_TWO_DIGIT_YEARS and len(year) == 2:
>             year = "20" + year
>     except ScanError:
>         year, month, day = scanf("%d-%d-%d")

It all depends on what your goal is. Do you want to support multiple
different formats (d/m/y, y-m-d, etc)? Do you want one format with
multiple options for delimiter? Is it okay if someone mismatches
delimiters?

Most likely, I'd not care if someone uses y/m-d, but I wouldn't allow
d/m/y or m/d/y, so I'd write it like this:

year, month, day = scanf("%d%*[-/]%d%*[-/]%d")
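
Since that scanf doesn't exist in Python, here is roughly the same
thing with the re module, as a sketch only. The parse_ymd name is made
up for illustration, and sign and leading-whitespace handling for %d
are left out to keep it short:

```python
import re

# Rough counterpart of "%d%*[-/]%d%*[-/]%d": three integers separated by
# runs of '-' or '/'. The separators are matched but not captured, which
# is what the '*' (assignment suppression) does in the scanf format.
DATE_RE = re.compile(r"(\d+)[-/]+(\d+)[-/]+(\d+)")

def parse_ymd(text):
    """Hypothetical helper: parse y-m-d with either delimiter, mixed or not."""
    m = DATE_RE.match(text)
    if m is None:
        raise ValueError(f"not a y-m-d date: {text!r}")
    year, month, day = map(int, m.groups())
    return year, month, day

print(parse_ymd("2022-02-15"))
print(parse_ymd("2022/02-15"))  # mismatched delimiters are accepted here
```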

But realistically, if we're doing actual ISO 8601 date parsing, then
*not one of these is correct*, and we should be using an actual ISO
8601 library :) The simple cases like log file parsing are usually
consuming the output of exactly one program, so you can mandate the
delimiter completely. Here's something that can parse the output of
'git blame':

commit, name, y, mo, d, h, mi, s, tz, line, text = \
    scanf("%s (%s %d-%d-%d %d:%d:%d %d %d) %s")

(Of course, you should use --porcelain instead, but this is an example.)
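
For comparison, a sketch of the same parse with the re module. The
sample line below is made up, the layout is only assumed to follow the
default 'git blame' output, and parse_blame_line is a hypothetical
name. Note that C's %s stops at whitespace, so an author name with a
space in it needs something looser, as here:

```python
import re

# Assumed layout: "<commit> (<author> YYYY-MM-DD HH:MM:SS +ZZZZ <lineno>) <text>"
BLAME_RE = re.compile(
    r"(\S+) \((.+?) +(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+) ([-+]\d+) +(\d+)\) ?(.*)"
)

def parse_blame_line(line):
    """Hypothetical helper: split one blame line into its fields."""
    m = BLAME_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognised blame line: {line!r}")
    commit, name, y, mo, d, h, mi, s, tz, lineno, text = m.groups()
    # The timezone is kept as a string; converting "-0800" to an int loses
    # the leading zero.
    return (commit, name, int(y), int(mo), int(d),
            int(h), int(mi), int(s), tz, int(lineno), text)

sample = "1a2b3c4d (Chris Angelico 2022-02-15 16:28:48 -0800 42) print('hello')"
print(parse_blame_line(sample))
```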

There's a spectrum of needs, and a spectrum of tools that can fulfil
them. At one extreme, simple method calls, the "in" operator, etc -
very limited, very fast, easy to read. At the other extreme, full-on
language parsers with detailed grammars. In between? Well, sscanf is a
bit simpler than regexp, REXX's parse is probably somewhere near
sscanf, SNOBOL is probably a bit to the right of regexp, etc, etc,
etc. We shouldn't have to stick to a single tool just because it's
capable of spanning a wide range.

> I think that
>
>     year, sep1, month, sep2, day = \
>         re.match(r"(\d+)([-/])(\d+)([-/])(\d+)").groups()
>
> might do it (until Tim or Chris tell me that actually is wrong).
>
> Or use \2 as you suggest later on.

Yeah, \2 much more clearly expresses the intent of "take either of
these characters, and then match another of that character".
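
A minimal sketch of the backreference version. The split_date name is
just for illustration, and the input string is passed explicitly:

```python
import re

# Group 2 captures the first separator; \2 then requires the second
# separator to be that same character.
DATE_RE = re.compile(r"(\d+)([-/])(\d+)\2(\d+)")

def split_date(text):
    """Hypothetical helper: accept y-m-d or y/m/d, but not a mixture."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"unacceptable date format: {text!r}")
    year, _sep, month, day = m.groups()
    return int(year), int(month), int(day)

print(split_date("2022-02-15"))
print(split_date("2022/02/15"))
try:
    split_date("2022/02-15")
except ValueError as exc:
    print("rejected:", exc)
```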

ChrisA
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/PKQXJNACY3RMI4DAN2OTQDBLPUMSLZ67/
Code of Conduct: http://python.org/psf/codeofconduct/
