[Python-ideas] Re: Fwd: Re: Fwd: re.findfirst()

Kyle Stanley Fri, 06 Dec 2019 13:22:18 -0800

Serhiy Storchaka wrote:
> Thank you Kyle for your investigation!

No problem, this seemed like an interesting feature proposal and I was
personally curious about the potential use cases. Thanks for the detailed
analysis, I learned a few new things from it. (:


Serhiy Storchaka wrote:
> -       clerk_name = name_re.findall(clerk)[0]
> +       clerk_name = name_re.search(clerk).group(1)

This pattern seems to be common across most of the above examples (minus
the last two), specifically replacing ``re.findall()[0]`` with
``re.search().group(1)`` when there is a subgroup within the regex or
``re.search().group()`` without subgroups.
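
As a rough sketch of that substitution (the patterns and sample text below
are made up for illustration, they are not taken from the linked projects):

```python
import re

# Hypothetical pattern and input, chosen only to illustrate the rewrite.
name_re = re.compile(r"Clerk: (\w+)")
clerk = "Clerk: Alice, Clerk: Bob"

# With one subgroup, findall() returns that group's text for every match.
via_findall = name_re.findall(clerk)[0]
# search() stops at the first match; group(1) is the same subgroup.
via_search = name_re.search(clerk).group(1)

# Without subgroups, findall() returns whole matches, so group() is the
# equivalent call on the match object.
word_re = re.compile(r"[A-Z]\w+")
whole_findall = word_re.findall(clerk)[0]
whole_search = word_re.search(clerk).group()
```

Both spellings produce the same string in each pair; the difference is that
search() does not keep scanning after the first match.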

Serhiy Storchaka wrote:
> It seems that in most cases the author just do not know about
> re.search(). Adding re.findfirst() will not fix this.

That's definitely possible, but it might be just as likely that they saw
re.findall() as simpler to use than re.search(). Although it performs
substantially worse when parsing decent amounts of text (assuming the first
match isn't at the end), ``re.findall()[0]`` *consistently* returns the
first string that was matched, as long as no subgroups were used. This
allows them to avoid match objects entirely, which makes it a bit easier to
learn, especially for those who are less familiar with OOP or who are
already familiar with other popular flavors of regex (such as JS).
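
To make the "returns a string" point concrete, here is a small sketch of
what re.findall() hands back for different numbers of subgroups, and how the
two idioms fail when nothing matches. The sample text and patterns are
invented for the example:

```python
import re

text = "alpha=1, beta=2"

# No subgroups: a list of the matched strings.
no_groups = re.findall(r"\w+=\d", text)
# One subgroup: a list of that group's strings.
one_group = re.findall(r"(\w+)=\d", text)
# Two or more subgroups: a list of tuples, so [0] is no longer a string.
two_groups = re.findall(r"(\w+)=(\d)", text)

# When nothing matches, findall() returns an empty list and [0] raises.
try:
    re.findall(r"gamma", text)[0]
    findall_error = None
except IndexError as exc:
    findall_error = type(exc).__name__

# search() returns None instead, so calling .group() on it would raise
# AttributeError unless the result is checked first.
search_result = re.search(r"gamma", text)
```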

I'll admit this is mostly speculation, but I think there's an especially
large number of re users (compared to other modules) that aren't
necessarily developers, and might just be someone who wants to write a
script to quickly parse some documents. These types of users are the ones
who would likely benefit the most from the proposed re.findfirst(),
particularly if it directly returns a string as Guido is suggesting.
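
For illustration only, one way such a helper could behave is sketched below.
Nothing like this exists in the re module today; the name, the signature and
the ``default`` parameter are all assumptions on my part, and the return
value simply mirrors what ``findall()[0]`` would give:

```python
import re


def findfirst(pattern, string, flags=0, default=None):
    """Hypothetical helper: what findall()[0] would return, else default."""
    m = re.search(pattern, string, flags)
    if m is None:
        # Unlike findall()[0], a miss does not raise IndexError.
        return default
    # findall() reports groups that did not participate as '', not None.
    groups = m.groups(default="")
    if not groups:
        return m.group()   # no subgroups: the whole match
    if len(groups) == 1:
        return groups[0]   # one subgroup: that group's text
    return groups          # several subgroups: a tuple of strings
```

Because it is built on re.search(), scanning stops at the first match and no
intermediate list is created.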

I think at the end of the day, the critical question to answer is this:

*Do we want to add a new helper function that's easy to use, consistent,
and provides good performance for finding the first match, even if the
functionality already exists within the module?*

Personally, I lean a bit more towards "yes", but I think that "no" would
also be a reasonable response.

From my perspective, a significant reason why Python is appealing to so
many users that aren't professional developers is that it's much easier to
pick up the basics. Python allows users to write a quick script with
*decent* performance without having to learn too much, compared to most
other mainstream programming languages. IMO, the addition of
re.findfirst() would help to reinforce that appeal.

Another option to consider might be adding a boolean parameter to
re.search() that changes the behavior to directly return a string instead
of a match object, similar to re.findall() when there are not multiple
subgroups. For example:

>>> re.search(r" (\w) ", "there is a one letter word in the middle",
...           match_obj=False)
'a'

The above would have the same exact return value as
``pattern.findall()[0]``, but it's more efficient since it would only parse
the text until the first match is found, and it doesn't need to create a
list. For backwards compatibility, this parameter would default to True.
Feel free to change the name if you like the idea, "match_obj" was simply
the first one that came to my head.
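
Since re.search() has no such parameter today, the idea can only be tried
out through a wrapper. The sketch below is one possible reading of it; the
wrapper and the handling of multiple subgroups (which I left open above) are
assumptions made for the example:

```python
import re


def search(pattern, string, flags=0, *, match_obj=True):
    """Hypothetical wrapper; re.search() itself has no match_obj parameter."""
    m = re.search(pattern, string, flags)
    if match_obj or m is None:
        return m  # the default behaviour is unchanged
    groups = m.groups(default="")
    if len(groups) > 1:
        # Assumption: return a tuple here, mirroring findall().
        return groups
    # One subgroup returns that group, no subgroups returns the whole match.
    return groups[0] if groups else m.group()


result = search(r" (\w) ", "there is a one letter word in the middle",
                match_obj=False)
```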

The cons of this solution are that it might excessively overload
re.search(), and that it might not be as noticeable or easy to find as a
new function. But it could provide the same functionality as the proposed
re.findfirst() without adding an entirely new function for behavior that
already exists.

On Fri, Dec 6, 2019 at 2:47 AM Serhiy Storchaka <[email protected]> wrote:

> 05.12.19 23:47, Kyle Stanley wrote:
> >
> > Serhiy Storchaka wrote:
> >  > We still do not know a use case for findfirst. If the OP would show
> >  > his code and several examples in others code this could be an
> >  > argument for usefulness of this feature.
> >
> > I'm not sure about the OP's exact use case, but using GitHub's code
> > search for .py files that match with "first re.findall" shows a decent
> > amount of code that uses the format ``re.findall()[0]``. It would be
> > nice if GitHub's search properly supported symbols and regular
> > expressions, but this presents a decent number of examples. See
> > https://github.com/search?l=Python&q=first+re.findall&type=Code.
> >
> > I also spent some time looking for a few specific examples, since there
> > were a number of false positives in the above results. Note that I
> > didn't look much into the actual purpose of the code or judge it based
> > on quality, I was just looking for anything that seemed remotely
> > practical and contained something along the lines of
> > ``re.findall()[0]``. Several of the links below contain multiple lines
> > where findfirst would likely be a better alternative, but I only
> > included one permalink per code file.
>
> Thank you Kyle for your investigation!
>
> >
> https://github.com/MohamedAl-Hussein/my_projects/blob/15feca5254fe1b2936d39369365867496ce5b2aa/fifa_workspace/fifa_market_analysis/fifa_market_analysis/items.py#L325
>
> It is easy to rewrite it using re.search().
>
> -         input_processor=MapCompose(lambda x: re.findall(r'pointDRI = ([0-9]+)', x)[0], eval),
> +         input_processor=MapCompose(lambda x: re.search(r'pointDRI = ([0-9]+)', x).group(1), eval),
>
> I also wonder if it is worth replacing eval with the more efficient and
> safer int.
>
>
> >
> https://github.com/MohamedAl-Hussein/FIFA/blob/2b1390fe46f94648e5b0bcfd28bc67a3bc43f09d/fifa_data/fifa_data/items.py#L370
>
> It is the same code differently formatted.
>
> >
> https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab0d5e3379b04588c/new_jersey.py#L82
>
> -       clerk_name = name_re.findall(clerk)[0]
> +       clerk_name = name_re.search(clerk).group(1)
>
>
> >
> https://github.com/democracyworks/dog-catcher/blob/9f6200084d4505091399d36ab0d5e3379b04588c/connecticut.py#L182
>
> -     official_name = name_re.findall(town)[0].title()
> +     official_name = name_re.search(town).group().title()
>
>
> >
> https://github.com/jessyL6/CQUPTHUB-spiders_task1/blob/db73c47c0703ed01eb2a6034c37edd9e18abb2e0/ZhongBiao2/spiders/zhongbiao2.py#L176
>
> -             first_1_results = re.findall(first_1,all_list9)[0]
> +             first_1_results = re.search(first_1,all_list9).group(1)
>
>
>
> >
> https://github.com/kerinin/giscrape/blob/d398206ed4a7e48e1ef6afbf37b4f98784cf2442/giscrape/spiders/people_search.py#L26
>
> It is a complex example which performs multiple searches with different
> regular expressions. It can all be replaced with a single, more
> efficient regular expression.
>
> -   if re.search('^(\w+) (\w+)$', parcel.owner):
> -     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
> -   elif re.search('^(\w+) (\w+) (\w+)$', parcel.owner):
> -     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
> -   elif re.search('^(\w+) (\w+) &amp; (\w+)$', parcel.owner):
> -     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
> -   elif re.search('^(\w+) (\w+) (\w+) &amp: (\w+)$', parcel.owner):
> -     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
> -   elif re.search('^(\w+) (\w+) &amp; (\w+) (\w+)$', parcel.owner):
> -     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
> -   elif re.search('^(\w+) (\w+) (\w+) &amp: (\w+) (\w+)$', parcel.owner):
> -     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
> -   elif re.search('^(\w+) (\w+) &amp; (\w+) (\w+) (\w+)$', parcel.owner):
> -     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
> -   elif re.search('^(\w+) (\w+) (\w+) &amp: (\w+) (\w+) (\w+)$', parcel.owner):
> -     last, first, middle = re.findall( '(\w+) (\w+) (\w+)', parcel.owner     )[0]
>
> +   m = re.fullmatch('(\w+) (\w+)(?: (\w+))?(?: &amp;(?: \w+){1,3})?', parcel.owner)
> +   if m:
> +     last, first, middle = m.groups()
>
>
> >
> https://github.com/songweifun/parsebook/blob/529a86739208e9dc07abbb31363462e2921f00a0/dao/parseMarc.py#L211
>
> This is the only example which checks if findall() returns an empty
> list. It calls findall() twice! Fortunately it can be easily optimized
> using the fact that Match objects support subscripting. I used group()
> above because it is more explicit and works in older Python.
>
> -             self.item.first_tutor_name = REGPX_A.findall(value)[0] if REGPX_A.findall(value) else ''
> +             self.item.first_tutor_name = (REGPX_A.search(value) or [''])[0]
>
>
> It seems that in most cases the author just do not know about
> re.search(). Adding re.findfirst() will not fix this.

_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/6YAYAM6TK4AZ566NMVFXSQRRRUSA6IYF/
Code of Conduct: http://python.org/psf/codeofconduct/
