[Python-ideas] Re: Regex timeouts

J.B. Langston Tue, 15 Feb 2022 08:25:58 -0800

Tim Peters wrote:
> """
> Some people, when confronted with a problem, think “I know, I'll use
> regular expressions.”  Now they have two problems.
> - Jamie Zawinski
> """


Maybe so, but I'm committed now :).  I have dozens of regexes to parse specific 
log messages I'm interested in. I made a little DSL that uses regexes with 
capture groups, and if the regex matches, takes the resulting groupdict and 
optionally applies further transformations on the individual fields. This 
allows me to very concisely specify what I want to extract before doing further 
analysis and aggregation on the resulting fields.  For example:

flush_end = Rule(
    Capture(
        # Completed flushing
        #   /u01/data02/tb_tbi_project02_prd/data_launch_index-4a5f72725b7211eaab635720a1b8a299/aa-26507-bti-Data.db
        #   (46.528MiB) for commitlog position
        #   CommitLogPosition(segmentId=1615955816662, position=223538288)
        # Completed flushing
        #   /dse/data02/OpsCenter/rollup_state-7b621931ab7511e8b862810a639403e5/bb-21969-bti-Data.db
        #   (7.763MiB/2.197MiB on disk/1 files) for commitlog position
        #   CommitLogPosition(segmentId=1637403836277, position=9927158)
        r"Completed flushing (?P<sstable>[^ ]+) "
        r"\((?P<bytes_flushed>[^)/]+)"
        r"(/(?P<bytes_on_disk>[^ ]+) on disk/(?P<file_count>[^ ]+) files)?\) "
        r"for commitlog position "
        r"CommitLogPosition\(segmentId=(?P<commitlog_segment>[^,]+), "
        r"position=(?P<commitlog_position>[^)]+)\)"
    ),
    Convert(
        normval,
        "bytes_flushed",
        "bytes_on_disk",
        "commitlog_segment",
        "commitlog_position",
    ),
    table_from_sstable,
)

I know there are specialized tools like logstash but it's nice to be able to 
specify the extraction and subsequent analysis together in Python. 
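
To give a rough idea of how the pieces fit together, here is a minimal sketch. 
It is not the real implementation: the class bodies below are assumptions made 
up for illustration, and it uses plain int as the converter instead of normval.

```python
import re
from typing import Callable, Optional


class Capture:
    """Search a line with a regex; return its named groups or None."""

    def __init__(self, pattern: str):
        self.regex = re.compile(pattern)

    def __call__(self, line: str) -> Optional[dict]:
        match = self.regex.search(line)
        return match.groupdict() if match else None


class Convert:
    """Apply func to each listed field that was actually captured."""

    def __init__(self, func: Callable, *fields: str):
        self.func = func
        self.fields = fields

    def __call__(self, fields: dict) -> dict:
        for name in self.fields:
            # Optional groups that did not participate are None; skip them.
            if fields.get(name) is not None:
                fields[name] = self.func(fields[name])
        return fields


class Rule:
    """Run a Capture, then pass the groupdict through each transformation."""

    def __init__(self, capture: Capture, *transforms: Callable):
        self.capture = capture
        self.transforms = transforms

    def __call__(self, line: str) -> Optional[dict]:
        fields = self.capture(line)
        if fields is None:
            return None
        for transform in self.transforms:
            fields = transform(fields)
        return fields


# Hypothetical rule covering only the tail of the flush message.
position = Rule(
    Capture(
        r"segmentId=(?P<commitlog_segment>[^,]+), "
        r"position=(?P<commitlog_position>[^)]+)\)"
    ),
    Convert(int, "commitlog_segment", "commitlog_position"),
)

result = position("CommitLogPosition(segmentId=1637403836277, position=9927158)")
no_match = position("some unrelated log line")
```

Here result is a dict with both fields converted to int, and no_match is None 
because the Capture did not match.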

> reason to change that. Naive regexps are both clumsy and prone to bad
> timing in many tasks that "should be" very easy to express. For
> example, "now match up to the next occurrence of 'X'". In SNOBOL and
> Icon, that's trivial. 75% of regexp users will write ".*X", with scant
> understanding that it may match waaaay more than they intended.
> Another 20% will write ".*?X", with scant understanding that may
> extend beyond _just_ "the next" X in some cases. That leaves the happy
> 5% who write "[^X]*X", which finally says what they intended from the
> start.
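
For concreteness, the three variants can be compared on a toy string. This is 
only a sketch; the sample strings are made up for illustration.

```python
import re

text = "aXbXc"

# Greedy: runs to the LAST X in the string.
greedy = re.match(r".*X", text).group()

# Lazy: stops at the first X when nothing else follows in the pattern.
lazy = re.match(r".*?X", text).group()

# Negated class: can never cross an X, so it stops at the first one.
negated = re.match(r"[^X]*X", text).group()

# Once more pattern follows the X, the lazy form is allowed to step over
# the first X to make the rest match; the negated class is not.
lazy_overrun = re.match(r".*?X\d", "aXbX1")
negated_strict = re.match(r"[^X]*X\d", "aXbX1")

print(greedy, lazy, negated)
print(lazy_overrun.group() if lazy_overrun else None)
print(negated_strict)
```

The last two lines are the "extend beyond just the next X" case: the lazy 
pattern matches the whole string, while the negated-class pattern refuses to 
match at all.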

If you look at the regex in my example above, you will see that "[^X]*X" is 
exactly what I did. The pathological case arose from a simple typo: I had an 
extra + after a capture group and failed to notice it. The regex somehow 
worked correctly on the expected input but ran forever when the expected 
terminating character appeared more times than expected in the input string.
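
A small reconstruction of that kind of typo, as a sketch only: the patterns and 
inputs below are hypothetical stand-ins, not the actual regex or log line 
involved. The input is kept short so the slow case still finishes quickly; each 
extra character roughly doubles the work for the pattern with the stray +.

```python
import re
import time

# Intended pattern: one run of non-")" characters inside the parentheses.
FIXED = re.compile(r"\((?P<size>[^)]+)\) done")
# Same pattern with a stray "+" after the capture group (the typo).
TYPO = re.compile(r"\((?P<size>[^)]+)+\) done")


def timed_search(regex, text):
    """Return (match, elapsed_seconds) for a single search."""
    start = time.perf_counter()
    match = regex.search(text)
    return match, time.perf_counter() - start


good = "(46.528MiB) done"
# Matches up to ")" and then fails, forcing the matcher to backtrack.
bad = "(" + "1" * 16 + ") oops"

good_fixed, _ = timed_search(FIXED, good)
good_typo, _ = timed_search(TYPO, good)
bad_fixed, t_fixed = timed_search(FIXED, bad)
bad_typo, t_typo = timed_search(TYPO, bad)

print("good input:", good_fixed.group("size"), good_typo.group("size"))
print(f"bad input: fixed {t_fixed:.6f}s, typo {t_typo:.6f}s")
```

On the good input both patterns capture the same text, which is why the typo 
went unnoticed. On the bad input both correctly return None, but the typo'd 
pattern first tries every way of splitting the run of digits between 
repetitions of the group.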
