Re: [Python-ideas] New explicit methods to trim strings

Cameron Simpson Tue, 02 Apr 2019 15:58:28 -0700

On 02Apr2019 12:23, Paul Moore <[email protected]> wrote:

On Tue, 2 Apr 2019 at 12:07, Rhodri James <[email protected]> wrote:

So far we have two slightly dubious use-cases.


1. Stripping file extensions.  Personally I find that treating filenames
like filenames (i.e. using os.path or (nowadays) pathlib) results in me
thinking more appropriately about what I'm doing.


I'd go further and say that filename manipulation is a great example
of a place where generic string functions should definitely *not* be
used.

Filename manipulation on a path _component_ is generally pretty reliable (yes, one can break things by, say, inserting os.sep).

I do a fair bit of filename fiddling using string functions, and these fall into 3 categories off the top of my head:


- file extensions, and here I do use splitext()

- trimming extensions (only barely a second case), and it turns out the only case I could easily find using the endswith/[:-offset] incantation would probably go just as well with splitext()

- normalising pathnames; as an example, for the home media library I routinely downcase filenames, convert whitespace into a dash, separate fields with "--" (eg episode designator vs title) and convert _ into a colon (hello Mac Finder and UI file save dialogues, a holdover compatibility mode from OS9)

None of these seem to benefit directly from having a cutprefix/cutsuffix method. But splitext aside, I'm generally fiddling a pathname component (and usually a basename), and in that domain the general string functions are very handy and well used.

So I think "filename" (basename) fiddling with str methods is actually pretty reasonable. It is _pathname_ fiddling that is hazardous, because the path separators often need to be treated specially.
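As a rough sketch of the distinction (the basename below is made up for illustration), splitext() handles the extension and ordinary str methods handle the normalisation of the remaining component:

```python
import os.path

# Hypothetical basename, not taken from any real library.
name = "Holiday_Video Part 1.MP4"

# Let os.path.splitext() find the extension rather than slicing by hand.
root, ext = os.path.splitext(name)

# Plain str methods are fine on a single path component:
# downcase, turn runs of whitespace into a dash, turn "_" into a colon.
normalised = "-".join(root.lower().split()).replace("_", ":")

print(root, ext, normalised)
```

None of this touches path separators, which is why it stays safe on a basename.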

2. Stripping prefixes and suffixes to get to root words.  Python has
been used for natural language work for over a decade, and I don't think
I've heard any great call from linguists for the functionality.  English
isn't a girl who puts out like that on a first date :-)  There are too
many common exception cases for such a straightforward approach not to
cause confusion.


Agreed, using prefix/suffix stripping on natural language is at best a
"quick hack".

Yeah. I was looking at the prefix list from a related article and seeing "intra" and thinking "intractable". Hacky indeed. _Unless_ the word has already been qualified as suitable for the action. And once it is, a cutprefix method would indeed be handy.

3. My most common use case (not very common at that) is for stripping
annoying prompts off text-based APIs.  I'm happy using .startswith() and
string slicing for that, though your point about the repeated use of the
string to be stripped off (or worse, hard-coding its length) is well made.

In some ways the verbosity and bug-proneness of the longhand form are my personal use case for cutprefix/cutsuffix (however spelt):

- repeating the string is wordy and requires human eyeballing whenever I read it (to check for correctness); the same applies whenever I write such a piece of code - personally I'm quite prone to off-by-one errors when hand writing variations on this

- a well named method is more readable and expresses intent better (the same argument holds for a standalone function, though a method is a bit better)

- the anecdotally not uncommon misuse of .strip() where .cutsuffix() would be correct
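To make the .strip() misuse concrete, here is a small illustration (the filename is invented). rstrip() treats its argument as a set of characters, not as a suffix, so it can eat into the part you meant to keep:

```python
filename = "report.txt"  # invented example value

# Misuse: strips any trailing run of the characters '.', 't' and 'x',
# which also removes the final 't' of "report".
wrong = filename.rstrip(".txt")

# The longhand form that a cutsuffix method would replace.
suffix = ".txt"
right = filename[:-len(suffix)] if filename.endswith(suffix) else filename

print(wrong, right)
```

The misuse happens to work for many inputs, which is what makes it easy to miss in review.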

I confess being a little surprised at how few examples which could use cutsuffix I found in my own code, where I had expected it to be common.


I find several bits like this:

    # parsing text which may have \r\n line endings
    if line.endswith('\r'):
      line = line[:-1]

    # parsing a UNIX network interface listing from ifconfig,
    # which varies platform to platform
    if ifname.endswith(':'):
      ifname = ifname[:-1]

Here I DO NOT want rstrip() because I want to strip only one character, rather than as many as there are. So: the optional trailing marker in some input. But doing this for single character markers is much easier to get right than the broader case with longer suffixes, so I think this is not a very strong case.
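The difference between the two behaviours can be sketched as follows; trim_one() is a hypothetical helper name used only for this illustration, and the input values are invented:

```python
def trim_one(s, marker):
    """Hypothetical helper: remove at most one trailing `marker` from `s`."""
    if marker and s.endswith(marker):
        return s[:-len(marker)]
    return s

line = "payload\r\r"  # invented input with two trailing carriage returns

all_gone = line.rstrip("\r")     # removes every trailing '\r'
one_gone = trim_one(line, "\r")  # removes only the last one

print(repr(all_gone), repr(one_gone))
```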


Fiddling the domain suffix on an email address:

    if not addr.endswith(old_domain):
      raise ValueError('addr does not end in old_domain')
    addr2 = addr[:-len(old_domain)] + new_domain

which would be a good fit, _except_ for the sanity check. However, that sanity check is just one of a few preceding the change, so in fact this is a good fit.

I have a few classes which annotate their instances with some magic attributes. Here's a snippet from a class' __getattr__ for a db schema:


    if attr.endswith('_table'):
      # *_table ==> table "*"
      nickname = attr[:-6]
      if nickname in self.table_by_nickname:

There's a little suite of "match attribute suffix, trim and do something specific with what's left" if statements. However, they are almost all of the form above, so rewriting it like this:


    if attr.endswith('_table'):
      # *_table ==> table "*"
      nickname = attr.cutsuffix('_table')
      if nickname in self.table_by_nickname:

is a small improvement. Every magic number (the "6" above) is an opportunity for bugs.
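Since str has no cutsuffix method as of this writing, a runnable approximation needs a stand-in function. The sketch below makes that assumption; the Schema class, its constructor and the table values are all invented for illustration and are not the actual class the snippet came from:

```python
def cutsuffix(s, suffix):
    """Hypothetical stand-in for the proposed str.cutsuffix()."""
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

class Schema:
    """Invented minimal class, just enough to exercise __getattr__."""

    def __init__(self, table_by_nickname):
        self.table_by_nickname = table_by_nickname

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails.
        # *_table ==> table "*"
        nickname = cutsuffix(attr, '_table')
        if nickname != attr and nickname in self.table_by_nickname:
            return self.table_by_nickname[nickname]
        raise AttributeError(attr)

schema = Schema({'users': 'USERS'})
print(schema.users_table)
```

The suffix is written once and no length is hard coded, so there is no "6" to get wrong.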

I am beginning to worry slightly that actually there are usually more
appropriate things to do than simply cutting off affixes, and that in
providing these particular batteries we might be encouraging poor practise.


It would be really helpful if someone could go through the various use
cases presented in this thread and classify them - filename
manipulation, natural language uses, and "other".

Surprisingly for me, the big subjective win is avoiding misuse of lstrip/rstrip by having obvious better named alternatives for affix trimming.

Short summary: in my own code I find opportunities for an affix trim method less common than I had expected. But I still like the "might find it useful" argument.

I think I find "might find it useful" more compelling than many do. Let me explain.

I think a _well_ _defined_ battery is worth including in the kit (str methods) because:

- the operation is simple and well defined: people won't be confused by its purpose, and when they want it there is a reliable debugged method sitting there ready for use

- variations on this get written _all the time_, and writing those variations using the method is more readable and more reliable


- the existing .strip battery is misused for this purpose by accident

I have in the past found myself arguing for adding little tools like this in agile teams, and getting a lot of resistance. The resistance tended to take these forms:

- YAGNI. While the tiny battery _can_ be written longhand, every time that happens makes for needlessly verbose code, is an opportunity for stupid bugs, and makes code whose purpose must be _deduced_ rather than doing what it says on the tin

- not in this ticket: this leads to a starvation issue - the battery never goes in with any ticket, and a ticket just for the battery never gets chosen for a sprint

- we've already got this other battery; subtext "not needed" or "we don't want 2 ways to do this", my subtext "does it worse, or does something which only _looks_ like this purpose". Classic example from the codebase I was in at the time was SQL parameter insertion. Eventually I said "... this" and wrote the battery anyway.

My position on cut*affix is that (a) it is easy to implement (b) it can thus be debugged once (c) it makes code clearer when used (d) it reduces the likelihood of .strip() misuse.
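As an illustration of points (a) and (b), here is one possible sketch as standalone functions. The names and the "return the string unchanged if the affix is absent" semantics are assumptions for this example, not a settled design. It also shows the kind of bug that only needs fixing once: with an empty suffix the longhand slice s[:-len(suffix)] becomes s[:0] and throws the whole string away.

```python
def cutprefix(s, prefix):
    """Return s without a leading prefix, or s unchanged if absent."""
    if s.startswith(prefix):
        return s[len(prefix):]
    return s

def cutsuffix(s, suffix):
    """Return s without a trailing suffix, or s unchanged if absent."""
    # Guard against an empty suffix: s[:-0] is s[:0], the empty string.
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

# The pitfall in the hand-written form, with an empty suffix:
naive = "abc"[:-len("")]

print(cutprefix("prompt> ls", "prompt> "), cutsuffix("abc", ""), repr(naive))
```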


Cheers,
Cameron Simpson <[email protected]>
_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
