On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:
I'd like to improve the code below, which works. It feels clunky to me.

I need to clean up user-uploaded files the size of which I don't know in
advance.

After cleaning they might be as big as 1Mb but that would be super rare.
Perhaps only for testing.

I'm extracting CAS numbers and here is the pattern xx-xx-x up to
xxxxxxx-xx-x eg., 1012300-77-4

def remove_alpha(txt):

      """  r'[^0-9\- ]':

      [^...]: Match any character that is not in the specified set.

      0-9: Match any digit.

      \: Escape character.

      -: Match a hyphen.

      Space: Match a space.

      """

      cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)

      bits = cleaned_txt.split()

      pieces = []

      for bit in bits:

          # minimum size of a CAS number is 7 so drop smaller clumps of digits

          pieces.append(bit if len(bit) > 6 else "")

      return " ".join(pieces)


Many thanks for any hints

Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to