Thomas 'PointedEars' Lahn wrote:

> Peter Otten wrote:
>
>> gesh...@gmail.com wrote:
>>> how to write a function taking a string parameter, which returns it
>>> after you delete the spaces, punctuation marks, accented characters in
>>> python ?
>>
>> Looks like you want to remove more characters than you want to keep. In
>> this case I'd decide what characters to keep first, e. g. (assuming
>> Python 3)
>
> However, with *that* approach (which is different from the OP’s request),
> regular expression matching might turn out to be more efficient:
>
> -----------------------------------------------------------
> import re
> print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
> -----------------------------------------------------------
>
> With the OP’s original request, they may still be the better approach.
> For example:
>
> ----------------------------------------------------------------------
> import re
> print("".join(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
>                      flags=re.IGNORECASE)))
> ----------------------------------------------------------------------
>
> or
>
> ----------------------------------------------------------------------
> import re
> print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
>                          flags=re.IGNORECASE)))
> ----------------------------------------------------------------------
>
>>>>> import string
>>>>> keep = string.ascii_letters + string.digits
>>>>> keep
>> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>>
>> Now you can iterate over the characters and check if you want to preserve
>> it for each of them:
>
> The good thing about this part of the approach you suggested is that you
> can build regular expressions from strings, too:
>
>   keep = '[' + 'a-z' + r'\d' + ']'
>
>>>>> def clean(s, keep):
>> ...     return "".join(c for c in s if c in keep)
>> ...
>
> Why would one prefer this over "".filter(lambda: c in keep, s)?

Because it's idiomatic Python and easy to understand if you are coming from
the imperative

buf = []
for c in s:
    if c in keep:
        buf.append(c)
"".join(buf)

Because it uses Python syntax instead of the filter/map/reduce trio.

Because it avoids the extra function call (the lambda), though the speed
difference is not as big as I expected:

$ python3 -m timeit -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' '"".join(filter(lambda c: c in keep, s))'
100 loops, best of 3: 4.66 msec per loop

$ python3 -m timeit -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' '"".join(c for c in s if c in keep)'
100 loops, best of 3: 3.11 msec per loop

For reference, here is a variant using regular expressions (picked at random,
feel free to find a faster one):

$ python3 -m timeit -s 'import string, re; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; sub=re.compile(r"[^a-zA-Z0-9]+").sub' 'sub("", s)'
1000 loops, best of 3: 1.65 msec per loop

And finally str.translate():

$ python3 -m timeit -s 'import string, collections as c; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; trans = c.defaultdict(lambda: None, str.maketrans(keep, keep))' 's.translate(trans)'
1000 loops, best of 3: 997 usec per loop

>>>>> clean("<alpha> äöü ::42", keep)
>> 'alpha42'
>>>>> clean("<alpha> äöü ::42", string.ascii_letters)
>> 'alpha'
>>
>> If you are dealing with a lot of text you can make this a bit more
>> efficient with the str.translate() method. Create a mapping that maps all
>> characters that you want to keep to themselves
>>
>>>>> m = str.maketrans(keep, keep)
>>>>> m[ord("a")]
>> 97
>>>>> m[ord(">")]
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> KeyError: 62
>>
>> and all characters that you want to discard to None
>
> Why would creating a *larger* list for *more* operations be *more*
> efficient?
>

I don't understand the question.
If you mean that the trans dict may become large -- that depends on the input
data. The characters to be deleted are lazily added to the defaultdict: an
entry is only created the first time translate() looks up a character that is
not in keep. For text in European languages the total size should stay well
below 256 entries.

But you are probably aiming at something else...
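
To make the lazy growth concrete, here is a small sketch of the defaultdict-based table from the timing above. The names text, size_before, size_after and cleaned are only for illustration, and the umlauts are written as escapes so that the sample contains exactly the precomposed characters:

```python
import string
from collections import defaultdict

keep = string.ascii_letters + string.digits

# Characters to keep map to themselves. Any other code point is missing from
# the table, so the default factory supplies None (which translate() treats
# as "delete") and the defaultdict stores that entry as a side effect.
trans = defaultdict(lambda: None, str.maketrans(keep, keep))

# Same sample as in the thread: "<alpha> äöü ::42"
text = "<alpha> \u00e4\u00f6\u00fc ::42"

size_before = len(trans)         # only the characters in keep
cleaned = text.translate(trans)
size_after = len(trans)          # plus one entry per distinct deleted character

print(cleaned)
print(size_before, size_after)
```

The table only grows by one entry per distinct character that actually shows up in the input and is not in keep, so its size is bounded by the variety of the data, not by its length.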