Dennis Lee Bieber writes:

> On Thu, 22 Jun 2017 22:46:28 +0300, Jussi Piitulainen declaimed the
> following:
>
>>
>> A pair of methods, str.maketrans to make a translation table and then
>> .translate on every string, lets you do all that in one step:
>>
>> spacy = r'\/-.[]{}()'
>> tr = str.maketrans(dict.fromkeys(spacy, ' '))
>>
>> ...
>>
>> ln = ln.translate(tr)
>>
>> But those seem to be only in Python 3.
>>
>
>       Well -- I wasn't trying for "production ready" either; mostly
> focusing on the SQLite side of things.

I know, and that's a sound suggestion if the OP is ready for that.

I just like those character translation methods, and I didn't like it
when you first took the time to call a simple regex "line noise" and
then proceeded to post something that looked much noisier yourself.
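
Here is the whole thing as a runnable sketch, for reference (the file
name and the reading loop are assumed context, not the OP's code):

spacy = r'\/-.[]{}()'
tr = str.maketrans(dict.fromkeys(spacy, ' '))

with open('input.csv') as f:        # hypothetical input file
    for ln in f:
        ln = ln.translate(tr)       # every spacy character becomes ' '
        print(ln.split())           # a plain split now yields the words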

>> However, if the OP really is getting their input from a CSV file,
>> they shouldn't need methods like these. Because surely it's then
>> already an unambiguous list of words, to be read in with the csv
>> module? Or else it's not yet CSV at all after all? I think they need
>> to sit down with someone who can walk them through the whole
>> exercise.
>
>       The OP's file extensions said CSV, but there was no sign of the
> csv module being used; worse, it looks like the results file is written
> with no formatting -- each line is just the repr of a (word, count)
> tuple!

Exactly. Too many things like that make me think they are not ready for
more advanced methods.
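
Writing those pairs out cleanly would only take the csv module's writer;
a minimal sketch with made-up names:

import csv

counts = [('the', 42), ('owl', 3)]   # hypothetical (word, count) pairs
with open('counts.csv', 'w', newline='') as f:
    csv.writer(f).writerows(counts)  # one properly formatted row per pair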

>       I'm going out on a limb and guessing the regex being used to
> find words is accepting anything separated by leading/trailing space
> containing a minimum of 3 and maximum of 15 characters in the set
> a..z. So could be missing first and last words on a line if they don't
> have the leading or trailing space, and ignoring "a", "an", "me",
> etc., along with "mrs." [due to .] In contrast, I didn't limit on
> length, and tried to split "look-alike" into "look" and "alike" (and
> given time, would have tried to accept "people's" as a possessive).

I'm not sure I like the splitting of "look-alike" (I'm not sure that I
like not splitting it either), but note that the regex does that for
free.

The \b in the original regex matches the empty string at a position
where there is a "word character" on only one side. It recognizes a
boundary at the beginning of a line and at whitespace, but also at all
the punctuation marks.
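
For example, assuming the original pattern was something close to
r'\b[a-z]{3,15}\b', the boundaries fire at the period and the hyphen:

import re
print(re.findall(r'\b[a-z]{3,15}\b', "mrs. look-alike an owl"))
# ['mrs', 'look', 'alike', 'owl'] -- "an" lost to the length limit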

You guessed right about the length limits. I wouldn't use them, and then
there's no need for the boundary markers any more: my \w+ matches
maximal sequences of word characters (even in languages like Finnish or
French, even in upper case, and digits too).

To also match "people's" and "didn't", use \w+'\w+; to match both with
and without the ', make the trailing part optional: \w+('\w+)?. Except
the notation really does start to become noisy, because one must also
prevent the parentheses from "capturing" the group:

import re
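# re.VERBOSE makes the whitespace inside the pattern insignificant,
# so the pattern below reads as \w+(?:'\w+)?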
wordy = re.compile(r'''  \w+  (?: ' \w+ )? ''', re.VERBOSE)
text = '''
Oliver N'Goma, dit Noli, né le 23 mars 1959 à Mayumba et mort le 7 juin
2010, est un chanteur et guitariste gabonais d'Afro-zouk.
'''

print(wordy.findall(text))

# ['Oliver', "N'Goma", 'dit', 'Noli', 'né', 'le', '23', 'mars', '1959',
# 'à', 'Mayumba', 'et', 'mort', 'le', '7', 'juin', '2010', 'est', 'un',
# 'chanteur', 'et', 'guitariste', 'gabonais', "d'Afro", 'zouk']

Not too bad?

But some punctuation really belongs in words. And some doesn't. And
everything becomes hard: every new heuristic turns out too strict or too
lenient, and things that are not words at all may look like words, or it
may not be clear whether something is a word, or more than one word, or
less than a word, or not like a word at all. Should one be amused?
Should one despair?
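
For instance, allowing ' and - inside words keeps "look-alike" whole,
but the same tweak glues other things together (a sketch, not a
recommendation):

import re
wordier = re.compile(r'''  \w+  (?: ['-] \w+ )*  ''', re.VERBOSE)
print(wordier.findall("look-alike d'Afro-zouk 1959-2010"))
# ['look-alike', "d'Afro-zouk", '1959-2010'] -- a year range as a "word"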

:)