Gilles Ganault <nospam <at> nospam.com> writes:
>
> Hello
>
> Some of the adresses are missing a space between the streetname and
> the ZIP code, eg. "123 Main Street01159 Someville"

This problem appears very similar to the one you had in a previous
episode, where you were deleting <br /> in address contexts where it
obviously should have been treated as importantly as a comma or even
(would you believe) a line break. The example botched output was
"... St Johns WoodLondon ..." IIRC. Prevention is better than cure; try
to find out if your earlier code is causing this problem.

>
> The following regex doesn't seem to work:

Regexes do work. If the outcome is not what you expected, it is your
expectation-to-regex translator that is not working.

What does it do? Does it match zero addresses, all addresses, many
addresses that contain a 5-digit number /followed/ by a space, something
else? Could you use the answer to that question to narrow in on the
problem with your regex?

>
> #Check for any non-space before a five-digit number
> re_bad_address = re.compile('([^\s].)(\d{5}) ',re.I | re.S | re.M)

The comment is quite incorrect. After removing the fog of useless
parentheses, the regex says:

[^\s] -- one non-whitespace character (better written as \S)
.     -- any character (more or less, see later) (why?)
\d{5} -- 5 digits
' '   -- a space (why?)

Then there's a hail of flags:

re.I (ignore case) -- irrelevant
re.S (DOTALL) -- makes your pointless . match any character (instead of
    any character except newline). Do you have any newlines in your
    addresses?
re.M (MULTILINE) -- I'm 99% sure you don't need this either.

>
> I also tried ([^ ].), to no avail.

If not-whitespace doesn't match, changing it to not-space doesn't help.

>
> What is the right way to tell the Python re module to check for any
> non-space character?

r'[^ ]' -- but that's NOT the question you should be asking.

HTH,
John
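
To make the breakdown above concrete, here is a minimal sketch. It first shows that the quoted pattern (minus its flags) matches a correctly spaced address as well as a broken one, because the stray "." happily consumes the space. It then tries a narrower pattern. The names ORIGINAL, MISSING_SPACE and add_missing_space are made up for illustration, and the narrower pattern assumes that a ZIP code is exactly five digits and that the character glued to it is never a digit; treat it as a heuristic, not a general address parser.

```python
import re

# The pattern from the question, minus the flags, written as a raw string.
ORIGINAL = re.compile(r'([^\s].)(\d{5}) ')

# Hypothetical narrower pattern: one character that is neither whitespace
# nor a digit, immediately followed by exactly five digits.
MISSING_SPACE = re.compile(r'([^\s\d])(\d{5})(?!\d)')


def add_missing_space(address):
    """Insert a space between the glued character and the 5-digit code."""
    return MISSING_SPACE.sub(r'\1 \2', address)


bad = "123 Main Street01159 Someville"
good = "123 Main Street 01159 Someville"

# The original pattern cannot tell the two apart: it matches both.
print(ORIGINAL.search(bad) is not None)        # True
print(ORIGINAL.search(good) is not None)       # True

# The narrower pattern flags only the address with the missing space.
print(MISSING_SPACE.search(bad) is not None)   # True
print(MISSING_SPACE.search(good) is not None)  # False

print(add_missing_space(bad))   # 123 Main Street 01159 Someville
```

Dropping the trailing space from the pattern also means a ZIP code at the very end of a string is still found, which the quoted regex would miss.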