Re: Extracting patterns after matching a regex

MRAB Tue, 08 Sep 2009 08:34:29 -0700

Mart. wrote:

On Sep 8, 3:53 pm, MRAB <[email protected]> wrote:

Mart. wrote:

On Sep 8, 3:14 pm, "Andreas Tawn" <[email protected]> wrote:

Hi,
I need to extract a string after matching a regular expression. For
example I have the string...
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
and once I match "FTPHOST" I would like to extract
"e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:
m = re.findall(r"FTPHOST", s)
But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!
Thanks in advance for the help,
Martin

No need for regex.
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
if "FTPHOST" in s:
    host = s[9:]
Cheers,
Drea

Sorry perhaps I didn't make it clear enough, so apologies. I only
presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
thought this easily encompassed the problem. The solution presented
works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated as this also pulls out other information,
for example it prints
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
etc. So I need to find a way to stop it before the \r.
Slicing the string wouldn't work in this scenario, as I can envisage a
situation where the string length increases, and I would prefer not to
keep having to change the string.

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, 
then s[0].split(":")[1].strip() will work.
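One caveat with split(":")[1]: it cuts the value short when the value itself contains a colon, as the FINISHED timestamp does. A minimal sketch that splits only on the first colon is below. The helper name field_value and the two-line sample list are made up for illustration, and it is written in Python 3 syntax.

```python
def field_value(line):
    """Return the text after the first colon, stripped, or None if no colon."""
    key, sep, value = line.partition(":")
    return value.strip() if sep else None

# Two lines in the shape readlines() would return them (hypothetical sample).
lines = [
    "FINISHED: 09/07/2009 08:42:31\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
]

finished = field_value(lines[0])
host = field_value(lines[1])
print(finished)
print(host)
```

Because strip() removes the trailing \r\n as well, nothing after the host name leaks through.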

It is an email which contains information before and after the main
section I am interested in, namely...
FINISHED: 09/07/2009 08:42:31
MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:
I have been doing this to turn the email into a string
email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

To me that seems a strange thing to do. You could just read the entire
file as a string:


     f = open(email, 'r')
     s = f.read()
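Once the text is a plain string, stopping before the \r is a matter of excluding line-ending characters from the captured group. A rough sketch, assuming the email uses \r\n line endings as shown in the thread; get_field and the sample string are illustrative names rather than anything from the original script, and the syntax is Python 3.

```python
import re

# Stand-in for the result of f.read(), with the \r\n endings seen in the thread.
sample = (
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)

def get_field(text, name):
    """Return the value after 'name:' at the start of a line, or None."""
    # [^\r\n]* captures up to, but not including, the end of the line.
    pattern = r"^" + re.escape(name) + r":[ \t]*([^\r\n]*)"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else None

ftphost = get_field(sample, "FTPHOST")
ftpdir = get_field(sample, "FTPDIR")
print(ftphost)
print(ftpdir)
```

Returning None for a missing field avoids the AttributeError that calling .group(1) on a failed re.search would raise.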

so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...
'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
load ZIP file of packaged order:\r\n',
So not sure splitting it like you suggested works in this case.


Within the file is a list of files, e.g.

TOTAL FILES: 2
                FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
                FILESIZE: 11028908

                FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
                FILESIZE: 18975

and what I want to do is get the FTP address from the file and collect
these files to pull down from the web, e.g.

MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml

Thus far I have

#!/usr/bin/env python

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):

        print i, ':', len(m)
        file1 = m[i][:-4]               # remove xml bit.
        file2 = m[i]

        urllib.urlretrieve(url, file1)
        urllib.urlretrieve(url, file2)

which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should
work. Any suggestions on how to improve this are appreciated.
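One thing to check in the loop above: every urlretrieve call is passed the same url (host plus directory), so the file name only affects where the result is saved. A possible sketch of building one URL per file is below. It assumes the listed files sit directly under FTPDIR, which the email does not state outright; download_targets and the sample text are illustrative names, and the syntax is Python 3.

```python
import re

def download_targets(text):
    """Return (url, local_name) pairs for every FILENAME entry in text."""
    # \S+ stops at the first whitespace character, which includes \r and \n.
    host = re.search(r"FTPHOST:\s*(\S+)", text).group(1)
    directory = re.search(r"FTPDIR:\s*(\S+)", text).group(1)
    names = re.findall(r"FILENAME:\s*(\S+)", text)
    # Assumption: the files live directly inside the FTPDIR directory.
    base = "ftp://" + host + directory.rstrip("/") + "/"
    return [(base + name, name) for name in names]

# Stand-in for the email text read with f.read().
sample = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)

targets = download_targets(sample)
for url, name in targets:
    print(url, "->", name)
    # The actual download is left out here; in Python 3 it would be
    # urllib.request.urlretrieve(url, name).
```

Matching on the FILENAME label also removes the need for the fixed-width MOD13A2 pattern, so the script keeps working if the file names change length.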

Suppose the file contains your example text above. Using 'readlines'
returns a list of the lines:

>>> f = open(email, 'r')
>>> lines = f.readlines()
>>> lines

['TOTAL FILES: 2\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE: 11028908\n', '\n', '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE: 18975\n']


Using 'str' on that list then converts it to a string _representation_
of that list:

>>> str(lines)

"['TOTAL FILES: 2\\n', '\\t\\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE: 11028908\\n', '\\n', '\\t\\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE: 18975\\n']"


That just makes parsing a lot more difficult.
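To make the difference concrete, here is a small sketch comparing str() on a list of lines with joining the lines back together. The two-element list is a shortened, made-up stand-in for what readlines() returns.

```python
# Shortened stand-in for the list returned by f.readlines().
lines = ["TOTAL FILES: 2\n", "\t\tFILESIZE: 18975\n"]

# str() gives the printable representation of the list: brackets, quotes,
# commas, and escape sequences spelled out as backslash plus letter.
as_repr = str(lines)

# Joining the lines gives back the original text, the same as f.read().
as_text = "".join(lines)

print(as_repr)
print(as_text)
```

In as_repr the line endings are two literal characters, a backslash followed by n, which is why the pattern in the script above had to match \\r rather than a real carriage return.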

It's much easier to just read the entire file as a single string and
then parse that:

>>> f = open(email, 'r')
>>> s = f.read()
>>> s

'TOTAL FILES: 2\n\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE: 11028908\n\n\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'

>>> import re
>>> re.findall(r"FILENAME: (.+)", s)

['MOD13A2.A2007033.h17v08.005.2007101023605.hdf', 'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
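A caveat if the file really does use \r\n line endings, as the earlier output in this thread suggests, and they are not translated on reading: '.' matches every character except \n, so (.+) would keep a trailing \r on each name. A short sketch of that, and of stripping it off; the sample string is made up to mirror the example above, with \r\n endings.

```python
import re

# Same shape as the example text, but with \r\n line endings.
s = (
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)

# '.' stops at \n but not at \r, so each captured name ends with '\r'.
raw_names = re.findall(r"FILENAME: (.+)", s)

# Stripping whitespace from each match removes the stray carriage return.
names = [name.strip() for name in raw_names]

print(raw_names)
print(names)
```

An alternative with the same effect is to capture [^\r\n]+ instead of .+ in the pattern.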

--
http://mail.python.org/mailman/listinfo/python-list
