Re: RE Module

Anthra Norell Fri, 25 Aug 2006 03:21:15 -0700

Roman,

Your re works for me. I suspect you have tags spanning lines, which happens
more often than not. If so, processing line by line doesn't work: '.' does not
match a newline, so a tag split across two lines is never matched as a whole.
You need to catch the tags like this:

>>> text = re.sub ('<(.|\n)*?>', '', text)
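
To make the difference visible, here is a minimal sketch. The sample string is
made up for illustration, and I write it for a current Python; the re calls
themselves are standard. It compares your original pattern with the two usual
ways of letting a match cross a line break:

```python
import re

# Hypothetical sample: the opening <a ...> tag spans two lines.
text = 'Hello <a\nhref="x.htm">world</a>!'

# '.' does not match a newline by default, so the two-line tag survives
# and only the closing </a> is removed.
dot_only = re.sub(r'<.*?>', '', text)

# (.|\n) matches any character including a newline.
alternation = re.sub(r'<(.|\n)*?>', '', text)

# Compiling with re.DOTALL makes '.' itself match newlines as well.
dotall = re.compile(r'<.*?>', re.DOTALL).sub('', text)

print(repr(dot_only))     # the <a ...> tag is still there
print(repr(alternation))  # both tags gone
print(repr(dotall))       # both tags gone
```

The DOTALL form and the (.|\n) form remove the same text; which one you use is
mostly a matter of taste.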

If your text is reasonably small I would recommend this solution. Otherwise you
might want to take a look at SE, which is a stream editor that does the
buffering for you:

http://cheeseshop.python.org/pypi/SE/2.2%20beta

>>> import SE
>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~="  "~<!--(.|\n)*?-->~=" ')
>>> print Tag_Stripper (text)
(... your text without tags ...)

The Tag_Stripper is made up of two regexes. The second one catches comments,
which may contain nested tags. The first expression alone would also catch
comments, but it would mistake the '>' of the first nested tag for the end of
the comment and stop prematurely. The example
"re.sub ('<(.|\n)*?>', '', text)" above would misbehave in this respect.

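The comment problem can be reproduced with plain re. This is only a sketch with
a made-up sample string; it says nothing about how SE applies its rules
internally. Putting the comment pattern first in an alternation is one way to
get the same effect with re alone:

```python
import re

# Hypothetical sample: a comment that contains a nested tag.
text = 'a <!-- <b>hidden</b> --> b'

# The tag pattern alone stops at the '>' of the nested <b>, so part of
# the comment body and its closing '-->' are left behind.
simple = re.sub(r'<(.|\n)*?>', '', text)

# Trying the comment pattern first removes the whole comment in one match;
# anything else that starts with '<' falls through to the tag pattern.
tag_or_comment = re.compile(r'<!--.*?-->|<.*?>', re.DOTALL)
careful = tag_or_comment.sub('', text)

print(repr(simple))
print(repr(careful))
```

The order of the alternatives matters: with the tag pattern first, the comment
would again be cut off at the first '>'.
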
Your Tag_Stripper takes input from files directly:

>>> Tag_Stripper ('name_of_file.htm', 'name_of_output_file')
'name_of_output_file'

Or, if you want to view the output:

>>> Tag_Stripper ('name_of_file.htm', '')
(... your text without tags ...)
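
If you would rather stay with re for small files, a rough standard-library
counterpart of the file-to-file call might look like the sketch below. The
function name strip_tags_file and the utf-8 encoding are my assumptions, and
unlike SE it reads the whole file into memory instead of buffering:

```python
import re

# Comments first, then ordinary tags; DOTALL lets both span lines.
TAG_OR_COMMENT = re.compile(r'<!--.*?-->|<.*?>', re.DOTALL)

def strip_tags_file(in_path, out_path):
    """Read in_path whole, remove comments and tags, write to out_path."""
    with open(in_path, encoding='utf-8') as src:
        text = src.read()
    with open(out_path, 'w', encoding='utf-8') as dst:
        dst.write(TAG_OR_COMMENT.sub('', text))
    # Return the output name, mirroring the call shown above.
    return out_path
```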

If you want to keep the definitions for later use, do this:

>>> Tag_Stripper.save ('[your_path/]tag_stripper.se')

Your definitions are now saved in the file 'tag_stripper.se'. You can edit that
file. The next time you need a Tag_Stripper you can make it simply by naming
the file:

>>> Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')

You can easily expand the capabilities of your Tag_Stripper. If, for instance,
you want to translate the ampersand escapes (&nbsp; etc.) you'd simply add the
name of the file that defines the ampersand replacements:

>>> Tag_Stripper = SE.SE ('tag_stripper.se  htm2iso.se')

'htm2iso.se' comes with the SE package ready to use and as an example for
writing one's own replacement sets.
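
For comparison, newer Python versions (3.4 and later) can translate the
ampersand escapes with the standard library's html.unescape. The sketch below
is not htm2iso.se and may not produce identical output; the helper name
to_plain_text and the sample string are made up for illustration:

```python
import html
import re

# Comments first, then ordinary tags; DOTALL lets both span lines.
TAG_OR_COMMENT = re.compile(r'<!--.*?-->|<.*?>', re.DOTALL)

def to_plain_text(markup):
    # Strip markup first, then decode the escapes, so that an escaped
    # '&lt;b&gt;' in the text is not mistaken for a real tag.
    return html.unescape(TAG_OR_COMMENT.sub('', markup))

sample = '<p>Tom &amp; Jerry use &lt;b&gt; tags</p>'
plain = to_plain_text(sample)
print(plain)
```

Note that &nbsp; comes out as a non-breaking space (U+00A0), not as an
ordinary blank.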

Frederic

----- Original Message -----
From: "Simon Forman" <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <[email protected]>
Sent: Friday, August 25, 2006 7:09 AM
Subject: Re: RE Module

> Roman wrote:
> > I am trying to filter a column in a list of all html tags.
>
> What?
>
> > To do that, I have setup the following statement.
> >
> > row[0] = re.sub(r'<.*?>', '', row[0])
> >
> > The results I get are sporatic.  Sometimes two tags are removed.
> > Sometimes 1 tag is removed.   Sometimes no tags are removed.  Could
> > somebody tell me where have I gone wrong here?
> >
> > Thanks in advance
>
> I'm no re expert, so I won't try to advise you on your re, but it might
> help those who are if you gave examples of your input and output data.
> What results are you getting for what input strings.
>
> Also, if you're just trying to strip html markup to get plain text from
> a file, "w3m -dump some.html"  works great.  ;-)
>
> HTH,
> ~Simon
>
> --
> http://mail.python.org/mailman/listinfo/python-list

