Plucking Slate.com, a Python example

Bill Janssen Sat, 03 Nov 2001 06:35:25 -0800

The MSN change has affected Slate.com, an online magazine owned by MS.
The re-styling is so bad that I figured I'd start plucking it instead
of looking at it in a browser.  Unfortunately, it's in UTF-8 and
XHTML, and contains a number of the standard "odd" characters.  I
wrote a little csh/Python script to convert it, and thought it might
be of interest to others, if only as an example of how Python 2's
Unicode and regexp support works.


Apparently the various re.sub calls can be amalgamated into one big
one, but my head would explode if I then tried to debug the resulting
RE.
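
For anyone curious what a merged version might look like, here is a
minimal sketch covering only the punctuation replacements, not the
other re.sub calls. It uses one character class plus a lookup table,
and it is written for a current Python 3 rather than the Python 2.1
the script runs under. The names PUNCTUATION, PUNCTUATION_RE and
asciify_punctuation are made up for this illustration and are not
part of the script.

```python
import re

# Replacement table copied from the individual re.sub calls in the script.
PUNCTUATION = {
    u'\u2019': u"'",    # right single quote
    u'\u2018': u"`",    # left single quote
    u'\u201a': u",",    # low single quote
    u'\u201c': u"``",   # left double quote
    u'\u201d': u"''",   # right double quote
    u'\u2013': u"-",    # en dash
    u'\u2014': u"--",   # em dash
    u'\u2026': u"...",  # ellipsis
}

# A single character class that matches any key of the table.
# None of these characters is special inside [...], so no escaping is needed.
PUNCTUATION_RE = re.compile(u'[' + u''.join(PUNCTUATION.keys()) + u']')

def asciify_punctuation(text):
    """Replace every mapped character in one pass over text."""
    return PUNCTUATION_RE.sub(lambda m: PUNCTUATION[m.group(0)], text)
```

Keeping the table separate from the pattern means a new character is
a one-line change, and the RE itself stays short enough to debug.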

Bill

#!/bin/csh

setenv PATH /sbin:/usr/sbin:/usr/bin:/etc:/usr/ccs/bin:/usr/ucb:/usr/openwin/bin
setenv PATH /import/netpbm/sparc-sun-solaris2/bin:$PATH
source /import/python-1.5.2/top/enable
source /import/Plucker/1.1.13/top/enable

/import/python-2.1/python <<'EOF'
import sys, re, time, urllib
input = urllib.urlopen('http://slate.msn.com/toolbar.aspx?id=toc&action=print')
uline = unicode(input.read(), 'utf-8')
input.close()
# first replace various non-Latin-1 punctuation with ASCII equivalents
uline = re.sub(u'\u2019', "'", uline)
uline = re.sub(u'\u2018', "`", uline)
uline = re.sub(u'\u201a', ",", uline)
uline = re.sub(u'\u201c', "``", uline)
uline = re.sub(u'\u201d', "''", uline)
uline = re.sub(u'\u2013', "-", uline)
uline = re.sub(u'\u2014', "--", uline)
uline = re.sub(u'\u2026', "...", uline)
# we don't know about XHTML yet, so translate anchors to HTML
uline = re.sub(u'<a name="#([^"]+)"/>', '<a name="\\1"></a>', uline)
# remove advertisements
uline = re.sub(u'(?si)<HTMLCode.+?</HTMLCode>', '', uline)
# remove any trailing javascript
uline = re.sub(u'(?si)<script.+?</script>', '', uline)
# remove any tracking images
uline = re.sub(u'<img width="0".+?>', '', uline)
# set the title to something meaningful
timestamp = time.strftime("%m/%d/%y, %I:%M %p", time.localtime(time.time()))
uline = re.sub(u'<title>.+?</title>',
               '<title>Slate Magazine, ' + timestamp + '</title>', uline)
uline = re.sub(u'<b>Slate.com</b>',
               '<h1>Slate Magazine</h1><br><center>' + timestamp + '</center>',
               uline, 1)
# change the indicated charset
uline = re.sub(u'charset=utf-8', "charset=iso-8859-1", uline)
# and output the results
output = open('/tmp/slate.html', 'w')
# write it out, replacing any non-Latin-1 characters remaining with '?'
output.write(uline.encode('iso8859-1', 'replace'))
output.close()
'EOF'

plucker-build --verbosity=0 --zlib-compression -H file:/tmp/slate.html \
    -f SlateMagazine -N "Slate Magazine" -M 1 --bpp=0 -p ~/.plucker
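
The last step of the Python part relies on the 'replace' error
handler of encode(). As a small illustration of what that does, here
is a sketch for a current Python 3 (where the result is a bytes
object); the function name to_latin1 and the sample string are
invented for this example and do not come from Slate's pages.

```python
def to_latin1(text):
    """Encode text as ISO 8859-1, turning unencodable characters into '?'."""
    return text.encode('iso8859-1', 'replace')

# U+00E9 and U+00EF exist in Latin-1 and survive as single bytes;
# U+2022 (bullet) and U+20AC (euro sign) do not, so each becomes '?'.
sample = u'caf\u00e9 \u2022 na\u00efve \u20ac5'
encoded = to_latin1(sample)
```

This is why the script maps the common punctuation by hand first:
anything left over is lost to a '?' rather than approximated.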
