Thomas Bress 

> I'm new to Plucker and I'm having trouble getting the following site:

> http://www.accuweather.com/pda/pda_5dy.asp?act=S&thisZip=48128

and then

> http://www.accuweather.com/pda/pda_5dy.asp?act=R&thisZip=48128

Note that these URLs are exactly the same, up until the "?"; I wouldn't
be surprised if you have duplicates even up to the "&".

Sometimes (e.g., CNET) this means they are the same document.
Often (e.g., this example), it doesn't.

PyPlucker assumes they are the same, and refuses to even fetch the 
"duplicate"; it just creates another link to the first.  A patch has been 
submitted, but was never applied.

If you want to fix it in your own copy, edit PyPlucker\TextParser and
change the regular expression for finding attributes.

Near the top, you'll find a line like:

    sgmllib.attrfind = re.compile(
        '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
        + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
        + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9,@./:+*%?!\(\)_#=~]*))?')

Once upon a time, Python's sgmllib did not recognize "@", so Plucker
overrode the regex.  Today, at least in the CVS version at
http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Lib/
Python has fixed all the bugs Plucker found and a few extra.  (So if
your Python is new enough, you can just delete that line from
TextParser.)  Otherwise, replace the pattern with the one Python uses
now:

    attrfind = re.compile(
        r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')
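
A quick way to see the difference is to run both patterns over the same
attribute text.  This is only a sketch: the names old_attrfind,
new_attrfind and attr_value are made up for the illustration, and it
assumes the page writes the link as an unquoted href, which has not been
checked against the actual page.  It uses only re and string, so it does
not need sgmllib itself.

```python
import re
import string

# Pattern that PyPlucker installs over sgmllib.attrfind (copied from above).
old_attrfind = re.compile(
    '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
    + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
    + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9,@./:+*%?!\(\)_#=~]*))?')

# Newer pattern from Python's own sgmllib (copied from above).
new_attrfind = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')


def attr_value(pattern, text):
    """Return the value of the first attribute found in text, or None."""
    match = pattern.match(text)
    return match.group(3) if match else None


# Assumption: the link appears as an unquoted href value.
attrs = ' href=pda_5dy.asp?act=S&thisZip=48128'

old_value = attr_value(old_attrfind, attrs)
new_value = attr_value(new_attrfind, attrs)

print(old_value)  # value is cut off at the "&"
print(new_value)  # value keeps the whole query string
```

With the old pattern the value ends just before the "&"; with the newer
one the full query string survives.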


For reference, the changes the newer pattern makes are:

It allows ":" in the attribute name.  Not needed for HTML, but useful
if you're fetching XML pages as well.

In the attribute value, it now also allows:

&       by convention, separates query variables (and fixes your problem)
;       marks "parameters".  In the WWW standard, but I've never seen it
        used.
$       typically for $SESSION_ID=xxx$ in place of cookies.
'       legal inside a double-quoted value string.  I've seen it used (as
        an apostrophe).
"       legal inside a single-quoted value string.  I've never seen it.

It still does not allow "\", which some URLs use as a Windows-style
path separator.  The Python developers are trying to stay
sort-of-close-to-legal, but accepting "\" might help you on a few
pages.
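
If you want to experiment with that, a minimal sketch follows.  The name
attrfind_backslash and the sample link are hypothetical; the pattern is
the newer one with "\" added to the characters allowed in an unquoted
value.  It only shows that the attribute value is captured whole; whether
the rest of the URL handling copes with the "\" is a separate question.

```python
import re

# Hypothetical variant of the newer pattern: identical except that "\"
# is added to the characters allowed in an unquoted value.
attrfind_backslash = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@\\]*))?')

# Made-up link that uses "\" as a path separator.
match = attrfind_backslash.match(r' href=images\map.gif')
value = match.group(3)

print(value)  # the value now includes the part after the "\"
```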

-jJ