Thomas Bress wrote:
> I'm new to Plucker and I'm having trouble getting the following site:
> http://www.accuweather.com/pda/pda_5dy.asp?act=S&thisZip=48128 and then
> http://www.accuweather.com/pda/pda_5dy.asp?act=R&thisZip=48128

Note that these URLs are exactly the same up until the "?"; I wouldn't be surprised if you have duplicates even up to the "&". Sometimes (e.g., CNET) this means they are the same document. Often (as in this example), it doesn't. PyPlucker assumes they are the same and refuses to even fetch the "duplicate"; it just creates another link to the first.

A patch has been submitted, but was never applied. If you want to fix it in your own copy, edit PyPlucker\TextParser and change the regular expression for finding attributes. Near the top, you'll find a line like:

    sgmllib.attrfind = re.compile(
        '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
        + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
        + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9,@./:+*%?!\(\)_#=~]*))?')

Once upon a time, Python's sgmllib did not recognize "@", so Plucker overrode the regex. Today, at least in the CVS version (http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Lib/), Python has fixed all the bugs Plucker found and a few extra. (So if your Python is new enough, you can just delete that line from TextParser.) Change it to:

    attrfind = re.compile(
        r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')

For reference, the changes this makes are:

  - It allows ":" in the attribute name. Not needed for HTML, but useful if
    you're fetching XML pages as well.
  - In the attribute value, it now also allows:
      &   by convention, separates query variables (and fixes your problem)
      ;   marks "parameters" in the WWW standard, but I've never seen it used
      $   typically for $SESSION_ID=xxx$ in place of cookies
      '   legal inside a double-quoted value string; I've seen it used
          (as an apostrophe)
      "   legal inside a single-quoted value string; I've never seen it
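To see the difference concretely, here is a small standalone comparison (not PyPlucker itself; the sample href string is just for illustration). It compiles both patterns exactly as given above and matches them against an unquoted attribute value containing "&":

```python
import re
import string

# Plucker's old override, as found near the top of TextParser
old_attrfind = re.compile(
    '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
    + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
    + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9,@./:+*%?!\(\)_#=~]*))?')

# The newer pattern from CVS Python's sgmllib
new_attrfind = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')

# An unquoted href value containing "&", as in the AccuWeather links
attrs = ' href=http://www.accuweather.com/pda/pda_5dy.asp?act=S&thisZip=48128'

# Old pattern: the value character class has no "&", so the match
# stops just before it and the query string is silently truncated.
print(old_attrfind.match(attrs).group(3))
# -> http://www.accuweather.com/pda/pda_5dy.asp?act=S

# New pattern: "&" is in the class, so the full value survives.
print(new_attrfind.match(attrs).group(3))
# -> http://www.accuweather.com/pda/pda_5dy.asp?act=S&thisZip=48128
```

With quoted attribute values ("..." or '...') both patterns already capture "&" fine; the truncation only bites on unquoted values, which sloppy HTML uses all the time.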
It still does not add "\", which would allow some additional URLs that use the Windows-based "\" as a separator. They're trying to stay sort-of-close-to-legal, but accepting it might help you on a few pages.

-jJ

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list

