I'm trying to discern why the BusinessWeek site fails to grab the actual articles when entered from the top, but does grab them when entered from one level down. It's not a depth thing... we're talking a depth of 3 and I have it set to 6 for this site.
The only difference in my two INI sections, other than the channel names, is the starting URL and the update base (but I'm forcing update anyhow).

This one works:
    home_url=http://pda.businessweek.com/list/db01.htm

This one fails:
    home_url=http://pda.businessweek.com/

Both have home_maxdepth=6, neither has an exclusion list. And yet they generate amazingly different output.

The db01.htm is the first news link off of the pda.businessweek.com page. It has some articles with very long names... the Seton Hall article is named
    http://pda.businessweek.com/bwdaily/dnflash/sep2002/nf20020920_9297.htm

It is these names that are not getting grabbed-and-processed by Plucker when coming from the top level. And, unfortunately, even at Maximum Detail, the output is truncated... This is what it had to say about that file...

---- 15 collected, 30 to do ----
Processing http://pda.businessweek.com/list/db01.htm...
Retrieved ok.
Not fetching image <SpiderLink Depth: 6/6 MAXWIDTH=None MAXHEIGHT=None BPP=8 URL='http://pda.businessweek.com/common_images/bwsmall.gif' {'_plucker_from_image': 1, '_plucker_id_tag_outoflineimage': 86, 'border': '0', '_plucker_id_tag_inlineimage': 85, 'maxdepth': '6', 'bpp': '8', 'width': '150', 'height': '25', 'src': 'http://pda.businessweek.com/common_images/bwsmall.gif'}> (already fetched)
Not fetching image <SpiderLink Depth: 6/6 MAXWIDTH=None MAXHEIGHT=None BPP=8 URL='http://www.businessweek.com/common_images/bw_1x1.gif' {'_plucker_from_image': 1, '_plucker_id_tag_outoflineimage': 88, 'border': '0', '_plucker_id_tag_inlineimage': 87, 'maxdepth': '6', 'bpp': '8', 'width': '1', 'height': '12', 'src': 'http://www.businessweek.com/common_images/bw_1x1.gif'}> (already fetched)
Parsed ok.

Well, that's good to know; it's stating why it's not grabbing certain images. No mention of any of the six or so URL links on that page, but it knows the depth.
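
For anyone trying to reproduce this, one way to see which links are actually present on a page such as db01.htm, independent of Plucker's own parser, is a small standalone script. What follows is only a minimal sketch in Python 3 using the standard library; it is not Plucker code. The names LinkLister and list_links are made up for illustration, and the sample markup is a hypothetical stand-in, since the real page source is not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkLister(HTMLParser):
    """Collect absolute URLs from <a href=...> tags (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def list_links(html, base_url):
    """Return every anchor target found in html, as absolute URLs."""
    parser = LinkLister(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for the real db01.htm source.
SAMPLE = (
    '<a href="/bwdaily/dnflash/sep2002/nf20020920_9297.htm">Seton Hall</a> '
    '<img src="/common_images/bwsmall.gif">'
)
BASE = "http://pda.businessweek.com/list/db01.htm"

for url in list_links(SAMPLE, BASE):
    print(url)
```

Feeding it the saved source of the real page and comparing its list against the URLs that show up in the Plucker log would at least show whether the article links are visible to an ordinary HTML parser before any depth or exclusion rule is applied.
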

When it does grab pages, the output is like this:

Processing http://pda.businessweek.com/investor/con.....020920_2431.htm...
Retrieved ok.
Parsed ok.

The name is a bit mangled. So my query is this:

1. Any way to get it to output the ENTIRE URL it's working on?
2. Any way to get it to explain why it's skipping some URLs/files?
3. Any clue why, if depth is not an issue and all else is the same, it would grab some files only if it enters at their immediate parent rather than their grandparent?

The entire INI section is:

[BusinessWeek]
bpp=8
copy_to_dir=D:\winsoft\palm\Plucker\output
doc_file=channels/BusinessWeek/BusinessWeek
doc_name=Business Week Online
user=Tony McNamara
home_url=http://pda.businessweek.com/
verbosity=2
referrer=
user_agent=
before_command=
after_command=
home_maxdepth=6
home_stayonhost=0
home_stayondomain=0
home_url_pattern=
exclusion_lists=
charset=
indent_paragraphs=0
anchor_color=#0000FF
alt_text=1
maxwidth=150
maxheight=250
alt_maxwidth=1000000
alt_maxheight=1000000
compression=zlib
image_compression_limit=50
category=Business
no_urlinfo=1
owner_id_build=
copyprevention_bit=0
backup_bit=0
launchable_bit=0
big_icon=
small_icon=
update_enabled=1
update_frequency=6
update_period=hourly
update_base=2002-09-21T21:14:24
close_on_exit=1
close_on_error=1

If you can get the full text of the Seton Hall story, I would love to know your trick.

Thanks
Tony McNamara

_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
