I'm trying to work out why Plucker fails to grab the actual articles from 
the BusinessWeek site when it enters from the top, but does grab them when 
it enters one level down.  It's not a depth thing... we're talking a depth 
of 3 and I have it set to 6 for this site.

The only differences between my two INI sections, other than the channel 
names, are the starting URL and the update base (but I'm forcing an update 
anyhow).
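
That claim could be double-checked mechanically rather than by eye. Below is a minimal Python sketch, not part of Plucker, that reports which keys differ between two sections of the same INI text. The second section name `BusinessWeekList` and the trimmed sample are assumptions made up for illustration; only `BusinessWeek` and the key names come from my real file.

```python
import configparser


def diff_sections(ini_text, first, second):
    """Return {key: (value_in_first, value_in_second)} for keys whose values differ."""
    # interpolation=None so that a literal '%' in a value is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    a = dict(parser[first])
    b = dict(parser[second])
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Hypothetical, trimmed-down sample; the second section name is invented.
SAMPLE = """\
[BusinessWeek]
home_url=http://pda.businessweek.com/
home_maxdepth=6
exclusion_lists=

[BusinessWeekList]
home_url=http://pda.businessweek.com/list/db01.htm
home_maxdepth=6
exclusion_lists=
"""

if __name__ == "__main__":
    differences = diff_sections(SAMPLE, "BusinessWeek", "BusinessWeekList")
    for key, (top, lower) in differences.items():
        print(f"{key}: {top!r} vs {lower!r}")
```

Run against the real file, this should list only the keys that actually differ, which would confirm (or refute) that nothing else distinguishes the two channels.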

This one works:  home_url=http://pda.businessweek.com/list/db01.htm

This one fails: home_url=http://pda.businessweek.com/

Both have home_maxdepth=6 and neither has an exclusion list, yet they 
generate very different output.  db01.htm is the first news link 
off of the pda.businessweek.com page.  It has some articles with very long 
names... the Seton Hall article is named 
http://pda.businessweek.com/bwdaily/dnflash/sep2002/nf20020920_9297.htm

It is these long-named articles that Plucker does not fetch and process 
when it starts from the top level.  And, unfortunately, even at Maximum 
Detail, the URLs in the output are truncated.

This is what it had to say about the db01.htm page:
---- 15 collected, 30 to do ----
Processing http://pda.businessweek.com/list/db01.htm...
   Retrieved ok.
   Not fetching image <SpiderLink Depth: 6/6 MAXWIDTH=None MAXHEIGHT=None 
BPP=8 URL='http://pda.businessweek.com/common_images/bwsmall.gif' 
{'_plucker_from_image': 1, '_plucker_id_tag_outoflineimage': 86, 'border': 
'0', '_plucker_id_tag_inlineimage': 85, 'maxdepth': '6', 'bpp': '8', 
'width': '150', 'height': '25', 'src': 
'http://pda.businessweek.com/common_images/bwsmall.gif'}> (already fetched)
   Not fetching image <SpiderLink Depth: 6/6 MAXWIDTH=None MAXHEIGHT=None 
BPP=8 URL='http://www.businessweek.com/common_images/bw_1x1.gif' 
{'_plucker_from_image': 1, '_plucker_id_tag_outoflineimage': 88, 'border': 
'0', '_plucker_id_tag_inlineimage': 87, 'maxdepth': '6', 'bpp': '8', 
'width': '1', 'height': '12', 'src': 
'http://www.businessweek.com/common_images/bw_1x1.gif'}> (already fetched)
   Parsed ok.

Well, that's good to know; it states why it's not grabbing certain 
images.  But there is no mention of any of the six or so article links on 
that page, even though it clearly knows the depth.

When it does grab pages, the output is like this:
Processing http://pda.businessweek.com/investor/con.....020920_2431.htm...
   Retrieved ok.
   Parsed ok.

The URL is abbreviated, with part of the middle replaced by dots.

So my questions are:

1. Any way to get it to output the ENTIRE URL it's working on?

2. Any way to get it to explain why it's skipping some URLs/files?

3. Any clue why, if depth is not an issue and all else is the same, it 
would grab some files only if it enters at their immediate parent rather 
than their grandparent?
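
In the meantime, one way to see which links a page really contains, independent of Plucker's log, would be to extract them with a small script. This is only a sketch using Python's standard `html.parser` and `urljoin`; it is not Plucker's own parser and may resolve links differently. The sample markup and the `db02.htm` link are hypothetical, and the real page may look nothing like it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative references into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href=...> links in html, in order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical markup standing in for a saved copy of the page.
SAMPLE_HTML = """
<a href="/bwdaily/dnflash/sep2002/nf20020920_9297.htm">Seton Hall</a>
<a href="db02.htm">More news</a>
"""

if __name__ == "__main__":
    base = "http://pda.businessweek.com/list/db01.htm"
    for link in extract_links(SAMPLE_HTML, base):
        print(link)
```

Comparing such a list with the "Processing" lines in the log would at least show which links were present on the page but never fetched.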

The entire INI section is:
[BusinessWeek]
bpp=8
copy_to_dir=D:\winsoft\palm\Plucker\output
doc_file=channels/BusinessWeek/BusinessWeek
doc_name=Business Week Online
user=Tony McNamara
home_url=http://pda.businessweek.com/
verbosity=2
referrer=
user_agent=
before_command=
after_command=
home_maxdepth=6
home_stayonhost=0
home_stayondomain=0
home_url_pattern=
exclusion_lists=
charset=
indent_paragraphs=0
anchor_color=#0000FF
alt_text=1
maxwidth=150
maxheight=250
alt_maxwidth=1000000
alt_maxheight=1000000
compression=zlib
image_compression_limit=50
category=Business
no_urlinfo=1
owner_id_build=
copyprevention_bit=0
backup_bit=0
launchable_bit=0
big_icon=
small_icon=
update_enabled=1
update_frequency=6
update_period=hourly
update_base=2002-09-21T21:14:24
close_on_exit=1
close_on_error=1

If you can get the full text of the Seton Hall story, I would love to know 
your trick.

Thanks
        Tony McNamara

_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
