On Tue, 13 Mar 2001, Malcolm Austen wrote:
+ On Mon, 12 Mar 2001, Gilles Detillieux wrote:
+
+ + Well, if you have a simple test set of data that produces this problem
+ + in 3.1.5, then please do bore us with it. Even though there have been
+ + substantial changes for 3.2, much of those have been backported to 3.1.5,
+ + so if the problem remains in 3.1.x but not in 3.2.x, I'd like to know what
+ + the cause is, so we can address this if/when we start working on 3.1.6.
Gilles, (and anyone else who cares to try to resolve this!!)
I got worried yesterday afternoon that I was not going to be able to
reproduce the fault without indexing 20,000 documents. Fortunately I did
manage it with just one server and (just) under 600 documents.
I have indexed (config file at the end of this message) with a hopcount
of one and then again with a hopcount of two. The second run reports some
299 documents with a hopcount of 1 that were not indexed in the first run.
The output from the two runs can be seen in/fetched from
http://wwwsearch.ox.ac.uk/h1v/
http://wwwsearch.ox.ac.uk/h2v/
Anticipating the request I have re-run them with more output into
http://wwwsearch.ox.ac.uk/h1vvv/
http://wwwsearch.ox.ac.uk/h2vvv/
In each case the report directory contains the output generated by my
reporting script from htdig.stdout.
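
The check between the two runs amounts to something like the following.
This is only a minimal Python sketch: the function name and the
URL-to-hopcount dictionaries are hypothetical stand-ins for whatever gets
extracted from the htdig output, and the sample URLs are made up rather
than taken from the real runs.

```python
def suspect_hopcounts(first_run, second_run, limit):
    """Return URLs that the deeper run reports at or below `limit` hops
    although the run limited to `limit` hops never indexed them."""
    return sorted(
        url
        for url, hops in second_run.items()
        if hops <= limit and url not in first_run
    )


# Hypothetical data, not taken from the real runs.
run_limit_1 = {
    "http://example.org/": 0,
    "http://example.org/news.html": 1,
}
run_limit_2 = {
    "http://example.org/": 0,
    "http://example.org/news.html": 1,
    "http://example.org/about/": 2,      # fine: first reachable at depth 2
    "http://example.org/blueprint/": 1,  # suspect: depth 1, missing from run 1
}

suspects = suspect_hopcounts(run_limit_1, run_limit_2, limit=1)
print(suspects)
```

Any URL this reports ought to have been picked up by the first run if its
hopcount really were 1.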
If you want more 'v's just ask. Re-indexing is probably best done from
here since I have the server and htdig on the same 100Mbps subnet but I
don't think there are any restricted access files to prevent you
reproducing the results remotely.
I have tried to look at the HTML concerned to see if it has any obvious
oddities but can't see anything. The first page indexed with a bad (i.e. not
incremented to 2) hopcount is -
http://www.ox.ac.uk/blueprint/
- linked from the left panel on http://www.ox.ac.uk/newsf.html
- that file has horrid HTML, the author seems to have an aversion to
closing anchors 8-( which I will take up separately. As htdig manages to
find and follow the links I don't think the bad HTML is contributing to
the bad hopcounts.
I looked at another -
http://www.ox.ac.uk/aboutoxford/spotlight/03.01.shtml
- linked from http://www.ox.ac.uk/aboutoxford/
- that HTML looks cleaner wrt </a> although it still generates a large
number of validation errors.
Over to someone who knows the code, I think ... thanks for anything you can
do to track this down. I think mapping pages by hopcount does give some
feeling for how likely pages are to be found by pure surfing.
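
For reference, the behaviour expected here can be sketched as a plain
breadth-first walk in Python. This is an illustration under stated
assumptions, not htdig's actual code: the function name, the `max_hops`
parameter and the link graph are all hypothetical.

```python
from collections import deque


def hopcounts(links, start, max_hops):
    """Breadth-first walk: each page gets the smallest number of links
    needed to reach it from `start`; pages beyond `max_hops` are skipped."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # do not follow links out of pages at the limit
        for target in links.get(page, []):
            if target not in hops:  # first visit is the shortest path
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


# Hypothetical link graph, not the real site structure.
links = {
    "/": ["/news.html", "/about/"],
    "/news.html": ["/blueprint/"],
    "/about/": ["/about/spotlight.html", "/"],
}

one_hop = hopcounts(links, "/", max_hops=1)
two_hops = hopcounts(links, "/", max_hops=2)
```

With a walk like this, every page that the two-hop run places at hopcount 1
also appears in the one-hop run, which is exactly what the two real runs
fail to show.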
regards,
Malcolm.
[EMAIL PROTECTED] http://users.ox.ac.uk/~malcolm/
Config file for htdig follows although I don't think it contains anything
to trigger the problem. This is the same as my live config except for the
database directory and the commented out limit_urls_to line.
robotstxt_name: oxbot
database_dir: /opt/spider/htdig/hoptest/db/new
remove_bad_urls: false
maintainer: [EMAIL PROTECTED]
htnotify_sender: [EMAIL PROTECTED]
iso_8601: true
server_wait_time: 1
start_url: http://www.ox.ac.uk/
#limit_urls_to: .ox.ac.uk .ousu.org .humbul.ac.uk
server_aliases: www.ecu.ox.ac.uk:80=www.eci.ox.ac.uk:80 \
www.lincoln.ox.ac.uk:80=www.linc.ox.ac.uk:80 \
www.pembroke.ox.ac.uk:80=www.pmb.ox.ac.uk:80 \
www.somerville.ox.ac.uk:80=www.some.ox.ac.uk:80 \
www.st-johns.ox.ac.uk:80=www.sjc.ox.ac.uk:80 \
www.wolfson.ox.ac.uk:80=www.wolf.ox.ac.uk:80 \
www.worcester.ox.ac.uk:80=www.worc.ox.ac.uk:80 \
molbiol.ox.ac.uk:80=www.molbiol.ox.ac.uk:80 \
biochweb.bioch.ox.ac.uk:80=www.bioch.ox.ac.uk:80 \
www.ndcb.ox.ac.uk:80=www.ndcls.ox.ac.uk:80 \
www.ncl.ox.ac.uk:80=www.chem.ox.ac.uk:80 \
hoa.dha.ox.ac.uk:80=www.hoa.ox.ac.uk:80 \
aisuwww.offices.ox.ac.uk:80=www.admin.ox.ac.uk:80 \
oxam.ox.ac.uk:80=www.oxam.ox.ac.uk:80
#exclude_urls: /~ \
# /users \
# /internal \
# www.lincoln.ox.ac.uk \
# www.pembroke.ox.ac.uk \
# www.somerville.ox.ac.uk \
# www.worcester.ox.ac.uk \
# ecu.ox.ac.uk \
exclude_urls: ? \
/cgi-bin/ \
.cgi \
.ps \
.pdf \
.rtf \
.doc \
.ox.ac.uk. \
/internet/news/ \
/oucs/news/ \
/oucs/linux/linux \
phone.lis \
info.ox.ac.uk/dep \
ox.os.linux \
webtest.offices.ox.ac.uk \
mirror \
strubi/strubi \
maillist.ox.ac.uk \
munchkin \
:1100 \
:5000 \
ferret.lmh.ox.ac.uk \
www-jcr.lmh.ox.ac.uk \
www-jcr.linc.ox.ac.uk \
www-student.linc.ox.ac.uk \
jcr.jesus.ox.ac.uk \
student.some.ox.ac.uk \
paul.merton.ox.ac.uk \
madhatter.chch.ox.ac.uk \
enterprise.molbiol.ox.ac.uk \
bioch.ox.ac.uk:8888 \
www.lib.ox.ac.uk:8000 \
irusan.las.ox.ac.uk \
neon.chem.ox.ac.uk \
neon.chemistry.ox.ac.uk \
physchem.ox.ac.uk:8000 \
joule.pcl.ox.ac.uk:8000 \
nimbus.geog.ox.ac.uk \
archive.comlab.ox.ac.uk \
www.softeng.ox.ac.uk:8080 \
al2.physics.ox.ac.uk \
www-pnp.physics.ox.ac.uk \
av2.physics.ox.ac.uk:8080 \
av8.physics.ox.ac.uk \
k2.stcatz.ox.ac.uk \
babel.mml.ox.ac.uk \
sbsnet.mgtstud.ox.ac.uk \
assayfinder.jr2.ox.ac.uk \
erl.ox.ac.uk:8590 \
lannes.ashmol.ox.ac.uk \
info.ox.ac.uk:81 \
www-preview.oucs.ox.ac.uk \
herald.ox.ac.uk \
acedirector.ox.ac.uk \
mercury.oucs.ox.ac.uk \
www.robots.ox.ac.uk/ftp \
kebl1088.keble.ox.ac.uk/ftp \
www2.merton.ox.ac.uk/~security \
security-archive.merton.ox.ac.uk \
wwwsearch.oucs.ox.ac.uk \
wwwsearch.ox.ac.uk
max_head_length: 10000
translate_amp: true
translate_lt_gt: true
translate_quot: true
#
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html