[htdig] Errors in reported hopcounts

Malcolm Austen Wed, 14 Mar 2001 06:10:46 -0800
On Tue, 13 Mar 2001, Malcolm Austen wrote:

+ On Mon, 12 Mar 2001, Gilles Detillieux wrote:
+ 
+ + Well, if you have a simple test set of data that produces this problem
+ + in 3.1.5, then please do bore us with it.  Even though there have been
+ + substantial changes for 3.2, much of those have been backported to 3.1.5,
+ + so if the problem remains in 3.1.x but not in 3.2.x, I'd like to know what
+ + the cause is, so we can address this if/when we start working on 3.1.6.

Gilles, (and anyone else who cares to try to resolve this!!)

I got worried yesterday afternoon that I was not going to be able to
reproduce the fault without indexing 20,000 documents. Fortunately I did
manage it with just one server and (just) under 600 documents.

I have indexed (config file at the end of this message) with a hopcount
limit of one and then again with a hopcount limit of two. The second run
reports some 299 documents with a hopcount of 1 that were not indexed in
the first run, so they should have been reported with a hopcount of 2.
The output from the two runs can be seen in/fetched from

        http://wwwsearch.ox.ac.uk/h1v/
        http://wwwsearch.ox.ac.uk/h2v/

Anticipating the request I have re-run them with more output into

        http://wwwsearch.ox.ac.uk/h1vvv/
        http://wwwsearch.ox.ac.uk/h2vvv/

In each case the report directory contains the output generated by my
reporting script from htdig.stdout.
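
The comparison described above (documents that the second run reports
with a hopcount within the first run's limit, yet which the first run
never indexed) can be sketched roughly as follows. This is only an
illustration: the function name, the dict-of-URL-to-hopcount input and
the example.org data are assumptions, not the actual reporting script or
the real output of the two runs.

```python
def suspicious_hopcounts(run1, run2, max_hops_run1=1):
    """Return URLs that run2 reports within run1's hop limit
    even though run1 never indexed them.

    run1 and run2 are hypothetical dicts mapping URL -> reported hopcount.
    """
    return sorted(
        url
        for url, hops in run2.items()
        if hops <= max_hops_run1 and url not in run1
    )


# Made-up data for illustration only, not taken from the real runs.
run1 = {
    "http://www.example.org/": 0,
    "http://www.example.org/a.html": 1,
}
run2 = {
    "http://www.example.org/": 0,
    "http://www.example.org/a.html": 1,
    "http://www.example.org/a/b.html": 1,  # only reachable via a.html, so 2 was expected
    "http://www.example.org/a/c.html": 2,
}

print(suspicious_hopcounts(run1, run2))
# prints ['http://www.example.org/a/b.html']
```

A document flagged this way is not proof of a bug on its own (a page
could fail to fetch in one run only), but 299 of them out of under 600
documents would be hard to explain that way.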

If you want more 'v's just ask. Re-indexing is probably best done from
here, since I have the server and htdig on the same 100 Mbps subnet, but
I don't think there are any restricted-access files that would prevent
you from reproducing the results remotely.

I have looked at the HTML concerned to see whether it has any obvious
oddities but can't see anything. The first page indexed with a bad (i.e.
not incremented to 2) hopcount is
        http://www.ox.ac.uk/blueprint/
 - it is linked from the left panel on http://www.ox.ac.uk/newsf.html
 - that file has horrid HTML; the author seems to have an aversion to
   closing anchors 8-( which I will take up separately.
As htdig manages to find and follow the links, I don't think the bad HTML
is contributing to the bad hopcounts.

I looked at another:
        http://www.ox.ac.uk/aboutoxford/spotlight/03.01.shtml
 - it is linked from http://www.ox.ac.uk/aboutoxford/
 - that HTML looks cleaner with respect to closing </a> tags, although
   it still generates a large number of validation errors.

Over to someone who knows the code, I think ... thanks for anything you
can do to track this down. I think mapping pages by hopcount does give
some feeling for how likely pages are to be found by pure surfing.
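
For reference, the behaviour I am expecting can be sketched as a
breadth-first walk of the link graph. This assumes that a hopcount is
meant to be the smallest number of links from the start URL; it is not
htdig's actual code, and the link graph and paths below are made up.

```python
from collections import deque


def expected_hopcounts(links, start, max_hops):
    """Breadth-first search giving each reachable page its minimum hop
    distance from start, without going beyond max_hops.

    links is a hypothetical dict mapping a page to the pages it links to.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # pages at the limit are indexed but not followed
        for target in links.get(page, []):
            if target not in hops:  # first visit is the shortest route
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


# Made-up link graph for illustration only.
links = {
    "/": ["/news.html", "/about/"],
    "/news.html": ["/blueprint/"],
    "/about/": ["/about/spotlight.html"],
}

print(expected_hopcounts(links, "/", 1))
# prints {'/': 0, '/news.html': 1, '/about/': 1}
print(expected_hopcounts(links, "/", 2))
# prints {'/': 0, '/news.html': 1, '/about/': 1, '/blueprint/': 2, '/about/spotlight.html': 2}
```

Under that assumption, raising the limit from 1 to 2 should only add
pages with a hopcount of 2; it should never add new pages at hopcount 1.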

regards,
        Malcolm.

 [EMAIL PROTECTED]     http://users.ox.ac.uk/~malcolm/

Config file for htdig follows, although I don't think it contains anything
to trigger the problem. This is the same as my live config except for the
database directory and the commented-out limit_urls_to line.

robotstxt_name:         oxbot
database_dir:           /opt/spider/htdig/hoptest/db/new
remove_bad_urls:        false
maintainer:             [EMAIL PROTECTED]
htnotify_sender:        [EMAIL PROTECTED]
iso_8601:               true
server_wait_time:       1

start_url:              http://www.ox.ac.uk/

#limit_urls_to:          .ox.ac.uk .ousu.org .humbul.ac.uk

server_aliases:         www.ecu.ox.ac.uk:80=www.eci.ox.ac.uk:80 \
                        www.lincoln.ox.ac.uk:80=www.linc.ox.ac.uk:80 \
                        www.pembroke.ox.ac.uk:80=www.pmb.ox.ac.uk:80 \
                        www.somerville.ox.ac.uk:80=www.some.ox.ac.uk:80 \
                        www.st-johns.ox.ac.uk:80=www.sjc.ox.ac.uk:80 \
                        www.wolfson.ox.ac.uk:80=www.wolf.ox.ac.uk:80 \
                        www.worcester.ox.ac.uk:80=www.worc.ox.ac.uk:80 \
                        molbiol.ox.ac.uk:80=www.molbiol.ox.ac.uk:80 \
                        biochweb.bioch.ox.ac.uk:80=www.bioch.ox.ac.uk:80 \
                        www.ndcb.ox.ac.uk:80=www.ndcls.ox.ac.uk:80 \
                        www.ncl.ox.ac.uk:80=www.chem.ox.ac.uk:80 \
                        hoa.dha.ox.ac.uk:80=www.hoa.ox.ac.uk:80 \
                        aisuwww.offices.ox.ac.uk:80=www.admin.ox.ac.uk:80 \
                        oxam.ox.ac.uk:80=www.oxam.ox.ac.uk:80

#exclude_urls:          /~ \
#                       /users \
#                       /internal \
#                       www.lincoln.ox.ac.uk \
#                       www.pembroke.ox.ac.uk \
#                       www.somerville.ox.ac.uk \
#                       www.worcester.ox.ac.uk \
#                       ecu.ox.ac.uk \

exclude_urls:           ? \
                        /cgi-bin/ \
                        .cgi \
                        .ps \
                        .pdf \
                        .rtf \
                        .doc \
                        .ox.ac.uk. \
                        /internet/news/ \
                        /oucs/news/ \
                        /oucs/linux/linux \
                        phone.lis \
                        info.ox.ac.uk/dep \
                        ox.os.linux \
                        webtest.offices.ox.ac.uk \
                        mirror \
                        strubi/strubi \
                        maillist.ox.ac.uk \
                        munchkin \
                        :1100 \
                        :5000 \
                        ferret.lmh.ox.ac.uk \
                        www-jcr.lmh.ox.ac.uk \
                        www-jcr.linc.ox.ac.uk \
                        www-student.linc.ox.ac.uk \
                        jcr.jesus.ox.ac.uk \
                        student.some.ox.ac.uk \
                        paul.merton.ox.ac.uk \
                        madhatter.chch.ox.ac.uk \
                        enterprise.molbiol.ox.ac.uk \
                        bioch.ox.ac.uk:8888 \
                        www.lib.ox.ac.uk:8000 \
                        irusan.las.ox.ac.uk \
                        neon.chem.ox.ac.uk \
                        neon.chemistry.ox.ac.uk \
                        physchem.ox.ac.uk:8000 \
                        joule.pcl.ox.ac.uk:8000 \
                        nimbus.geog.ox.ac.uk \
                        archive.comlab.ox.ac.uk \
                        www.softeng.ox.ac.uk:8080 \
                        al2.physics.ox.ac.uk \
                        www-pnp.physics.ox.ac.uk \
                        av2.physics.ox.ac.uk:8080 \
                        av8.physics.ox.ac.uk \
                        k2.stcatz.ox.ac.uk \
                        babel.mml.ox.ac.uk \
                        sbsnet.mgtstud.ox.ac.uk \
                        assayfinder.jr2.ox.ac.uk \
                        erl.ox.ac.uk:8590 \
                        lannes.ashmol.ox.ac.uk \
                        info.ox.ac.uk:81 \
                        www-preview.oucs.ox.ac.uk \
                        herald.ox.ac.uk \
                        acedirector.ox.ac.uk \
                        mercury.oucs.ox.ac.uk \
                        www.robots.ox.ac.uk/ftp \
                        kebl1088.keble.ox.ac.uk/ftp \
                        www2.merton.ox.ac.uk/~security \
                        security-archive.merton.ox.ac.uk \
                        wwwsearch.oucs.ox.ac.uk \
                        wwwsearch.ox.ac.uk

max_head_length:        10000

translate_amp:          true
translate_lt_gt:        true
translate_quot:         true
#



_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html