According to [EMAIL PROTECTED]:
> On Wed, Apr 05, 2000 at 12:43:44PM -0500, Gilles Detillieux wrote:
> > According to [EMAIL PROTECTED]:
> > > 'Littérature' returns 54 results, none of which is the page
> > > entitled 'Littérature francophone virtuelle' BUT almost all of which
> > > contain the target string...
> >
> > A few possibilities to look into:
> >
> > 1) the page entitled 'Littérature francophone virtuelle' contains a slightly
> > different spelling of 'Littérature' than your search string. Check the
> > HTML source for the page carefully, to make sure there isn't some difference
> > in accents or spelling.
>
> Double-checked this. The search string and the version in the page are
> the same. In fact, I copied and pasted it directly from my browser window
> into the search page.
> >
> > 2) the SGML entity for the 'é' in the title isn't being converted correctly.
> > There were problems with numeric entities in many 3.2 snapshots and the last
> > beta.
>
> hmmmm... the only problem with this line of thought is that if é
> isn't being properly converted, it wouldn't be converted across
> the entire website, so we'd never see the search string in htdig's
> results... Also, I'm running 3.1.5, not any of the 3.2 snapshots.
What I had in mind was the possibility that a different entity was used
in this title than elsewhere in other documents. That doesn't seem to be
the case.
> > 3) that page was indexed before you had the locale configured correctly,
> > and never reindexed, so the accented letter was lost. Try touching the
> > page's source file and reindexing it, or reindexing from scratch.
>
> Actually, I didn't index this particular site until after
> reconfiguring its locale. I reindexed the site (just to be on the
> safe side) using first htdig -i -c /path/to/config and then htmerge
> -c /path/to/config. The results of an "ALL" search for 'Littérature
> francophone virtuelle' remain the same - 54 results, without the target
> page entitled 'Littérature francophone virtuelle'.
OK, how about creating a different config file that sets start_url to
only the one page that's giving you problems, and perhaps changes
database_dir to avoid clobbering your current database, and then running
"htdig -ivvvvc newconfig.conf" to see what htdig is doing when it parses
the title of this page. Take a look at the resulting db.wordlist as well,
to see if "littérature" (or some mangled form of it) is getting into the
database.
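
If eyeballing db.wordlist is tedious, a small script along these lines
might help. It is only a sketch: it assumes db.wordlist is a plain text
file with one entry per line and the word as the first
whitespace-separated field, and that the site is encoded as ISO-8859-1.
The function name and the path in the comment are made up for
illustration. It lists every entry that starts with the unaccented part
of the word ("litt"), so a mangled or truncated form should show up too.

```python
def find_candidates(lines, target):
    """Return wordlist lines whose first field starts with the ASCII
    prefix of `target` (the part before its first accented character)."""
    # Build the prefix, e.g. 'Littérature' -> 'litt'.
    stem = ""
    for ch in target.lower():
        if ord(ch) > 127:
            break
        stem += ch

    hits = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: the indexed word is the first field on the line.
        if fields[0].lower().startswith(stem):
            hits.append(line.rstrip("\n"))
    return hits


# Example use (hypothetical path; encoding is an assumption):
# with open("/tmp/htdig-test/db.wordlist", encoding="latin-1") as f:
#     for hit in find_candidates(f, "Littérature"):
#         print(hit)
```

If "littérature" appears intact, the word is being indexed and the
problem is more likely on the search side; if only something like "litt"
or "littrature" appears, the accented letter is being lost while parsing.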
By the way, when you say the page is entitled 'Littérature francophone
virtuelle', do you mean the document's <head> section contains
"<title>Littérature francophone virtuelle</title>", or do you mean it's
the main heading (i.e. <h1>) in the document? Are your title_factor
and/or heading_factor_1 non-zero?
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930