On Tuesday 18 February 2003 12:22, Jim Cole wrote:
> On Monday, February 17, 2003, at 05:49 PM, Adam Brown wrote:
> > "womans" does not appear in the db.wordlist nor does it appear in the
> > -vvvv output (see the attached rundig output). I have run this with both
> > the default 'valid_punctuation' and my customised one.
> >
> > Can't work out why it would be indexing "womans" as "woman". Why does it
> > trim the "s"?
>
> Based on the output you attached, it appears that the problem is one of
> encoding. The pages are not using an ASCII character for the
> apostrophe. Instead, the pages are using a Microsoft Windows Latin-1
> extension (a hex 92). I guess you either need to fix the pages or try
> writing a hex 92 into your valid_punctuation attribute. I am not sure
> if a real hex 92 will have an adverse effect on the parse. If there is
> a way to encode such characters in the attribute, I am not aware of it.
>
> Jim
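
For anyone who wants to check the encoding point above outside of ht://Dig,
here is a minimal Python sketch. It only shows how a hex 92 byte decodes under
two common single-byte encodings and how it could be normalised to an ASCII
apostrophe before indexing. The sample bytes and the helper name
normalize_apostrophes are made up for illustration; this says nothing about how
ht://Dig itself parses the page.

```python
# Hypothetical sample: "woman's" with the apostrophe stored as byte 0x92.
raw = b"woman\x92s"

# Windows-1252 maps 0x92 to U+2019 (RIGHT SINGLE QUOTATION MARK).
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps 0x92 to U+0092, a C1 control character, not an apostrophe.
as_latin1 = raw.decode("latin-1")


def normalize_apostrophes(data: bytes) -> bytes:
    """Replace the byte 0x92 with an ASCII apostrophe (0x27).

    Assumption: the input is in a single-byte encoding such as Windows-1252.
    In UTF-8 data, 0x92 can be part of a multi-byte sequence, so a blind
    replacement like this would corrupt the text.
    """
    return data.replace(b"\x92", b"'")


fixed = normalize_apostrophes(raw)

print(repr(as_cp1252))
print(repr(as_latin1))
print(fixed)
```

If the pages really are Windows-1252, running them through something like this
before indexing would be one way to "fix the pages" without editing them by
hand.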

Thanks for the suggestion. However, if you look at the 'title' meta tag for
the page, you will see that the word "womans" is used without an apostrophe,
and this instance is being indexed as "woman". It's got me stumped. Can
someone point me to the correct location in the source where I can check what
is going on?

Ad

