According to Lachlan Andrew:
> On Friday 14 February 2003 11:16, Neal Richter wrote:
> 
> > Is there something you can tell us about the type of data you are
> > indexing?  Are they big pages with lots of repetitive information..
> > giving htdig many similar keys which hash/sort to the same pages?
> 
> Greetings Neal,
> 
> I've found one page in the  qt  documentation which may be causing 
> those problems (attached).  I hadn't realised it, but the 
> valid_punctuation  attribute seems to be treated as an *optional* 
> word break.  (The docs say it is *not* a word break, and that seems 
> the intention of  WordType::WordToken...)

I guess the docs haven't kept up with what the code does.  It used to
be that valid_punctuation didn't cause word breaks at all, i.e. these
punctuation characters were valid inside a word, and got stripped
out but didn't break up the word.  However, some time ago this
functionality was extended to also index each word part, so that something
like "post-doctoral" gets indexed as postdoctoral, post and doctoral.
This greatly enhances searches for compound words, or parts thereof,
but it tends to break down when you're indexing something that's not
really words...
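
To make the behaviour described above concrete, here is a rough sketch
in Python.  It is not htdig's actual code (the real logic is in the C++
word handling), and the function name index_terms and its parameters are
made up for illustration.  It only shows the idea: strip the punctuation
to get the compound word, then also emit each part.

```python
def index_terms(token, valid_punctuation="-"):
    """Return the terms one token would contribute under this sketch.

    Hypothetical helper, not part of htdig: characters listed in
    valid_punctuation are stripped to form the compound word, and each
    piece between them is emitted as a term of its own as well.
    """
    parts = []
    current = []
    for ch in token:
        if ch in valid_punctuation:
            # Punctuation ends the current part but is never indexed itself.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))

    # Compound word first, then the individual parts, without duplicates.
    terms = []
    for term in ["".join(parts)] + parts:
        term = term.lower()
        if term and term not in terms:
            terms.append(term)
    return terms


print(index_terms("post-doctoral"))
# ['postdoctoral', 'post', 'doctoral']
```

With a string that is mostly punctuation, such as a line of pixmap data,
the same rule produces one term per fragment, which is where the junk
comes from.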

>  The page has long strings 
> with many valid_punctuation symbols, and gives output like
> 
> elliptical    1060    0       1113    34
> elp   1363    0       131     0
> elphick       1516    0       750     0
> elsbs 1372    0       968     4
> elsbsw        1372    0       968     4
> elsbswp       1372    0       968     4
> elsbswpe      1372    0       968     4
> elsbswpew     1372    0       968     4
> elsbswpewg    1372    0       968     4
> elsbswpewgr   1372    0       968     4
> elsbswpewgrr  1372    0       968     4
> elsbswpewgrr1 1372    0       968     4
> elsbswpewgrr1t        1372    0       968     4
> elsbswpewgrr1twa7     1372    0       968     4
> elsbswpewgrr1twa7z    1372    0       968     4
> elsbswpewgrr1twa7z1bea0       1372    0       968     4
> elsbswpewgrr1twa7z1bea0f      1372    0       968     4
> elsbswpewgrr1twa7z1bea0fk     1372    0       968     4
> elsbswpewgrr1twa7z1bea0fkd    1372    0       968     4
> elsbswpewgrr1twa7z1bea0fkdrbk 1372    0       968     4
> elsbswpewgrr1twa7z1bea0fkdrbke        1372    0       968     4
> elsbswpewgrr1twa7z1bea0fkdrbkezb      1372    0       968     4
> else  225     0       1285    0
> 
> Might that be the trouble?

Well, I would think that if you're going to feed a bunch of C code
into htdig, especially C code containing many pixmaps, then you should
probably do so with a severely stripped down setting of valid_punctuation.
This would speed up the process a lot and get rid of a lot of the spurious
junk that's getting indexed.  However, if the underlying word database is
solid, then it shouldn't fall apart no matter how much junk you throw at
it.  So, this might be the trigger that brings the trouble to the surface,
but the root cause of the trouble seems to be a bug somewhere in the code.
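
As an example of what I mean by a stripped down setting, something along
these lines in the config file used for that particular dig would do.
The characters chosen here are only an illustration, not a tested
recommendation; pick whatever makes sense for your documents.

```
# Sketch only: keep just the hyphen and apostrophe as valid punctuation
# inside words, so that other punctuation is no longer stripped out of
# words and indexed in pieces.
valid_punctuation:	-'
```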

> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly 
> different data set.)

Bummer.  Have you tried running with no compression at all, and if so,
does that work reliably?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
