According to Lachlan Andrew:
> On Friday 14 February 2003 11:16, Neal Richter wrote:
>
> > Is there something you can tell us about the type of data you are
> > indexing? Are they big pages with lots of repetitive information..
> > giving htdig many similar keys which hash/sort to the same pages?
>
> Greetings Neal,
>
> I've found one page in the qt documentation which may be causing
> those problems (attached). I hadn't realised it, but the
> valid_punctuation attribute seems to be treated as an *optional*
> word break. (The docs say it is *not* a word break, and that seems
> the intention of WordType::WordToken...)

I guess the docs haven't kept up with what the code does. It used to be that valid_punctuation didn't cause word breaks at all, i.e. these punctuation characters were valid inside a word, and got stripped out but didn't break up the word. However, for some time now, this functionality was extended to also index each word part, so that something like "post-doctoral" gets indexed as postdoctoral, post and doctoral. This greatly enhances searches for compound words, or parts thereof, but it tends to break down when you're indexing something that's not really words...

> The page has long strings
> with many valid_punctuation symbols, and gives output like
>
> elliptical                         1060  0  1113  34
> elp                                1363  0   131   0
> elphick                            1516  0   750   0
> elsbs                              1372  0   968   4
> elsbsw                             1372  0   968   4
> elsbswp                            1372  0   968   4
> elsbswpe                           1372  0   968   4
> elsbswpew                          1372  0   968   4
> elsbswpewg                         1372  0   968   4
> elsbswpewgr                        1372  0   968   4
> elsbswpewgrr                       1372  0   968   4
> elsbswpewgrr1                      1372  0   968   4
> elsbswpewgrr1t                     1372  0   968   4
> elsbswpewgrr1twa7                  1372  0   968   4
> elsbswpewgrr1twa7z                 1372  0   968   4
> elsbswpewgrr1twa7z1bea0            1372  0   968   4
> elsbswpewgrr1twa7z1bea0f           1372  0   968   4
> elsbswpewgrr1twa7z1bea0fk          1372  0   968   4
> elsbswpewgrr1twa7z1bea0fkd         1372  0   968   4
> elsbswpewgrr1twa7z1bea0fkdrbk      1372  0   968   4
> elsbswpewgrr1twa7z1bea0fkdrbke     1372  0   968   4
> elsbswpewgrr1twa7z1bea0fkdrbkezb   1372  0   968   4
> else                                225  0  1285   0
>
> Might that be the trouble?

Well, I would think that if you're going to feed a bunch of C code into htdig, especially C code containing many pixmaps, then you should probably do so with a severely stripped down setting of valid_punctuation. This would speed up the process a lot and get rid of a lot of the spurious junk that's getting indexed. However, if the underlying word database is solid, then it shouldn't fall apart no matter how much junk you throw at it. So, this might be the trigger that brings the trouble to the surface, but the root cause of the trouble seems to be a bug somewhere in the code.
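
To make the word-part behaviour described above concrete, here is a rough Python sketch of it. This is only an illustration, not htdig's actual code (which is C++): the function name `index_terms` is hypothetical, and the default punctuation set is just an assumed example value, so check your own valid_punctuation setting rather than relying on it.

```python
def index_terms(token, valid_punctuation=".-_/!#$%^&'"):
    """Return the terms a token would yield under the behaviour described
    above: the whole word with punctuation stripped, plus each word part.

    Hypothetical helper; the default punctuation set is an assumption.
    """
    parts = []
    current = []
    for ch in token:
        if ch in valid_punctuation:
            # A valid_punctuation character ends the current word part.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))

    if not parts:
        return []

    # The compound form: punctuation stripped, parts joined together.
    terms = ["".join(parts)]
    # The parts are only added when the token really was split.
    if len(parts) > 1:
        terms.extend(parts)
    return [t.lower() for t in terms]


if __name__ == "__main__":
    print(index_terms("post-doctoral"))
    print(index_terms("a.b#c$d"))
```

With an ordinary compound such as "post-doctoral" this gives three terms, but a string of pixmap-like data full of punctuation yields a term for every fragment, which is why a stripped-down valid_punctuation helps when indexing C code.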

> (BTW, zlib 1.1.4 is still giving errors, albeit for a slightly
> different data set.)

Bummer. Have you tried running with no compression at all, and if so, does that work reliably?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
