Thanks for your explanations, Geoff :)  More questions follow.

On Saturday 01 March 2003 04:51, Geoff Hutchison wrote:
> > 1. location must be in the range 0-1000?
> That's a 3.1-ism.
>
> > 2. Could we add "meta" information
> > at successive locations starting from, say, location 10,000?
>
> Actually, now that I think about it, a better idea is to use
> negative word locations for META information.
> As for some other arbitrary
> number--we might actually have documents that long (esp. with PDF
> indexing).

That could have its own problems.  If they are labelled -1, -2, ... then
phrase searching would have to match *backwards* for negative numbers.
Then if true positions overflowed into negative numbers, the phrases
wouldn't match.  (If such overflow is impossible with n-bit numbers, we
could use *unsigned* locations, and count forward from 2^(n-1) for meta
information.)  If we count *forward* from a very negative number, then it
is essentially starting from a very large (unsigned) location.  Thoughts?

> > 3. With phrase searching, do we still need valid_punctuation?
> > For example, "post-doctoral"
>
> This is a strange example. What if I had a hyphenated word? I don't
> know that your "phrase conversion" is the best solution. What we do
> need is a flexible "word parser" that addresses some of these
> issues.

I suppose a key is how often people do phrase searches vs word searches.
Optionally-hyphenated words are trouble-prone since the status-quo gives
oh-so-many false-negatives for non-hyphenated phrase-queries applied to
over-hyphenated text...  (The suggestion was based on what Google does.)

Regarding flexibility, we could make htsearch treat words separated by
"invalid" punctuation (but no spaces) as a phrase, and make the default
valid_punctuation empty.  That way people who want the current
functionality can have it (except for queries where words are not
separated by spaces but *should* match those words separately?) but the
default would be less buggy for phrase searches.

> For some people, punctuation has meaning. Let's say we have part
> numbers or dates. "3/24/03" isn't really the same as "32403" and
> I'm not sure the phrase search works well either.

Ah, yes.  All three pieces ("3", "24", "03") would be too short to be
indexed...  But isn't that what extra_word_characters is for?

> > 4. Does anybody know what the existing external parsers do about
> > words less than the minimum length?
> I don't think most external parsers bother with the config file.
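
P.S.  To make the numbering idea from point 2 concrete, here is a rough
sketch of the *unsigned* scheme (meta words counted forward from 2^(n-1)).
It is in Python only for brevity, and nothing in it comes from the htdig
sources: LOCATION_BITS, META_BASE and all four function names are made up
for illustration, and the 16-bit width is just an assumption.

```python
from typing import Sequence

# Hypothetical sketch: none of these names exist in ht://Dig.
LOCATION_BITS = 16                     # assumed width of a stored word location
META_BASE = 1 << (LOCATION_BITS - 1)   # 2^(n-1): first location reserved for meta words


def body_location(position: int) -> int:
    """Location of the position-th body word (0-based)."""
    if not 0 <= position < META_BASE:
        # Refuse to let a very long document spill into the meta range.
        raise OverflowError("body position would collide with the meta range")
    return position


def meta_location(position: int) -> int:
    """Location of the position-th meta word, counting forward from META_BASE."""
    if not 0 <= position < META_BASE:
        raise OverflowError("meta position does not fit in the location width")
    return META_BASE + position


def is_meta(location: int) -> bool:
    """True when the location falls in the range reserved for meta words."""
    return location >= META_BASE


def phrase_matches(locations: Sequence[int]) -> bool:
    """Forward phrase match: each word sits one location after the previous
    one, and the whole phrase stays inside either the body or the meta range."""
    same_range = len({is_meta(loc) for loc in locations}) <= 1
    consecutive = all(b - a == 1 for a, b in zip(locations, locations[1:]))
    return same_range and consecutive
```

With this layout the ordinary forward matcher handles meta phrases as well,
whereas the labels -1, -2 fail the same test because each step is -1 rather
than +1.  A body word at the last body location followed by the first meta
word is also rejected, so a phrase cannot straddle the two ranges.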
_______________________________________________
htdig-dev mailing list
https://lists.sourceforge.net/lists/listinfo/htdig-dev
