Greetings all, I'm checking through phrase searching, and have found several possible bugs. First, some questions...
1. Why do the documentation for external_parser and the comments before Retriever::got_word both say that the word location must be in the range 0-1000? The HTML parser doesn't stick to that. If locations are just scaled down (rather than reduced modulo 1001), that will break phrase searches. Is there a maximum in practice?

2. Every "meta" data entry (<title>, <meta ...> etc.) is added as if it starts at location 0. This gives *heaps* of false positives, because the second word of *any* entry is deemed adjacent to the first word of any *other* entry. Could we add "meta" information at successive locations starting from, say, location 10,000?

3. With phrase searching, do we still need valid_punctuation? For example, "post-doctoral" currently gets entered as three words at the *same* location: "post", "doctoral" and "postdoctoral". Would it be better to convert post-doctoral into the phrase "post doctoral" in queries, and simply store the words post and doctoral at successive locations in the database? As it stands, a search for "the non-smoker" will match "the smoker", since all the words are given the same position in the database, but a search for "the non smoker" won't match "the non-smoker". The change would also reduce the size of the database (marginally in most cases, but significantly for pathological documents). Now that there is phrase searching, is there any benefit to the current approach?

4. Does anybody know what the existing external parsers do about words shorter than the minimum length? Because they are passed the configuration file, they *could* omit them. Currently the HTML parser omits them, but that introduces false positives into phrase queries, and I want to fix that.

Thanks!
Lachlan
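
P.S. To make points 2 and 3 concrete, here is a toy sketch of a positional index in Python. It is not ht://Dig code: build_index and phrase_match are hypothetical names, the matching rule (consecutive locations) is a simplification, and the gap left between "meta" entries in the last example is my own assumption, not something the current code does.

```python
from collections import defaultdict


def build_index(postings):
    """Map each word to the set of locations at which it occurs."""
    index = defaultdict(set)
    for word, location in postings:
        index[word].add(location)
    return index


def phrase_match(index, phrase):
    """Return True if the words of `phrase` occur at consecutive locations."""
    words = phrase.lower().split()
    if not words:
        return False
    return any(
        all(start + offset in index.get(word, ())
            for offset, word in enumerate(words))
        for start in index.get(words[0], ())
    )


# Point 3, current scheme: the document text "the non-smoker" is stored with
# every part of the hyphenated word at the same location.
same_location = build_index(
    [("the", 0), ("non", 1), ("smoker", 1), ("nonsmoker", 1)]
)
# Point 3, proposed scheme: the parts occupy successive locations.
successive = build_index([("the", 0), ("non", 1), ("smoker", 2)])

print(phrase_match(same_location, "the non smoker"))  # False
print(phrase_match(successive, "the non smoker"))     # True

# Point 2: a title "phrase searching" and a meta description
# "external parser", both entered as if they start at location 0.
meta_at_zero = build_index(
    [("phrase", 0), ("searching", 1), ("external", 0), ("parser", 1)]
)
# Same entries at successive locations from 10,000, with a one-location
# gap between entries (the gap is an assumption of this sketch).
meta_spaced = build_index(
    [("phrase", 10000), ("searching", 10001),
     ("external", 10003), ("parser", 10004)]
)

print(phrase_match(meta_at_zero, "phrase parser"))  # True (false positive)
print(phrase_match(meta_spaced, "phrase parser"))   # False
```

In this toy model the false positives come purely from how locations are assigned; the matching rule itself never changes.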
