At 12:05 PM 03/07/02 +0800, Stas Bekman wrote:
>Hmm, you mean the thread we had before and the outcome were useless?
>I thought we have agreed on putting most of the chars into the
>WordCharacters variable.

I tested a bunch of settings and had mixed results. In some cases I'd
gain the ability to search some types of Perl code successfully, but
then some normal text was not searchable as expected.

In that previous thread we also discussed searching two indexes at the
same time, but that doesn't work well since duplicates are returned --
and since swish returns the results a page at a time you can't remove
duplicates very easily (without asking swish to return all results for
every search and then removing duplicates).

>I'm absolutely against trying to come with a
>list of buzzwords, since it's going to take so much work and you still
>won't cover all the required "special" words. If we can search for
>Apache::Registry and $! because they are "words", that's all we need.
>This is what I've suggested to add to WordCharacters:
>-$%@:*[]{}|&

You can try those characters on my machine. (I should move it over to
my fastest machine for better highlighting speed...) There are three
index selections now: one uses the standard settings, one adds the
above characters to the settings, and one is the standard plus a few
buzzwords:

    Buzzwords mod_perl c++ $_ $+ $| $_ @$ @_ $ENV $SIG
    IgnoreFirstChar ({[]}):,
    IgnoreLastChar ({[]}):,.

I think the standard index works best, in general. But as you might
notice and expect, some things work better in one index while others
work better in a different index.

>Actually it would be cool if swish-e was able to accept char sequences
>as valid word segments. For example we don't want > or - to be counted
>as parts of the word, but we do want -> (think $r->no_cache)

$r->no_cache is easy. You just search for the phrase "$r->no_cache"
(with quotes). The standard index even finds "$self->{r}->no_cache"
with that search. Same with module names. I actually like that method
(using quotes) best, as it's the most flexible. You can search for
Registry and find Apache::Registry, and you can search for
"Apache::Registry" and find just that.

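The phrase-search behaviour described above can be illustrated with a toy tokenizer. This is only a sketch, not swish-e's actual indexing code: it assumes the standard index treats just letters and digits as word characters (which is consistent with $ENV{MOD_PERL} being searched as three words, but is still an assumption), and the names tokenize and phrase_match are made up for the example.

```python
import re

# ASSUMPTION: the "standard" index splits on everything except letters
# and digits. swish-e's real default WordCharacters may differ.
STANDARD_CHARS = "a-zA-Z0-9"
# The extra characters proposed in the thread for WordCharacters.
EXTRA_CHARS = ">-$%@:*[]{}|&"


def tokenize(text, extra_chars=""):
    """Split text into lowercase words built from the allowed characters."""
    char_class = "[" + STANDARD_CHARS + re.escape(extra_chars) + "]+"
    return [w.lower() for w in re.findall(char_class, text)]


def phrase_match(phrase, document, extra_chars=""):
    """True if the words of the phrase occur consecutively in the document."""
    needle = tokenize(phrase, extra_chars)
    hay = tokenize(document, extra_chars)
    n = len(needle)
    if n == 0:
        return False
    return any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))
```

Under these assumptions "$r->no_cache" becomes the words r, no, cache, so the quoted phrase also matches "$self->{r}->no_cache". With the extra characters added, "mod_perl:" yields the word "perl:" rather than "perl", which is one way a plain mod_perl search could miss it.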
The advantage of using buzzwords is that you can search for things like
$|, but still use a standard index, which is more flexible. It won't
find all occurrences of $|, but it will find some.

Try some searches. Here are some examples:

Search Results

                    standard   with             with buzzword
                    index      >-$%@:*[]{}|&    mod_perl
  mod_perl            912        878              866
  "mod_perl"          904        852              866

So the difference between 904 and 852 is that the ">-$%@:*[]{}|&" is
causing some of the "mod_perl" words to be indexed differently
(probably mod_perl:) and thus not found.

$env works differently:

                    standard   with             with buzzword
                    index      >-$%@:*[]{}|&    mod_perl
  env                  42         15               42
  $env                 42          1                1

The standard index may give (a lot of) false hits, but that's better
than missing ones, in some cases.

                    standard   with             with buzzword
                    index      >-$%@:*[]{}|&    mod_perl
  $ENV{MOD_PERL}       32          8               22
  "$ENV{MOD_PERL}"      9          8                9

That one shows why it would be nice to adjust ranking on how close
words are together in a document. Searching $ENV{MOD_PERL} is really
searching three words, so it would be helpful if the docs where those
words are close together ranked highest -- so of those 32 hits, the top
9 in rank would be the real ones.

>Also you were talking about highlighting. Are you talking about
>highlighting for the snippets presented on the hits page? Or are you
>working on highlighting in the real 'full' text?

Highlighting for the snippets. I feel that search results are supposed
to help find the page you are looking for, not specific words on the
final document. When I use Google's cached pages I find the highlighted
words awkward and in the way. Besides, our search results take you to
the section of the document anyway, so I see no reason for highlighting
the final doc. Besides, we don't want dynamic content generation --
(but it would make dealing with Netscape easier ;). Hey, I've got some
JavaScript that will highlight terms on a page ;)

Sorry for rambling on so long.

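To make the proximity-ranking idea above concrete, here is a minimal sketch. It is not something swish-e does as described in this thread. It assumes words are runs of letters and digits, and the helper names (words, smallest_span, rank_by_proximity) and the sample documents in the usage note are invented for illustration.

```python
import re


def words(text):
    # ASSUMPTION: only letters and digits form words, so
    # "$ENV{MOD_PERL}" becomes the three words env, mod, perl.
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]


def smallest_span(doc_words, terms):
    """Length of the shortest run of doc_words containing every term.

    Returns None if some term never occurs.
    """
    needed = set(terms)
    last_seen = {}  # term -> most recent position
    best = None
    for pos, word in enumerate(doc_words):
        if word in needed:
            last_seen[word] = pos
            if len(last_seen) == len(needed):
                # Shortest window that ends at this position.
                span = pos - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def rank_by_proximity(docs, query):
    """Names of docs containing all query words, tightest span first."""
    terms = words(query)
    scored = []
    for name, text in docs.items():
        span = smallest_span(words(text), terms)
        if span is not None:
            scored.append((span, name))
    return [name for span, name in sorted(scored)]
```

With a hypothetical document that contains the literal text $ENV{MOD_PERL} and another that only mentions env, mod and perl in separate sentences, the first gets a span of 3 and sorts to the top, which is the effect wanted for those 9 real hits out of 32.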
--
Bill Moseley
mailto:[EMAIL PROTECTED]