At 12:05 PM 03/07/02 +0800, Stas Bekman wrote:
>Hmm, you mean the thread we had before and the outcome were useless?
>I thought we have agreed on putting most of the chars into the 
>WordCharacters variable.

I tested a bunch of settings and had mixed results.  In some cases I gained
the ability to search some types of Perl code successfully, but then some
normal text was no longer searchable as expected.

In that previous thread we also discussed searching two indexes at the same
time, but that doesn't work well since duplicates are returned -- and since
swish returns the results a page at a time you can't remove duplicates very
easily (without asking swish to return all results for every search and
then removing duplicates).
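
To make the duplicate problem concrete, here is a rough Python sketch. It is not how swish does anything; the function name, the (rank, path) tuples, and the file names are all made up for illustration. The point is that both result lists have to be complete before any single page can be built, because a duplicate of a page-one hit in one index may sit far down the other index's list.

```python
def merged_page(results_a, results_b, page, page_size=10):
    """Merge two complete ranked result lists, drop duplicate paths,
    and return one page of (rank, path) tuples.

    Hypothetical helper: each input is a list of (rank, path) tuples,
    higher rank meaning a better hit.
    """
    best = {}
    for rank, path in list(results_a) + list(results_b):
        # Keep only the best rank seen for each document.
        if path not in best or rank > best[path]:
            best[path] = rank
    # Highest rank first; path breaks ties so the order is stable.
    ordered = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    start = (page - 1) * page_size
    return [(rank, path) for path, rank in ordered[start:start + page_size]]


# Made-up results from two indexes of the same documents.
standard = [(1000, "guide/config.html"), (800, "guide/perl.html"),
            (500, "guide/porting.html")]
extended = [(900, "guide/perl.html"), (700, "guide/debug.html")]

print(merged_page(standard, extended, page=1, page_size=3))
```

Every search pays for the full result set from both indexes, even when only the first page is shown.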

>I'm absolutely against trying to come with a 
>list of buzzwords, since it's going to take so much work and you still 
>won't cover all the required "special" words. If we can search for 
>Apache::Registry and $! because they are "words", that's all we need.
>This is what I've suggested to add to WordCharacters: >-$%@:*[]{}|&

You can try those characters on my machine.  (I should move it over to my
fastest machine for better highlighting speed...)  There are three index
selections now: one uses the standard settings, one adds the above
characters to the settings, and one is the standard plus a few buzzwords:

  Buzzwords mod_perl c++ $_ $+ $| $_ @$ @_ $ENV $SIG
  IgnoreFirstChar ({[]}):,
  IgnoreLastChar ({[]}):,.

I think the standard index works best, in general.  But as you might notice
and expect, some things work better in one index while others work better
in a different index.

>Actually it would be cool if swish-e was able to accept char sequences 
>as valid word segments. For example we don't want > or - to be counted 
>as parts of the word, but we do want -> (think $r->no_cache)

$r->no_cache is easy.  You just search for the phrase "$r->no_cache" (with
quotes).  The standard index even finds "$self->{r}->no_cache" with that
search.  

Same with modules names.  I actually like that method (using quotes) best,
as it's most flexible.  You can search for Registry and find
Apache::Registry, and you can search for "Apache::Registry" and find just
that.
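
Why the quoted search works can be sketched in a few lines of Python. This is only a toy model: it assumes that words are runs of letters and digits and that a phrase matches when its words appear consecutively in the document. The real swish-e settings and matching rules may differ, and the function names are invented.

```python
import re

# Assumption: only letters and digits are word characters.
WORD = re.compile(r"[A-Za-z0-9]+")


def words(text):
    """Split text into lowercased words, dropping everything else."""
    return [w.lower() for w in WORD.findall(text)]


def phrase_match(query, document):
    """True if the query's words appear consecutively in the document."""
    q = words(query)
    d = words(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


print(words("$r->no_cache"))
print(phrase_match("$r->no_cache", "call $self->{r}->no_cache(1) here"))
print(phrase_match("Apache::Registry", "Registry scripts under Apache"))
```

Since the punctuation is thrown away on both sides, "$r->no_cache" and "$self->{r}->no_cache" end with the same run of words, while a plain search for Registry still matches the second word of Apache::Registry.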

The advantage of using buzzwords is you can search for things like $|, but
still use a standard index which is more flexible.  It won't find all
occurrences of $|, but it will find some.

Try some searches.  Here are some examples:

                            Search Results
           standard index   with >-$%@:*[]{}|&    with buzzword mod_perl
mod_perl         912              878                 866
"mod_perl"       904              852                 866

So the difference between 904 and 852 is that adding ">-$%@:*[]{}|&" to
WordCharacters causes some of the "mod_perl" words to be indexed differently
(probably as mod_perl:) and thus not found.
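
A toy tokenizer in Python shows the effect. It assumes the base word characters are just letters and digits and that anything else ends a word; the real indexer is more involved, and the names here are invented for the example.

```python
import string

# Assumption: the base word characters are letters and digits only.
BASE = set(string.ascii_letters + string.digits)
EXTRA = ">-$%@:*[]{}|&"  # the characters suggested for WordCharacters


def tokenize(text, extra_chars=""):
    """Split text into lowercased runs of allowed characters."""
    allowed = BASE | set(extra_chars)
    tokens, current = [], []
    for ch in text:
        if ch in allowed:
            current.append(ch)
        elif current:
            tokens.append("".join(current).lower())
            current = []
    if current:
        tokens.append("".join(current).lower())
    return tokens


text = "Why mod_perl: speed"
print(tokenize(text))         # standard-style split
print(tokenize(text, EXTRA))  # the ":" now sticks to "perl"
```

With the extra characters the indexed word becomes "perl:", so a query that tokenizes to plain "perl" no longer matches that occurrence.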

$env works differently:

           standard index   with >-$%@:*[]{}|&    with buzzword mod_perl
env              42               15                     42
$env             42                1                      1

The standard index may give (a lot of) false hits, but that's better than
missing ones, in some cases.


           standard index   with >-$%@:*[]{}|&    with buzzword mod_perl
$ENV{MOD_PERL}    32              8                   22
"$ENV{MOD_PERL}"   9              8                    9

That one shows why it would be nice to adjust ranking on how close words
are together in a document.  Searching $ENV{MOD_PERL} is really searching
three words, so it would be helpful if the docs where those words are close
together ranked highest -- so of those 32 hits, the top 9 in rank would be
the real ones.
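
One way such a proximity ranking could work, sketched in Python: score each document by the shortest run of words that contains every search term, and list the tightest matches first. This is only an illustration of the idea, not something swish does; the function names and sample documents are made up.

```python
def smallest_span(doc_words, terms):
    """Length of the shortest window of doc_words containing every term,
    or None if some term is missing."""
    needed = set(terms)
    if not needed:
        return None
    counts = {}  # term -> occurrences inside the current window
    best = None
    left = 0
    for right, word in enumerate(doc_words):
        if word in needed:
            counts[word] = counts.get(word, 0) + 1
        # Shrink from the left while the window still holds every term.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            leaving = doc_words[left]
            if leaving in counts:
                counts[leaving] -= 1
                if counts[leaving] == 0:
                    del counts[leaving]
            left += 1
    return best


def rank_by_proximity(docs, terms):
    """Names of documents containing all terms, tightest span first."""
    scored = []
    for name, doc_words in docs.items():
        span = smallest_span(doc_words, terms)
        if span is not None:
            scored.append((span, name))
    return [name for span, name in sorted(scored)]


docs = {
    "a.html": ["if", "env", "mod", "perl", "is", "set"],
    "b.html": ["env", "vars", "are", "useful", "and", "mod",
               "rewrite", "is", "not", "perl"],
    "c.html": ["env", "and", "mod", "only"],
}
print(rank_by_proximity(docs, ["env", "mod", "perl"]))
```

A document where env, mod and perl sit next to each other gets a span of 3 and sorts ahead of one where the same words are scattered across the page.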

>Also you were talking about highlighting. Are you talking about 
>highlighting for the snippets presented on the hits page? Or are you 
>working on highlighting in the real 'full' text?

Highlighting for the snippets.  I feel that search results are supposed to
help find the page you are looking for, not specific words in the final
document.  When I use Google's cached pages I find the highlighted words
awkward and in the way.

Besides, our search results take you to the section of the document anyway,
so I see no reason for highlighting the final doc.  Also, we don't want
dynamic content generation (though it would make dealing with Netscape
easier ;).  Hey, I've got some JavaScript that will highlight terms on a
page ;)

Sorry for rambling on so long.

-- 
Bill Moseley
mailto:[EMAIL PROTECTED]
