Re: [Fwd: Re: perl filters for swish-e]

Stas Bekman 21 Feb 2002 15:12:47 -0000

Bill Moseley wrote:

At 06:14 PM 2/21/2002 +0800, Stas Bekman wrote:

hmm, I read in swish docs that you cannot index ':'.


That might have been true once, but I never found a reason for that.  (And
it should not be in the 2.2 docs - is it?)

~/swish-e/src > cat c
wordcharacters $|abcdefghi:
begincharacters $|abcdefghi:
endcharacters $|abcdefghi:

~/swish-e/src > cat 1
a b c d $| abc:def

~/swish-e/src > ./swish-e -i 1 -T indexed_words -v0 -c c
Indexing Data Source: "File-System"
    Adding:[1:swishdefault(1)]   'a'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'b'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'c'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'd'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '$|'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'abc:def'   Pos:6  Stuct:0x1 ( FILE )
Indexing done!

~/swish-e/src > ./swish-e -w '$|' -H0
1000 1 "1" 19

So I can search for "$|".


cool!

I guess I wasn't clear to myself and others about what I meant by searching for Perl code. I don't care much about searching the code sections per se. I care much more about Perl strings found in the text. So I want Apache::Registry to be found and I want $| to be found.

If I understand correctly, if I search for a sub-pattern it'll be found, right? So if I search for $|, I'll find $| and $|++, no?

No.  Swish generates a reverse index.  Therefore it must tokenize the text
into words.  Swish does create two types of indexes, so that you can do
wildcard searches.  So:
    $|* (where * is a wildcard operator) will find $|, $|++ and so on, but
that's not a substring search, but rather finding words that start with $|.
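
A minimal sketch of the difference between a prefix (wildcard) lookup and a substring search on a word index. This is not swish-e's implementation: the whitespace tokenizer and the function names build_index and search are assumptions made only for illustration.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: whitespace-delimited tokens stand in for the real tokenizer.
    return text.split()

def build_index(docs):
    # Map each indexed word to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    # A trailing "*" means "any indexed word starting with this prefix".
    if query.endswith("*"):
        prefix = query[:-1]
        hits = set()
        for word, doc_ids in index.items():
            if word.startswith(prefix):
                hits |= doc_ids
        return hits
    # Otherwise only an exact word match counts.
    return set(index.get(query, ()))

docs = {
    1: "set $| to flush",
    2: "use $|++ here",
    3: "x$|y is one word",
}
index = build_index(docs)

exact = search(index, "$|")      # only the document where "$|" is a whole word
prefixed = search(index, "$|*")  # words starting with "$|", e.g. "$|" and "$|++"
```

Document 3 contains "$|" only inside the word "x$|y", so neither query finds it. That is the case where grep would still match.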

:(

That's why grep would work better in some situations.  But, as I mentioned
before, grep can also be less effective in some cases.

But if you want one search to find both text and perl code then you run
into trouble, because with text you want to remove punctuation to make
searching work correctly.  But those punctuation characters are also used
in perl code, which you want to index.


that's true. but see below

Therefore we want most if not all chars to be indexed. Or at least $%@:-> (search for '$r->args' should be successful).

And we don't want to search for Apache AND Registry, nor Apache OR Registry when I ask for Apache::Registry. I think that's what most people will expect without knowing the internals of the search engine.

Apache::Registry is a bit easier than perl code, because you can say that
":" is ok in the middle of a word, but not at the end.  Then
    "see Apache::Registry"  -- is indexed as a single word, but
    "rules for using foo:"  - "foo" is indexed without ":"
But perl code is not that simple.
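
A rough sketch of the "':' is ok in the middle of a word, but not at the end" rule. The character sets, the lowercasing, and the tokenize function are assumptions for illustration, not swish-e's own code; they only mimic the idea behind separate word, begin and end character settings.

```python
import string

# Assumption: letters plus ":" may appear inside a word,
# but only letters may begin or end one.
WORD_CHARS = set(string.ascii_lowercase + ":")
EDGE_CHARS = set(string.ascii_lowercase)

def tokenize(text):
    # Characters allowed inside a word but not at its edges (here just ":").
    trim = "".join(WORD_CHARS - EDGE_CHARS)
    words, run = [], []
    # The trailing space flushes the last run of word characters.
    for ch in text.lower() + " ":
        if ch in WORD_CHARS:
            run.append(ch)
            continue
        word = "".join(run).strip(trim)
        if word:
            words.append(word)
        run = []
    return words

single = tokenize("see Apache::Registry")
trimmed = tokenize("rules for using foo:")
```

With these sets, "Apache::Registry" survives as one word while the trailing colon of "foo:" is dropped. Perl code such as $r->args would need many more characters in both sets, which is where the rule stops being simple.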
Now, the advantage of NOT including ":" in words is that you can then
search for "registry" and find places where it's "Apache::Registry"
(because that's indexed as two words), and you can still use a phrase
search and find only places where "registry" follows right after "apache",
which would typically find Apache::Registry.  That's more flexible.
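
To illustrate the trade-off, here is a small sketch of a positional word index where ":" is not a word character. The names build_index, word_search and phrase_search are hypothetical, and the letters-only tokenizer is an assumption; swish-e's real phrase matching may work differently.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumption: only letters form words, so "Apache::Registry" becomes two words.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # word -> doc_id -> list of word positions in that document
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def word_search(index, word):
    return set(index.get(word.lower(), {}))

def phrase_search(index, phrase):
    # A document matches when the words occur at consecutive positions.
    words = tokenize(phrase)
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {
    1: "see Apache::Registry",
    2: "the registry of Apache modules",
    3: "Apache docs",
}
index = build_index(docs)

by_word = word_search(index, "registry")
by_phrase = phrase_search(index, "apache registry")
```

A plain search for "registry" reaches both documents that mention it, while the phrase search narrows the result to the one where "registry" directly follows "apache".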
The problem is teaching people how to search. Nobody would expect to have to search for (with quotes) "apache::registry". I'm trying to modify swish so that ranking is adjusted for how close words are together, so a multi-word search (such as [apache registry]) would rank the phrases highest.

OK, how about this solution. We want the search to be user-friendly without RTFM. So we index everything twice: once with English text in mind, and a second time with Perl code in mind. When the user searches, we query both indices and then merge the results, discarding dups.

Is that doable? If it is, will that solve our problem?
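
The merge step could be sketched roughly like this. It assumes each index returns (rank, path) pairs and that ranks from the two indices are comparable, which may not hold in practice; merge_results and the sample paths are hypothetical.

```python
def merge_results(text_hits, code_hits):
    # Keep one entry per path, remembering the best rank seen in either index.
    best = {}
    for rank, path in list(text_hits) + list(code_hits):
        if path not in best or rank > best[path]:
            best[path] = rank
    # Highest rank first; ties are broken by path for a stable order.
    return sorted(
        ((rank, path) for path, rank in best.items()),
        key=lambda hit: (-hit[0], hit[1]),
    )

# Hypothetical results from the "English text" and "Perl code" indices.
text_hits = [(1000, "a.html"), (400, "b.html")]
code_hits = [(700, "b.html"), (300, "c.html")]

merged = merge_results(text_hits, code_hits)
```

Here b.html appears in both result sets but is listed once, with the higher of its two ranks.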

_____________________________________________________________________
Stas Bekman             JAm_pH      --   Just Another mod_perl Hacker
http://stason.org/      mod_perl Guide   http://perl.apache.org/guide
mailto:[EMAIL PROTECTED]  http://ticketmaster.com http://apacheweek.com
http://singlesheaven.com http://perl.apache.org http://perlmonth.com/


