At 06:14 PM 2/21/2002 +0800, Stas Bekman wrote:
hmm, I read in swish docs that you cannot index ':'.
That might have been true once, but I never found a reason for that. (And it should not be in the 2.2 docs - is it?)
~/swish-e/src > cat c
wordcharacters $|abcdefghi:
begincharacters $|abcdefghi:
endcharacters $|abcdefghi:

~/swish-e/src > cat 1
a b c d $| abc:def

~/swish-e/src > ./swish-e -i 1 -T indexed_words -v0 -c c
Indexing Data Source: "File-System"
Adding:[1:swishdefault(1)] 'a' Pos:1 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'b' Pos:2 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'c' Pos:3 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'd' Pos:4 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] '$|' Pos:5 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'abc:def' Pos:6 Stuct:0x1 ( FILE )
Indexing done!

~/swish-e/src > ./swish-e -w '$|' -H0
1000 1 "1" 19
So I can search for "$|".
cool!
I guess I wasn't clear to myself and others about what I meant by searching for Perl code. I don't care much about searching the code sections per se. I care about Perl strings found in the text. So I want Apache::Registry to be found and I want $| to be found.
If I understand correctly if I search for a sub-pattern it'll be found, right? So if I search for $|, I'll find $| and $|++, no?
No. Swish generates a reverse index. Therefore it must tokenize the text into words. Swish does create two types of indexes, so that you can do wildcard searches. So:
$|* (where * is a wildcard operator) will find $|; $|++ and so on, but that's not a sub string search, but rather finding words that start with $|.
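The difference between a prefix (wildcard) match and a substring match can be sketched in a few lines of Python. This is only a toy model, not swish-e's actual implementation: the functions `tokenize`, `build_index` and `search`, the three sample documents, and the extended character set (the letters a-z plus `$|+:`) are all assumptions made for illustration.

```python
from collections import defaultdict

def tokenize(text, word_chars):
    """Split text into maximal runs of characters found in word_chars."""
    words, current = [], []
    for ch in text.lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

def build_index(docs, word_chars):
    """Map each indexed word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text, word_chars):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Exact word lookup, or prefix lookup when the query ends in '*'."""
    if query.endswith("*"):
        prefix = query[:-1]
        hits = set()
        for word, doc_ids in index.items():
            if word.startswith(prefix):
                hits |= doc_ids
        return hits
    return set(index.get(query, set()))

# Assumed character set: letters plus the punctuation we want to index.
WORD_CHARS = set("abcdefghijklmnopqrstuvwxyz$|+:")

# Hypothetical documents.
DOCS = {
    "a": "set $| to flush",
    "b": "or use $|++ instead",
    "c": "foo$|bar",
}
INDEX = build_index(DOCS, WORD_CHARS)

print(sorted(search(INDEX, "$|")))   # exact word only
print(sorted(search(INDEX, "$|*")))  # words that start with $|
```

In this model document "c" is never returned for either query, because `foo$|bar` is indexed as one word that merely contains `$|`; only words that begin with the prefix match.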
:(
That's why grep would work better in some situations. But, as I mentioned before, grep can also be less effective in some cases.
But if you want one search to find both text and perl code then you run into trouble, because with text you want to remove punctuation to make searching work correctly. But those punctuation characters are also used in perl code, which you want to index.
that's true. but see below
Therefore we want most if not all chars to be indexed. Or at least $%@:-> (search for '$r->args' should be successful).
And we don't want to search for Apache AND Registry, nor Apache OR Registry when I ask for Apache::Registry. I think that's what most people will expect without knowing the internals of the search engine.
Apache::Registry is a bit easier than perl code, because you can say that ":" is ok in the middle of a word, but not at the end. Then
"see Apache::Registry" -- is indexed as a single word, but
"rules for using foo:" - "foo" is indexed without ":"
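The "ok in the middle, but not at the end" rule can be illustrated with a small Python sketch. It is loosely modelled on the wordcharacters / begincharacters / endcharacters idea from the config above, but it is not swish-e code: the `tokenize` function and the three character sets are assumptions chosen for the example.

```python
import string

LETTERS = set(string.ascii_lowercase)
WORD_CHARS = LETTERS | {":"}   # ':' may appear inside a word
BEGIN_CHARS = LETTERS          # but a word may not start with ':'
END_CHARS = LETTERS            # nor end with ':'

def trim(word):
    """Strip leading/trailing characters that may not begin/end a word."""
    while word and word[0] not in BEGIN_CHARS:
        word = word[1:]
    while word and word[-1] not in END_CHARS:
        word = word[:-1]
    return word

def tokenize(text):
    """Split on non-word characters, then trim each candidate word."""
    words, current = [], []
    for ch in text.lower():
        if ch in WORD_CHARS:
            current.append(ch)
        else:
            word = trim("".join(current))
            if word:
                words.append(word)
            current = []
    word = trim("".join(current))
    if word:
        words.append(word)
    return words

print(tokenize("see Apache::Registry"))
print(tokenize("rules for using foo:"))
```

With these sets, `Apache::Registry` survives as a single word while the trailing colon on `foo:` is dropped.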
But perl code is not that simple.
Now, the advantage of NOT including ":" in words is that you can then search for "registry" and find places where it's "Apache::Registry" (because that's indexed as two words), and you can still use a phrase search and find only places where "registry" follows right after "apache", which would typically find Apache::Registry. That's more flexible.
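A rough Python sketch of that trade-off, assuming a simple positional index (word -> document -> list of positions). The names `build_positional_index`, `word_search` and `phrase_search` and the sample documents are hypothetical, and this is not how swish-e stores its data; it only shows why splitting on ":" still lets a phrase query find Apache::Registry.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9_]+")  # ':' is NOT a word character here

def build_positional_index(docs):
    """word -> {doc_id -> [positions of the word in that document]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(WORD_RE.findall(text.lower())):
            index[word][doc_id].append(pos)
    return index

def word_search(index, word):
    """Documents containing the single word."""
    return set(index.get(word.lower(), {}))

def phrase_search(index, phrase):
    """Documents where the phrase's words occur at consecutive positions."""
    words = WORD_RE.findall(phrase.lower())
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical documents.
DOCS = {
    1: "Run scripts under Apache::Registry",
    2: "The registry lists every Apache module",
    3: "Windows registry tips",
}
INDEX = build_positional_index(DOCS)

print(sorted(word_search(INDEX, "registry")))           # all three documents
print(sorted(phrase_search(INDEX, "apache registry")))  # only document 1
```

Because the query is tokenized the same way as the documents, searching for the phrase `Apache::Registry` gives the same result as `apache registry` in this sketch.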
The problem is teaching people how to search. Nobody would expect to
search for (with quotes) "apache::registry". I'm trying to modify swish
so that ranking is adjusted for how close words are together, so a
multi-word search (such as [apache registry]) would rank the phrases
highest.
OK, how about this solution. We want the search to be user-friendly without RTFM.
So we index everything twice: once with English text in mind, second time with Perl code in mind.
When the user does a search we search both indices and then merge the results while discarding dups.
Is that doable? If it is, will that solve our problem?
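The merge step itself could look something like the following Python sketch. It assumes each index returns a list of (path, rank) pairs, and it keeps the highest rank when the same file shows up in both lists. The function name `merge_results` and the file names are made up for illustration, and ranks coming from two differently built indexes may not be directly comparable, so the ordering here is only a first approximation.

```python
def merge_results(*result_lists):
    """Merge (path, rank) lists, keeping the best rank for each path."""
    best = {}
    for results in result_lists:
        for path, rank in results:
            if path not in best or rank > best[path]:
                best[path] = rank
    # Highest rank first; ties are broken by path so the order is stable.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical hits from the "English text" index and the "Perl code" index.
text_hits = [("guide/config.html", 1000), ("guide/intro.html", 400)]
code_hits = [("guide/perl.html", 1000), ("guide/config.html", 650)]

for path, rank in merge_results(text_hits, code_hits):
    print(rank, path)
```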
_____________________________________________________________________
Stas Bekman   JAm_pH -- Just Another mod_perl Hacker
http://stason.org/   mod_perl Guide http://perl.apache.org/guide
mailto:[EMAIL PROTECTED] http://ticketmaster.com http://apacheweek.com
http://singlesheaven.com http://perl.apache.org http://perlmonth.com/
