Actually, I'll rest my case. I think I caused more problems by that remark, because it's a little too much to implement. Searching is still great, and turns out good results 75% of the time.
At 09:45 26.03.2002, Bill Moseley wrote:
At 02:39 PM 03/26/02 +0800, Stas Bekman wrote:
>Per Einar Ellefsen wrote:
>> What I can suggest: as we generate our HTML from POD files, knowing what
>> is code, could there maybe be some possibility of putting some <div>
>> tags around the <pre> ones, and then patch Swish in some way to get it
>> to treat those parts as searchable but not displayable? If I understood
>> it right, it's already using some <div> tags to know what to index, so
>> maybe it would be possible to make it a little more advanced?
You mean don't display the context since it doesn't look nice in the summary? I think I'd rather have it show the context even if it is ugly.
The highlighting code is designed to just show the first X words or X characters (depending on which highlight module is used) if no matching context is found to display, but I'd still rather see the word hits, if possible.
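To make that fallback concrete, here is a minimal sketch in Python. It is not the code of either highlight module; the function name `summarize` and its `max_words` and `context` parameters are made up for illustration. It shows a few words around the first hit, and falls back to the first X words when there is no hit to show.

```python
import re

def summarize(text, query_words, max_words=8, context=3):
    """Return words around the first hit, else the first max_words words.

    Hypothetical helper for illustration only, not swish's highlight code.
    """
    words = text.split()
    wanted = {w.lower() for w in query_words}
    for i, word in enumerate(words):
        # Strip leading/trailing punctuation before comparing.
        if re.sub(r"^\W+|\W+$", "", word).lower() in wanted:
            start = max(0, i - context)
            end = min(len(words), i + context + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No matching context found: show the first max_words words instead.
    return " ".join(words[:max_words])
```

The real modules can also cut by characters instead of words; the sketch only covers the word-based case.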
Search google for: ["the guide" AND "light registry"]
It does basically the same thing.
>I don't think this is possible, since the hit doesn't happen in the
>sentence but an index which points to the section which includes this
>sentence.
Right. It isn't grep.
Also, if you start trying to preserve HTML then it becomes a bit more tricky and slow to do the phrase highlighting, since a phrase match in swish might match across HTML formatting. For example imagine highlighting the *phrase* that matches the last word of one link, and the first word of a link that follows. Matching "foo bar":
<a href="first">bla bla foo</a> <a href="second">bar bla bla</a>
ends up as:
<a href="first">bla bla <span class="mark">foo</span></a><span class="mark"> </span><a href="second"><span class="mark">bar</span> bla bla</a>
And it gets even harder when that first link might have looked like:
<a href="first">bla <em>bla fo</em>o</a>
Not very likely, but you can see why you then need HTML::TreeBuilder to do that kind of rewriting of the HTML. The phrase highlighting code is messy enough working with just plain text. So if you start highlighting code in 50K or 100K documents, where you first need to build an HTML tree, then parsing speed becomes quite noticeable.
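For illustration, here is a minimal sketch of the cross-tag case in Python, using the standard library's html.parser rather than HTML::TreeBuilder. It is not the swish highlighting code, and the names `_Tokens` and `highlight_phrase` are made up. It assumes an exact, case-sensitive phrase match, and it drops comments and doctype declarations. The idea is to match the phrase against the concatenated text nodes, then wrap the matching part of each text node separately so the nesting stays valid.

```python
from html import escape
from html.parser import HTMLParser

class _Tokens(HTMLParser):
    """Split HTML into ("tag", raw) and ("text", decoded_text) tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", self.get_starttag_text()))
    def handle_startendtag(self, tag, attrs):
        self.tokens.append(("tag", self.get_starttag_text()))
    def handle_endtag(self, tag):
        self.tokens.append(("tag", "</%s>" % tag))
    def handle_data(self, data):
        self.tokens.append(("text", data))

def highlight_phrase(html, phrase, css_class="mark"):
    """Wrap the first occurrence of phrase in <span>s, one per text node."""
    parser = _Tokens()
    parser.feed(html)
    parser.close()
    # Match against text only, so the phrase can span tag boundaries.
    plain = "".join(raw for kind, raw in parser.tokens if kind == "text")
    start = plain.find(phrase)
    if start < 0:
        return html
    end = start + len(phrase)
    out, pos = [], 0
    for kind, raw in parser.tokens:
        if kind == "tag":
            out.append(raw)
            continue
        # Portion of this text node that overlaps the match, if any.
        lo = max(start, pos) - pos
        hi = min(end, pos + len(raw)) - pos
        if lo < hi:
            out.append(escape(raw[:lo], quote=False))
            out.append('<span class="%s">%s</span>'
                       % (css_class, escape(raw[lo:hi], quote=False)))
            out.append(escape(raw[hi:], quote=False))
        else:
            out.append(escape(raw, quote=False))
        pos += len(raw)
    return "".join(out)
```

Run on the two-link example with the phrase "foo bar", this produces the same three-span markup shown above. Even this toy version has to tokenize the whole document before it can highlight anything, which is where the cost comes from on large pages.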
Currently, even without parsing the HTML, all the slowness in returning results you see is coming from the highlighting code (well, that I see on my LAN, as you may have a slow connection). Turn off highlighting and the results are returned very fast.
I'm sure Google is much smarter about highlighting than swish (considering swish doesn't do highlighting). But google doesn't try too hard either. Here's how it highlighted the phrase "light set" which included html formatting:
<i><B style="color:black;background-color:#99ff99">light</i> </B> <B style="color:black;background-color:#99ff99">set</B>
A little nesting problem there, I think.
>I've another suggestion: is it possible to distinguish between sentences
>(or parts of) when presenting the hit's context? If so we could add
><br>'s after each sentence/part of and therefore make it more readable.
>I know you said that \n are removed, but if there is a way to keep the
>original strings as tokens in the index, this will improve the
>readability a lot.
I could probably substitute \n inside <pre> sections with %0A or something like that in the swish parser code. Alternatively, the code that splits the documents into sections could replace \n inside <pre> with a set of chars that won't be indexed but can then be used as a flag to show where the \n were (and thus be replaced with <br>). But I think that's overkill.
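Roughly, the placeholder idea would look like the following Python sketch. This is not swish code: the function names are invented, and the marker character (a private-use code point) is an arbitrary assumption standing in for "a set of chars that won't be indexed". The regex-based <pre> detection is also a simplification.

```python
import re

# Hypothetical marker: a private-use character that \s does not match.
MARK = "\ue000"

def protect_pre_newlines(html):
    """Replace newlines inside <pre>...</pre> with the marker."""
    return re.sub(r"(?is)<pre\b[^>]*>.*?</pre>",
                  lambda m: m.group(0).replace("\n", MARK),
                  html)

def collapse_whitespace(text):
    """Same effect as s/\\s+/ /g: runs of whitespace become one space."""
    return re.sub(r"\s+", " ", text)

def restore_breaks(summary):
    """Turn surviving markers into <br> when displaying the summary."""
    return summary.replace(MARK, "<br>")
```

The marker survives the whitespace collapsing because it is not whitespace, so only the line breaks that were inside <pre> come back as <br>.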
Look at http://hank.org:5000/search/swish.cgi?query=registry
I'll put back in the \n below -- swish does a s/\s+/ /g in some cases (when joining text together), so swish would need to be modified to keep whitespace, too, inside of <pre> sections.
... the light set we are going to use the registry.pl script running under Apache::Registry:
benchmarks/registry.pl
----------------------
use strict;
print "Content-type: text/plain ... pl file:
use Apache::RegistryLoader ();
Apache::RegistryLoader->new->handler(
"/perl/benchmarks/registry.pl",
"/home/httpd/perl/benchmarks/registry .pl");
To create the heavy benchmark set let ... results:
------------------------------
name           | avtime rps
------------------------------
light handler  |     15 911
light registry |     21 680
------------------------------
heavy handler  |    183  81
heavy registry |    191  77
------------------------------
Let's look at the results ... comparison:
------------------------------
name           | avtime rps
------------------------------
light handler  |     50 196
light registry |    160  61
------------------------------
heavy handler  |    149  67
Even if it were formatted correctly (without swish doing s/\s+/ /g), you end up with a much longer summary for little gain in readability.
The idea is not that the search results show the page correctly, rather that it just shows some content to help you decide if you should follow the link in the search result.
If I'm missing what you are suggesting, send what you think
http://hank.org:5000/search/swish.cgi?query=registry
should look like.
-- Bill Moseley mailto:[EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
-- Per Einar Ellefsen [EMAIL PROTECTED]
