Re: [htdig3-dev] Architecture Overview: htsearch parsing revisited

Andrew Scherpbier Thu, 30 Mar 2000 08:19:14 -0800
If a redesign of htsearch is in order, I think you should think about
broadening the functionality a little at the same time.

One of the things I'd like to see added to htsearch is query/result caching. 
This can help in two areas:
1)  Queries that return multiple pages only have to be performed once
2)  Queries that are used a lot can be made faster.
   a)  The whole query and its results can be cached
   b)  Intermediate query results can be cached

I believe this can still be done with a single query/process setup.  The
tricky part will be the limiting and cleanup of the cache.

(more comments below...)

Geoff Hutchison wrote:
> 
> Well, I've put this off for a while, especially as I try to wrap up
> some of the loose ends for 3.2.0b2. But since I can't connect to the
> PPP server (the line's busy), I have an excuse to write offline. :-)
> 
> Let's take a look at the code in htsearch. To be perfectly honest,
> the division of labor isn't exactly fair. Most of it is done in three
> files: htsearch.cc, parser.cc and Display.cc. Since the first isn't a
> class, it should give up some of its code and maybe we can split the
> others into a few additional classes. Please forgive me, the article
> is going to be less of an overview of the current code and more of a
> loose proposal of where htsearch should (?) go.
> 
> For the time being, we'll stick to the idea of one query per process.
> To extend the current model of three methods (and/all, or/any,
> boolean), I agree with a suggestion and propose a fourth: exact (i.e.
> the whole query is treated as a phrase search). Now however the query
> is formatted, we will need to do some transforming and parsing. I
> think the current process of turning everything into a boolean
> expression is the right way. So I propose instead of the "" operators
> used now (which I consider a kludge that I wrote) we add a "near"
> operator with a parameter defining how close the words must be to
> match.
> 
> Some examples:
> Method: And, Query: For score and seven => ((For & score) & and) & seven
> Method: Or, Query: "Geoff Hutchison" code => (Geoff near[1] Hutchison) or code
> Method: Exact, Query: search engine codebase =>
>                  (search near[1] engine) near[1] codebase
> Method: Boolean, Query: (Gilles near Geoff) and ht://Dig => [same]
> 
> I'll also suggest that the transformation allow a unary not
> operator, which is rewritten as follows:
> not Microsoft => * not Microsoft
> 
> So in our parsed, transformed expression, we'll have a few
> "operators," namely (, ), &, |, !, *, and ~ (the last three being
> NOT, ALL, and NEAR respectively).
> 
> Now each word token at a minimum also needs to keep track of its
> "fuzzy factor," and if necessary the "field" to search. (We can treat
> this simply as a mask for the flags.)
> 
> What I've just described is IMHO some requirements for a Parser
> class--it transforms the query into an expression tree. Then a

Suggestion:  A parent ParseTree class with a derived class for each of the
search modes:
BooleanParseTree, OrParseTree, AndParseTree, etc.
Objects should tend to represent data, not functionality...

        ParseTree       *parseTree = new AndParseTree(query);
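
As a rough sketch of that hierarchy (this is not the current htsearch code):
the version below assumes the tree can be stood in for by its transformed
expression string, and that the derived classes differ only in the operator
they join words with.  ExactParseTree, splitWords(), joinWith() and
expression() are names made up for illustration; only "near[1]" and the
sample queries come from Geoff's message.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a query on whitespace.  Real tokenizing
// would also have to deal with quotes, field prefixes, and so on.
static std::vector<std::string> splitWords(const std::string &query)
{
    std::istringstream in(query);
    std::vector<std::string> words;
    for (std::string w; in >> w; )
        words.push_back(w);
    return words;
}

// Parent class: owns the transformed boolean expression.
class ParseTree
{
public:
    virtual ~ParseTree() {}
    const std::string &expression() const { return expr; }

protected:
    // Left-fold the words with one operator: ((a op b) op c) op d
    void joinWith(const std::vector<std::string> &words, const std::string &op)
    {
        assert(!op.empty());
        expr.clear();
        for (std::size_t i = 0; i < words.size(); ++i) {
            if (i == 0) { expr = words[i]; continue; }
            if (i > 1)
                expr = "(" + expr + ")";   // parenthesize the earlier part
            expr += " " + op + " " + words[i];
        }
    }

    std::string expr;
};

// One derived class per search method; each only picks its operator.
class AndParseTree : public ParseTree
{
public:
    explicit AndParseTree(const std::string &q) { joinWith(splitWords(q), "&"); }
};

class OrParseTree : public ParseTree
{
public:
    explicit OrParseTree(const std::string &q) { joinWith(splitWords(q), "|"); }
};

class ExactParseTree : public ParseTree
{
public:
    explicit ExactParseTree(const std::string &q) { joinWith(splitWords(q), "near[1]"); }
};
```

With this, AndParseTree("For score and seven").expression() gives
"((For & score) & and) & seven", which matches the first example in the
proposal.  A BooleanParseTree would need a real expression parser and is
left out here.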

> Searcher class would walk the tree, iterating over a Collection to
> return a ResultList. IMHO, the scoring, sorting and "winnowing"
> (removing deleted documents, those not matching restrict or exclude
> clauses, etc.) should be done inside this ResultList class, with some
> of the work like initial scoring being done by the Searcher class.

One of the things that makes htsearch such a mess as it stands right now is
the whole fuzzy search mechanism.  There's gotta be a better way to deal with
all of that.
Since the fuzzy algorithms can *add* new terms to a boolean search, the fuzzy
step needs to come after the parsing and before the searching.
Since fuzzy is an algorithm to be applied to the parse tree, it should
probably be incorporated into the ParseTree class.

Actual searching is also something that is applied to the parse tree, so it
should probably be incorporated into the ParseTree class as well.  Its return
value should be a Results object.

        Results *searchResults = parseTree->search();
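
To make the ordering concrete (parse, then fuzzy, then search), here is a
minimal sketch.  It assumes a tree of AND/OR nodes with word leaves, an
in-memory map standing in for the word database, and a fuzzy algorithm
modelled as a function that returns extra variants for a word.  Node, leaf(),
branch(), applyFuzzy() and the free function search() are all hypothetical
names, written as free functions only to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::set<int> DocSet;                      // matching document ids
typedef std::map<std::string, DocSet> WordIndex;   // stand-in for the word db
typedef std::function<std::vector<std::string>(const std::string &)> Fuzzy;

struct Node
{
    char op;                                   // 'w' = word, '&' = AND, '|' = OR
    std::string word;                          // only used when op == 'w'
    std::vector<std::unique_ptr<Node>> kids;   // only used for '&' and '|'
};

std::unique_ptr<Node> leaf(const std::string &w)
{
    std::unique_ptr<Node> n(new Node());
    n->op = 'w';
    n->word = w;
    return n;
}

std::unique_ptr<Node> branch(char op, std::unique_ptr<Node> a, std::unique_ptr<Node> b)
{
    std::unique_ptr<Node> n(new Node());
    n->op = op;
    n->kids.push_back(std::move(a));
    n->kids.push_back(std::move(b));
    return n;
}

// Fuzzy step, run after parsing and before searching: every word leaf
// with variants becomes (word | variant1 | variant2 ...).
void applyFuzzy(Node &n, const Fuzzy &fuzzy)
{
    if (n.op != 'w') {
        for (auto &k : n.kids)
            applyFuzzy(*k, fuzzy);
        return;
    }
    std::vector<std::string> variants = fuzzy(n.word);
    if (variants.empty())
        return;
    n.kids.push_back(leaf(n.word));
    for (const std::string &v : variants)
        n.kids.push_back(leaf(v));
    n.op = '|';
    n.word.clear();
}

// Search step: walk the tree, combining the children's results pairwise.
DocSet search(const Node &n, const WordIndex &index)
{
    if (n.op == 'w') {
        WordIndex::const_iterator it = index.find(n.word);
        return it == index.end() ? DocSet() : it->second;
    }
    assert(!n.kids.empty());
    DocSet result = search(*n.kids[0], index);
    for (std::size_t i = 1; i < n.kids.size(); ++i) {
        DocSet next = search(*n.kids[i], index);
        DocSet merged;
        if (n.op == '&')
            std::set_intersection(result.begin(), result.end(), next.begin(),
                                  next.end(), std::inserter(merged, merged.begin()));
        else
            std::set_union(result.begin(), result.end(), next.begin(),
                           next.end(), std::inserter(merged, merged.begin()));
        result.swap(merged);
    }
    return result;
}
```

Because applyFuzzy() rewrites the tree in place, search() never needs to know
that fuzzy matching exists, which is the separation argued for above.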

> This would leave the work of filling in the templates themselves to
> the Display class, which seems fitting. That's not to say there isn't
> a lot of work to be done for this, what with highlighting, finding
> anchors, SGML and URL encoding, etc.

The Results object should be the one that generates the output, with the help
of something like an OutputRepresentation object.
The default OutputRepresentation class would have the functionality that the
current Display class has (output using templates).  Other derived classes
can then be used to do other things with the results (e.g., sending output
through PHP3 so that it can be parsed there).
The Results object should also be in charge of paging the results:

        // Generate output for page 1 using 20 results per page
        searchResults->output(representation, 20, 1);
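
A minimal sketch of how that call could be split between the two objects.
Everything beyond the names Results, OutputRepresentation and output() is an
assumption: the begin()/match() callbacks, the TextRepresentation stand-in
for the template-driven default, and results stored as a plain list of URLs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical interface: decides how a page of matches is written out.
class OutputRepresentation
{
public:
    virtual ~OutputRepresentation() {}
    virtual void begin(std::size_t page, std::size_t pageCount) = 0;
    virtual void match(const std::string &url) = 0;
};

// Plain-text stand-in for the default, template-based representation.
class TextRepresentation : public OutputRepresentation
{
public:
    void begin(std::size_t page, std::size_t pageCount)
    {
        out << "Page " << page << " of " << pageCount << "\n";
    }
    void match(const std::string &url) { out << url << "\n"; }
    std::string text() const { return out.str(); }

private:
    std::ostringstream out;
};

class Results
{
public:
    explicit Results(const std::vector<std::string> &matches) : urls(matches) {}

    // Pages are numbered from 1, as in output(representation, 20, 1).
    // Paging lives here; formatting is left to the representation.
    void output(OutputRepresentation &rep, std::size_t perPage, std::size_t page) const
    {
        assert(perPage > 0 && page > 0);
        std::size_t pageCount = (urls.size() + perPage - 1) / perPage;
        std::size_t first = std::min((page - 1) * perPage, urls.size());
        std::size_t last = std::min(first + perPage, urls.size());
        rep.begin(page, pageCount);
        for (std::size_t i = first; i < last; ++i)
            rep.match(urls[i]);
    }

private:
    std::vector<std::string> urls;
};
```

A PHP3-oriented representation would be one more class derived from
OutputRepresentation; Results itself would not change.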

With this, the main() in htsearch would essentially be reduced to the sample
code I included.
All the work will be done by the appropriate objects.

> You'll note that I've been rather hazy on some important issues.
> IMHO, it's very tricky to walk the expression tree optimally. After
> all, you would rather not waste the memory on having the results of
> all the searches at one time. However, if you do the searches
> pairwise, you waste time relative to comparing the lists all at
> once. But that's something we can talk about later.

I think maybe intermediate result caching may help here.  If a database
lookup is performed on a specific term (after parsing), the results can be
cached.  If the same term is used in another query or even in the same query,
only a disk read of the results needs to be performed.  (It would even be
cooler if the results can actually be mmap'ed to the cache file, but
unfortunately, that has proven to be less portable...)
A cache class can be in charge of managing the cache size and do funky LRU
type stuff.
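
For the LRU part, something along these lines might do.  It is an in-memory
sketch only: it assumes a fixed limit on the number of cached terms and
ignores the on-disk cache file entirely.  TermCache, DocList, get() and put()
are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<int> DocList;   // document ids matching one term

class TermCache
{
    typedef std::list<std::pair<std::string, DocList> > Entries;
    typedef std::map<std::string, Entries::iterator> Index;

public:
    explicit TermCache(std::size_t maxEntries) : capacity(maxEntries)
    {
        assert(maxEntries > 0);
    }

    // Returns the cached list for a term, or 0 on a miss.
    // A hit moves the term to the most-recently-used position.
    const DocList *get(const std::string &term)
    {
        Index::iterator it = index.find(term);
        if (it == index.end())
            return 0;
        entries.splice(entries.begin(), entries, it->second);
        return &it->second->second;
    }

    // Stores a lookup result, evicting the least recently used term
    // when the cache is full.
    void put(const std::string &term, const DocList &docs)
    {
        Index::iterator it = index.find(term);
        if (it != index.end()) {
            it->second->second = docs;
            entries.splice(entries.begin(), entries, it->second);
            return;
        }
        if (entries.size() == capacity) {
            index.erase(entries.back().first);
            entries.pop_back();
        }
        entries.push_front(std::make_pair(term, docs));
        index[term] = entries.begin();
    }

    std::size_t size() const { return entries.size(); }

private:
    std::size_t capacity;
    Entries entries;   // front = most recently used
    Index index;       // term -> position in entries
};
```

The search code would call get() before going to the database and put()
after a miss.  Limiting by total bytes rather than by number of terms is
probably closer to what is wanted, but the bookkeeping is the same.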

> -Geoff

Mind you, I haven't looked at the code in htsearch for years...  (I no longer
have nightmares about it, so that's a good thing.  The drugs are keeping me
sane... :-))
I probably missed some details, but I think the outline will work.

-- 
Andrew Scherpbier <[EMAIL PROTECTED]>
Contigo Software <http://www.contigo.com/>
