On Mon, Nov 07, 2011 at 10:36:23AM +0200, goran kent wrote:
> #---------incision start----------
> my $response;
> my $cached_object_id = md5sum($buf); # TODO: check if $buf is the search string
>
> if (is_cached($cached_object_id)) {
> $response = read_cached_object($cached_object_id);
> }
> else {
> $response = $dispatch{$method}->( $self, thaw($buf) );
> }
> #---------incision end----------
$buf is never the search string. It's this, from SearchClient.pm:
    my $serialized = nfreeze($args);
    my $packed_len = pack( 'N', length($serialized) );
    print $sock "$method\n$packed_len$serialized";
$method is the name of the method to invoke on the SearchServer's local
Searcher object. $args is a Perl hashref containing the arguments to pass to
that method, which is subsequently serialized using Storable's nfreeze()
function and becomes the scalar $serialized.
When the method is "top_docs", then $args will contain a Lucy::Search::Query
*object*. The query string has already been parsed at this point, back in the
SearchClient; at no time does the raw query string ever get sent over the wire
to the SearchServer.
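For reference, the server side has to reverse that framing before it can dispatch. Here's a sketch (decode_request() is a hypothetical helper; the real SearchServer reads from the socket incrementally rather than decoding one complete buffer):

```perl
use strict;
use warnings;
use Storable qw( nfreeze thaw );

# Decode one framed request of the form "$method\n$packed_len$serialized".
sub decode_request {
    my ($wire) = @_;
    my ( $method, $rest ) = split /\n/, $wire, 2;    # method name ends at the first newline
    my $len = unpack( 'N', substr( $rest, 0, 4 ) );  # 4-byte big-endian payload length
    my $buf = substr( $rest, 4, $len );              # the nfreeze()d $args hashref
    return ( $method, thaw($buf) );
}
```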
However, Query objects have a to_string() method you may be able to make use
of:
    if ( $method eq 'top_docs' ) {
        my $args = thaw($buf);
        my $key  = $args->{query}->to_string;
        if ( is_cached($key) ) {
            $response = read_cached_object($key);
        }
        else {
            $response = $dispatch{$method}->( $self, $args );
        }
    }
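The is_cached() and read_cached_object() helpers above are placeholders, not Lucy API. A minimal in-process version might look like this (illustrative only; a real deployment would likely use a shared cache such as memcached, and must also populate the cache after a miss):

```perl
use strict;
use warnings;

# Hypothetical cache helpers backed by a plain in-memory hash.
my %cache;

sub is_cached           { my ($key) = @_; return exists $cache{$key}; }
sub read_cached_object  { my ($key) = @_; return $cache{$key}; }
sub write_cached_object { my ( $key, $response ) = @_; $cache{$key} = $response; }
```

Note that a top_docs() response also depends on arguments such as num_wanted and offset, so you would probably want to fold those into the key alongside $args->{query}->to_string.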
> I seem to recall though that the typical search is not an atomic
> transaction: ie, the remote search protocol is broken up into
> discrete request/response chunks:
Correct.
> my $hits = $poly_searcher->hits(
> query => $parsed_query,
> sort_spec => $sort_spec,
> offset => 0, # or 10, 20, etc
> num_wanted => 10,
> );
>
>
> is processed roughly as:
>
> doc_max/response
> doc_freq/response x 31
> ...
> top_docs/response
> fetch_doc/response x 10
> ...
> done
>
> So, my question is basically: which parts do I cache and what's the
> best way to identify those parts?
The only individual call that it could conceivably make sense to cache is
top_docs().
All those calls to doc_freq() are part of the weighting process. The
behavior is not ideal, but changing it is a bit of a can of worms and
server-side caching won't help, as the calls are all fast lookups.
> I have a feeling I'm going to have
> to package a group of request/responses to cache it in its
> entirety,... or something. --or maybe this is not feasible within
> the given framework.
I can't think of a way to bundle things up without significant refactoring of
how Lucy's searching works or rearchitecting of SearchClient.
I understand why you want to do this: it allows you to invalidate chunks of the
cache piecemeal as individual nodes move forwards, rather than invalidate the
whole cache whenever any one of the nodes changes. Hopefully caching
top_docs() alone will help.
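One cheap way to approximate piecemeal invalidation (a hypothetical scheme, not anything Lucy provides): fold a per-node index version into each cache key, and bump that version whenever the node reindexes. Stale entries are simply never hit again and can be expired lazily:

```perl
use strict;
use warnings;

# Hypothetical: build a cache key that goes stale automatically when
# $index_version is bumped for a node, leaving other nodes' entries valid.
sub cache_key {
    my ( $node_id, $index_version, $query_string ) = @_;
    return join '|', $node_id, $index_version, $query_string;
}
```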
> I essentially need a better understanding of the client/server
> interaction process so I can formulate an approach to achieve
> remote-end caching of search queries (in Perl of course, since that's
> what's being used here).
Understanding how Query objects are compiled to Matcher objects would help, so
maybe check out Lucy::Docs::Cookbook::CustomQuery. Those doc_freq calls
happen during the weighting stage, and are used to power IDF.
http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/Cookbook/CustomQuery.html
http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/IRTheory.html#TF-IDF-ranking-algorithm
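For a sense of why those doc_freq() calls matter: during weighting, each term's doc_freq feeds an inverse document frequency weight, so rarer terms contribute more to the score. A textbook formulation (not necessarily Lucy's exact Similarity math) looks like this:

```perl
use strict;
use warnings;

# Classic IDF: rare terms (low doc_freq) get a larger weight than
# common ones.  $doc_max is the total number of documents.
sub idf {
    my ( $doc_freq, $doc_max ) = @_;
    return 1 + log( $doc_max / ( $doc_freq + 1 ) );
}
```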
Marvin Humphrey