This task reminds me more of a count(*) sql query than a text search query.
Assuming that using a text search engine is a pre requisite, I can think of
two approaches - basing on Lucene scoring as suggested in the question, or
a more simple approach (below).
For the scoring approach - I don't see an easy way to get the counts from
the score of the results, although the TF (term frequency in candidate
docs) is known+used during document scoring, and although it seems that the
application can be arranged such that TF of search result documents would
be the required count.
But perhaps a more straight forward solution can do - adding a Lucene
document for each star-movie pair. This would also allow easy update when a
new movie arrives: just add a document for each "star" in that movie. A
document can have these fields:
StarFirstName - stored, untokenized
StarLastName - stored, untokenized
MovieName - stored, tokenized
MovieType - stored, untokenized - this is the pre-computed type
mentioned below
MovieProps - unstored, tokenized - the word "horror" can appear in this
field, avoiding a pre-computation step.
Now a single search can do all the work:
+StarLastName:A* +MovieProps:horror
Sorting results by StarLastName would group all results of same "star" and
also allow to count them for each star.
This would create more documents in the index - #stars * |#movies per
star| - so there may be performance considerations, depending on the volume
of the data...
Regards,
Doron
"Russell M. Allen" <[EMAIL PROTECTED]> wrote on 27/07/2006 09:02:46:
> I am curious about the potential use of document scoring as a means to
> extract additional data from an index. Specifically, I would like the
> score to be a count of how many times a particular field matched a set
> of terms.
>
> For example, I am indexing movie-stars (Each document is a movie-star).
> A movie-star has a number of fields, such as name, movies they have been
> in, etc. I want to produce an 'index' of stars by name and show how
> many movies, which match a filter, that they have appeared in.
>
> In natural language my query might be:
> "List all stars who have appeared in a 'horror' movie, where
> last name starts with A, and tell me how many horror movies they were
> in."
>
> My search will look something like this:
> "+lastName:A* +movie:(1 7 21 58 92)" //where movie is a
> previously computed list of 'horror' movie ids
>
> If my index contained the following documents:
> doc1 = lastName:Anna movie:{3 10}
> doc2 = lastName:Aba movie:{1 10 12}
> doc3 = lastName:Addd movie:{3 21 55 92}
> doc4 = lastName:Baaa movie:{7 56}
>
> I would like to get back:
> doc2, score of 1 //score of 1 because only movie 1 matched
> doc3, score of 2 //score of 2 because movies 21 and 92 matched
>
>
>
> Currently, we perform an initial query against our Star index to
> retrieve a list of stars. Then we perform N queries against a separate
> movie index to count the number of movies that match our sub filter
> 'horror'. This is obviously very inefficient, and as I've shown above,
> the information (count) is available during the primary query.
>
> Thoughts?
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]