Chris,

You could also store term vectors for all docs at indexing
time, and add the term vectors for the matching docs into a
(large) map of terms in RAM.
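
A rough sketch of what I mean, using only plain JDK collections so it
runs without an index. The class name, the countTerms helper and the
in-memory "termVectors" map are made up for illustration; in a real
Lucene 2.9 setup the terms for each matching doc would, I believe, come
from IndexReader.getTermFreqVector(doc, field) and its getTerms() array,
and the matching doc ids would be gathered in your HitCollector. It
counts, for each term, how many matching docs contain it:

```java
import java.util.HashMap;
import java.util.Map;

public class TermCountSketch {

    // termVectors is a hypothetical stand-in for stored term vectors:
    // doc id -> the distinct terms of one field in that doc.
    static Map<String, Integer> countTerms(Map<Integer, String[]> termVectors,
                                           int[] matchingDocs) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int doc : matchingDocs) {
            String[] terms = termVectors.get(doc);
            if (terms == null) {
                continue; // no term vector stored for this doc
            }
            for (String term : terms) {
                // one increment per (matching doc, term) pair
                Integer old = counts.get(term);
                counts.put(term, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, String[]> termVectors = new HashMap<Integer, String[]>();
        termVectors.put(0, new String[] {"lucene", "search"});
        termVectors.put(1, new String[] {"index"});
        termVectors.put(2, new String[] {"lucene", "facet"});

        // Pretend docs 0 and 2 matched the user query.
        System.out.println(countTerms(termVectors, new int[] {0, 2}));
    }
}
```

The work is one map update per term of every matching doc, so the cost
grows with the number of hits times the average number of terms per doc,
and the map grows with the number of distinct terms in the result.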

Regards,
Paul Elschot


On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> Hi Jake,
> 
> Thanks for your helpful explanation.
> In fact, my initial solution was to traverse each document in the result
> once and count the contained terms. As you mentioned, this process took a
> lot of memory.
> Trying to confine the memory usage with the facet approach, I was surprised
> by the decline in performance.
> Now I know it's nothing abnormal, at least.
> 
> Chris
> 
> 
> 2009/10/12 Jake Mannix <jake.man...@gmail.com>
> 
> > Hey Chris,
> >
> > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > christoph.bo...@googlemail.com> wrote:
> >
> > > Thanks for your reply.
> > > Yes, it's likely that many terms occur in few documents.
> > >
> > > If I understand you right, I should do the following:
> > > -Write a HitCollector that simply increments a counter
> > > -Get the filter for the user query once: new CachingWrapperFilter(new
> > > QueryWrapperFilter(userQuery));
> > > -Create a TermQuery for each term
> > > -Perform the search and read the counter of the HitCollector
> > >
> > > I did that, but it didn't get faster. Any ideas why?
> > >
> >
> > The killer is the "TermQuery for each term" part - this is huge. You need
> > to invert this process, and use your query as is, but while walking in the
> > HitCollector, on each doc which matches your query, increment counters for
> > each of the terms in that document (which means you need an in-memory
> > forward lookup for your documents, like a multivalued FieldCache - and if
> > you've got roughly the same number of terms as documents, this cache is
> > likely to be as large as your entire index - a pretty hefty RAM cost).
> >
> > But a good thing to keep in mind is that doing this kind of faceting
> > (massively multivalued on a huge term-set) requires a lot of computation,
> > even if you have all the proper structures living in memory:
> >
> > For each document you look at (which matches your query), you need to look
> > at all of the terms in that document, and increment a counter for that
> > term.  So however much time it would normally take for you to do the
> > driving query, it can take as much as that multiplied by the average number
> > of terms in a document in your index.  If your documents are big, this
> > could be a pretty huge latency penalty.
> >
> >  -jake
> >
> 
