Hi Jake,

Thanks for your helpful explanation. In fact, my initial solution was to traverse each matching document once and count the terms it contained. As you mentioned, this used a lot of memory. When I tried to limit memory usage with the facet approach, I was surprised by the drop in performance. At least now I know it's nothing abnormal.
Chris

2009/10/12 Jake Mannix <jake.man...@gmail.com>

> Hey Chris,
>
> On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> christoph.bo...@googlemail.com> wrote:
>
> > Thanks for your reply.
> > Yes, it's likely that many terms occur in few documents.
> >
> > If I understand you right, I should do the following:
> > - Write a HitCollector that simply increments a counter
> > - Get the filter for the user query once: new CachingWrapperFilter(new
> >   QueryWrapperFilter(userQuery));
> > - Create a TermQuery for each term
> > - Perform the search and read the counter of the HitCollector
> >
> > I did that, but it didn't get faster. Any ideas why?
>
> The killer is the "TermQuery for each term" part - this is huge. You need
> to invert this process: use your query as is, but while walking in the
> HitCollector, on each doc which matches your query, increment counters for
> each of the terms in that document. (This means you need an in-memory
> forward lookup for your documents, like a multivalued FieldCache - and if
> you've got roughly the same number of terms as documents, this cache is
> likely to be as large as your entire index - a pretty hefty RAM cost.)
>
> But a good thing to keep in mind is that doing this kind of faceting
> (massively multivalued, on a huge term set) requires a lot of computation,
> even if you have all the proper structures living in memory:
>
> For each document you look at (which matches your query), you need to look
> at all of the terms in that document and increment a counter for that term.
> So however much time it would normally take you to do the driving query, it
> can take as much as that multiplied by the average number of terms per
> document in your index. If your documents are big, this could be a pretty
> huge latency penalty.
>
> -jake
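For anyone following along, here is a minimal, Lucene-free sketch of the inverted counting approach Jake describes: run the driving query once, and for each matching doc ID, walk that document's terms via an in-memory forward index and bump a per-term counter. The forward index, doc IDs, and terms below are made up for illustration; in a real Lucene setup the forward lookup would come from something like a multivalued FieldCache, and the matching doc IDs from your HitCollector.

```java
import java.util.*;

public class TermCounter {

    // Count how often each term occurs across the documents that matched
    // the driving query, using an in-memory forward index (docId -> terms).
    // This is one counter increment per (matching doc, term in doc) pair,
    // which is where the "query time x avg terms per doc" cost comes from.
    public static Map<String, Integer> countTerms(List<List<String>> forwardIndex,
                                                  int[] matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : matchingDocs) {                  // one pass over the hits
            for (String term : forwardIndex.get(docId)) { // all terms in that doc
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical forward index: doc 0..2, each with its terms.
        List<List<String>> fwd = Arrays.asList(
            Arrays.asList("lucene", "search"),
            Arrays.asList("solr", "facet"),
            Arrays.asList("lucene", "index"));

        // Suppose the driving query matched docs 0 and 2.
        Map<String, Integer> counts = countTerms(fwd, new int[] {0, 2});
        System.out.println(counts); // {index=1, lucene=2, search=1}
    }
}
```

Note this does no per-term searching at all: the index is consulted once for the driving query, and everything after that is in-memory counting, which is exactly why it avoids the "TermQuery per term" blowup at the price of holding the forward lookup in RAM.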