Re: how to select top categories.

Paul Elschot Thu, 26 Jan 2006 00:28:09 -0800

On Wednesday 25 January 2006 22:24, Chris Hostetter wrote:
> 
> : for this site, but would you cache all manufacturers and intersect all with
> : the initial query in one page load? Seems like that would be a lot.
> 
> Yep it is a lot, but if you've got the RAM, it's not that time intensive.
> At CNET, depending on what page you are looking at, i'm doing
> anywhere from 100-1000 intersections.
> 
> : So you're saying that in a single page load I might be able to do one
> : initial query, and intersect thousands of bitsets in under a second with an
> : index of around 5 million documents, assuming that the server/pc is decent
> 
> I can't predict exactly what your performance will be based on your
> index size -- and i certainly can't make any promises about it being under
> a second total time ... but you should try it, i think it's very feasible.
> 
> : speed and enough memory? I still would have to cache thousands of 625k
> : bits.. Could I do this with files instead of RAM maybe?
> 
> i played with serializing BitSets to disk, it can be done .. but reading
> the cached files has a definite added cost.  if you really don't have the
> ram to store all of the bitsets in memory then you're going to want to at
> least use an in memory cache for some fixed number of commonly used
> categories ... whether that cache falls back to reading from disk on miss,
> or falls back to executing the category specific filter/query again
> depends on your goals.
>
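
The fixed-size in-memory cache with a fallback on miss, as described in the
quoted text above, could be sketched roughly as below. CategoryBitSetCache
and its loader are hypothetical names, not Lucene API; the loader stands in
for either reading a serialized BitSet from disk or re-running the category
filter, whichever fits your goals.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Lucene: keeps at most maxEntries
// category BitSets in RAM and evicts the least recently used one.
class CategoryBitSetCache {
    private final Function<String, BitSet> loader;
    private final LinkedHashMap<String, BitSet> cache;

    CategoryBitSetCache(final int maxEntries, Function<String, BitSet> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached BitSet, calling the loader on a miss. */
    synchronized BitSet get(String category) {
        BitSet bits = cache.get(category);
        if (bits == null) {
            bits = loader.apply(category); // disk read or filter re-execution
            cache.put(category, bits);
        }
        return bits;
    }

    synchronized int cachedCount() {
        return cache.size();
    }
}
```

Which fallback the loader uses only changes the cost of a miss; the cache
itself does not need to know.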


These more compact filters might be useful here:
http://issues.apache.org/jira/browse/LUCENE-328

They would probably take about two bytes per filtered document in this
case, both in RAM and on disk. Perhaps one won't even need the disk,
but adding some code to read/write such filters from a stream is easy.
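
To illustrate how a filter can get down to a byte or two per document and
still be simple to read/write from a stream, here is a minimal sketch. It
assumes the document numbers are stored as variable-length gaps between
successive set bits; that encoding and the CompactDocList name are my
assumptions for illustration, not necessarily what the patch in the issue
does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical helper, not part of Lucene.
class CompactDocList {

    /** Writes the doc count, then the gap from each doc to the previous one. */
    static void write(BitSet docs, OutputStream out) throws IOException {
        writeVInt(out, docs.cardinality());
        int prev = 0;
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            writeVInt(out, doc - prev);
            prev = doc;
        }
    }

    /** Reads back what write() produced. */
    static BitSet read(InputStream in) throws IOException {
        int count = readVInt(in);
        BitSet docs = new BitSet();
        int doc = 0;
        for (int i = 0; i < count; i++) {
            doc += readVInt(in);
            docs.set(doc);
        }
        return docs;
    }

    // 7 data bits per byte; the high bit marks "more bytes follow".
    static void writeVInt(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(InputStream in) throws IOException {
        int value = 0;
        for (int shift = 0; ; shift += 7) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated doc list");
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
        }
    }

    /** Convenience wrappers for in-memory use. */
    static byte[] toBytes(BitSet docs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            write(docs, buf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static BitSet fromBytes(byte[] bytes) {
        try {
            return read(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this encoding a gap below 128 takes one byte and a gap below 16384
takes two, so the size depends on how densely the category is spread over
the index.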

I'd still prefer to give a score to each category based on the best
(say 100) document scores for the initial query, even though that would
need some more coding to keep track of the category scores.
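
A rough sketch of that bookkeeping, assuming the top hits are already
available as parallel arrays of document numbers and scores, and that a
category's score is simply the sum of the scores of its documents among
those hits. The summing rule and the CategoryScorer name are assumptions
made for the example; other ways of combining the scores are possible.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class CategoryScorer {

    /**
     * topDocs[i] is the document number of the i-th best hit and
     * topScores[i] its score. Categories without any top hit are omitted.
     */
    static Map<String, Float> score(int[] topDocs, float[] topScores,
                                    Map<String, BitSet> categories) {
        Map<String, Float> result = new HashMap<>();
        for (Map.Entry<String, BitSet> e : categories.entrySet()) {
            BitSet members = e.getValue();
            float sum = 0f;
            boolean hit = false;
            for (int i = 0; i < topDocs.length; i++) {
                if (members.get(topDocs[i])) {
                    sum += topScores[i];
                    hit = true;
                }
            }
            if (hit) {
                result.put(e.getKey(), sum);
            }
        }
        return result;
    }
}
```

The work per category is bounded by the number of top hits (say 100), not
by the size of the category.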

These compact filters should work reasonably well as long as the
categories don't contain too many documents. For larger categories
these filters have the disadvantage that they still need to access every
document number for doing an intersection with the category, but BitSets
have this problem too.
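
To make the cost concrete, here is a small sketch of both kinds of
intersection count. It assumes the compact filter has been decoded to a
sorted int[] of document numbers; IntersectionCount is a hypothetical name
used only for this example.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene.
class IntersectionCount {

    /** Compact form: one lookup per document number in the category. */
    static int count(int[] categoryDocs, BitSet queryHits) {
        int n = 0;
        for (int doc : categoryDocs) {
            if (queryHits.get(doc)) {
                n++;
            }
        }
        return n;
    }

    /** BitSet form: ANDs word by word over the category's whole bit range. */
    static int count(BitSet category, BitSet queryHits) {
        BitSet tmp = (BitSet) category.clone(); // keep the cached set intact
        tmp.and(queryHits);
        return tmp.cardinality();
    }
}
```

The first loop grows with the number of documents in the category, the
second with the highest document number set in it.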

Regards,
Paul Elschot

