On Sunday 20 May 2007 19:52, Peter Bloem wrote:

> Thanks for your reply. This is getting me much deeper into the uncharted
> territories of Lucene, especially the area of FieldCaches, but it's also
> piqued my curiosity. Most of what I've been able to find are discussions
> by people who are already using FieldCache, rather than explanations of
> what they actually are. From what I understand, a FieldCache caches
> certain values, and has methods that retrieve the information from the
> cache, or from a provided IndexReader if the cache doesn't have the
> requested value. My main question is where to get a FieldCache, and how
> to add things to it. The only publicly available one in the API seems
> to be FieldCache.DEFAULT, but you speak of multiple FieldCaches.
You can use that. I was not very precise; I meant using the IndexReaders
of the two indexes with this single cache implementation.

> I could of course borrow FieldCacheImpl by copying the file to my own
> package, but that would probably affect the license of my code (not
> that I care much about that, but it does feel like another can of
> worms).

I would not expect you to need to borrow from FieldCacheImpl.

> Do the scores for the collection id's get stored in FieldCache.DEFAULT
> by the searcher, or should I see to that myself?

You'll have to create another map to store score values. You may want to
use that map only for a single query.

> And what exactly does the String field parameter in the getters do? Is
> this a Lucene field, or simply a key with which to retrieve the cached
> values?

It is the name of the Lucene Field.

> I'm sorry to be asking this many questions. Normally I would dig into
> the source code and try to figure this out myself, but I have a
> deadline that is approaching at a rather frightening speed. And in any
> case, it doesn't hurt to have these issues explained somewhere on the
> internet, in case somebody finds himself in the same situation.

But it's the weekend, and I thought deadlines only happen in other
timeframes. :)

> All this business about FieldCaches has led me to think that I might be
> better off caching the collection scores for each query myself. The
> process would then look like this:
> * query the collection index with the user query, and calculate the
>   scores per collection (possibly using only the top n collections, if
>   I get too many)
> * store the collection scores in a (weak) HashMap<String, Float> (or
>   maybe a TreeMap) mapping collection id's (which are Strings) to the
>   collection scores (which are Floats)

So far so good.

> * retrieve all documents in all collections (and perhaps any documents
>   that fit the query by themselves, if I ignored any collections)

Ouch.
Why retrieve all docs when you only need the highest scoring ones? See
also below.

> * during the scoring process, score the document normally, retrieve the
>   collection id from the document, retrieve the collection score from
>   the hashmap, and add it to the original score (possibly multiplying
>   it by a scalar 0 < s < 1, to diminish the effect of the collection).
>   As far as I can tell, this returns the same score as it would when
>   the collection is just another field in the document (boosted by s).

Sounds correct, but if you have many matching docs, you may prefer using
a TopDocs, and for that you will have to hook your adding/multiplication
into the IndexSearcher method that returns a TopDocs. I've never done
that myself, so if there is no API, have a look at the code.

> If I understand you correctly, the FieldCache would take the place of
> the HashMap. Does this approach have any significant problems compared
> to using a FieldCache?

No idea. I don't know the internals of FieldCacheImpl.

> Another point I'm unclear about is where exactly to implement the last
> step. IndexSearcher seems to call Scorer.score(HitCollector) for the
> whole set of documents, which looks like it has a score() method for a
> single document/query combination. I guess I could extend Scorer to
> wrap around the regular scorer used, but this would also require me to
> extend Weight. I'm hoping there's an easier way to accomplish all this.

Have a look at the IndexSearcher code, as I suggested above.

> The two performance penalties here that I can see are retrieving all
> documents from all returned collections (as pointed out by Erick),
> since it requires a whole bunch of OR clauses (for collection id), and
> populating the hash map with the collection id's. The effect of both
> depends on the number of collections.
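For what it's worth, the smooth-while-collecting idea above can be sketched in plain Java, with no Lucene classes. Everything here is illustrative, not Lucene API: the id-per-doc array stands in for what a FieldCache lookup would give you, the score map comes from the first query on the collection index, and s is the damping factor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hedged sketch of a collector in the spirit of HitCollector.collect(doc,
// score): it smooths each document score with its collection's score and
// keeps only the top n results, like a TopDocs would.
class SmoothingCollector {
    private final String[] collectionIdByDoc;          // FieldCache stand-in
    private final Map<String, Float> collectionScores; // from the first query
    private final float s;                             // damping, 0 < s < 1
    private final int n;
    private final PriorityQueue<float[]> topN =        // entries: {score, docId}
        new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[0]));

    SmoothingCollector(String[] collectionIdByDoc,
                       Map<String, Float> collectionScores,
                       float s, int n) {
        this.collectionIdByDoc = collectionIdByDoc;
        this.collectionScores = collectionScores;
        this.s = s;
        this.n = n;
    }

    // Called once per matching document with its normal score.
    void collect(int doc, float score) {
        Float c = collectionScores.get(collectionIdByDoc[doc]);
        if (c != null) {
            score += s * c;  // add the diminished collection score
        }
        topN.add(new float[] { score, doc });
        if (topN.size() > n) {
            topN.poll();     // drop the current lowest score
        }
    }

    // The kept documents, best score first.
    List<float[]> top() {
        List<float[]> out = new ArrayList<>(topN);
        out.sort((a, b) -> Float.compare(b[0], a[0]));
        return out;
    }
}
```

In a real IndexSearcher hook you would do the same additions inside whatever collector the searcher calls; the min-heap is just one way to keep the best n without holding every hit.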
> Unfortunately, a closer look at the data tells me that the number of
> collections is around a hundred thousand. On the other hand, any
> reasonable query should return only as many collections as it would
> from a set of medium-sized documents. I guess the only way to find out
> how bad the performance will be is to implement it.

A FieldCache will retrieve the necessary field values only once, and
therefore you can avoid retrieving many documents yourself.

Regards,
Paul Elschot.

> regards,
> Peter
>
> Paul Elschot wrote:
> > On Sunday 20 May 2007 02:49, Peter Bloem wrote:
> >
> >> Ah, now we're getting somewhere. So I run the first query on the
> >> collection index, and get a set of collection id's from that. But
> >> how do I use them in the second query on the document index? It
> >> should be easy enough to retrieve all documents in the returned
> >> collections (which is what I'm after), but then I want to rank them
> >> as if they had the collection's term vector as a field. Is there
> >> some way to modify a document just prior to processing?
> >
> > One way is indeed to index the docs and collections separately.
> > The trick is to use FieldCaches for your collection id's.
> > The price is that these FieldCaches must be loaded initially.
> >
> > First query the collections, using a HitCollector to keep all their
> > scores by your collection id, using a FieldCache. By the time your
> > indexes get really large, you may want to keep only a maximum number
> > of the best scoring collections here.
> >
> > Then query the docs, and smooth their scores during the search using
> > the collection scores, with a FieldCache for the collection id per
> > doc. For this, have a look at the IndexSearcher code for how to hook
> > in your own smoothing HitCollector to return, for example, a TopDocs.
> >
> >> I have several thousand collections, but the number of collections
> >> matching a query should remain quite small.
> >> The collections contain about as much text as a small webpage, so
> >> the chance that one query matches huge numbers of collections is
> >> small. If this does become a problem, I could still store the
> >> document id's. My data won't change, so there's no danger of the
> >> document id's changing. The end result of the project has to look
> >> like a production system, but it doesn't have to be one. :)
> >
> > This does not sound like something that will run out of RAM for a
> > few FieldCaches.
> >
> >> I can see why using Lucene like a database is worrying. There's
> >> already the problem of referential integrity (what if you update
> >> document/collection id's), which databases do well, and Lucene
> >> doesn't do at all (as there doesn't seem to be a standard mechanism
> >> for this sort of thing).
> >
> > Relational databases also use caches for various relational keys.
> > With Lucene you just have to explicitly choose them and program
> > their use.
> >
> >> On the other hand, I don't think this technique is very new. I
> >> think it's a common smoothing method in XML element retrieval
> >> (smoothing an element with the contents of its ancestor elements).
> >> So surely this sort of thing gets done a lot. I guess there are
> >> bound to be some limits to the inverted index that require less
> >> pretty tricks like these.
> >
> > For small texts like link anchors it is easier to add them to each
> > page as a "relational attribute". A collection text looks more like
> > a sum text than like a relational attribute, so treating it as a
> > separate Lucene doc (a Lucene "entity") feels just about right.
> >
> > Regards,
> > Paul Elschot
> >
> >> regards,
> >> Peter
> >>
> >> Erick Erickson wrote:
> >>> You're right, your index will bloat considerably. In fact, I'm
> >>> surprised it's only a factor of 5....
> >>>
> >>> The only thing that comes to mind is really a variant on your
> >>> approach from your first e-mail.
> >>> But I wouldn't use document ids, because document IDs can change.
> >>> So using doc IDs is... er... fraught.
> >>>
> >>> So here's the variant. Go ahead and index your "collection
> >>> vector", but index it with a second field that is your "collection
> >>> ID". Then, add that collection ID to each document in your original
> >>> index. So, you have something like
> >>>
> >>> a: text:{look, a, cat} collectionID:32
> >>> b: text:{my, chimpanzee, is, hairy} collectionID:32
> >>> c: text:{dogs, are, playful} collectionID:32
> >>>
> >>> Your other index has
> >>> collectionID:32 collectionVector:{look, a, cat, my, chimpanzee,
> >>> is, hairy, dogs, are, playful}
> >>>
> >>> Now, you essentially make two queries: one to get a set of
> >>> collection IDs from your second index (that is, querying your terms
> >>> against collectionVector), and then using that set of collectionIDs
> >>> in a query against your first index.
> >>>
> >>> You might be able to do some interesting things with boosts to
> >>> score either query more to your liking.
> >>>
> >>> This will come close to doubling the size of your index, but your
> >>> first approach could bloat it by an arbitrary factor depending upon
> >>> how many documents were in your largest collection.....
> >>>
> >>> One thing to note, however, is that there is no need to have two
> >>> separate physical indexes. Lucene does not require that all
> >>> documents have the same fields. So this could all be in one big
> >>> happy index. As long as the fields are different in the two sets of
> >>> documents, the queries won't interfere with each other. In that
> >>> case, you'd have to name the "foreign key" field differently for
> >>> the sets of documents, say collectionID1 and collectionID2.
> >>>
> >>> All that said, this approach bothers me because it's mixing some
> >>> database ideas with a Lucene index.
> >>> I suppose in a controlled situation where you won't be trying to
> >>> do arbitrary joins, it's probably a misplaced unease. But I'm leery
> >>> of trying to make Lucene act like a database. But that may just be
> >>> a personal problem <G>
> >>>
> >>> The only other consideration is "how many collections do you
> >>> have?" The reason I ask is that in the worst-case scenario, you'll
> >>> have an OR clause for every collection ID you have. Lucene can
> >>> easily handle many thousands of terms in an OR, but your search
> >>> time will suffer. And you'll have to take special action (really,
> >>> just set MaxBooleanClauses) if this is over 1024, or you'll get a
> >>> TooManyClauses exception.
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On 5/19/07, Peter Bloem <[EMAIL PROTECTED]> wrote:
> >>>> I'm sorry, I should have explained the intended behavior more
> >>>> clearly.
> >>>>
> >>>> The basic idea (without the collection fields) is that there are
> >>>> very simple documents in the index, with one content field each.
> >>>> All I do with this index is a standard search in this text field.
> >>>> To improve the search results, I want to also add the
> >>>> concatenation of all documents in a collection as a field to every
> >>>> single document. I then search the index using both fields,
> >>>> diminishing the effect of the collection field. This should
> >>>> improve the search results.
> >>>>
> >>>> As an example, say I have the documents a: "look a cat",
> >>>> b: "my chimpanzee is hairy", c: "dogs are playful", and many
> >>>> others. These three documents are grouped into one collection
> >>>> (of many).
> >>>> The term vectors for the documents would then be
> >>>> a: {look, a, cat}
> >>>> b: {my, chimpanzee, is, hairy}
> >>>> c: {dogs, are, playful}
> >>>> If I create a term vector for the whole collection, {look, a, cat,
> >>>> my, chimpanzee, is, hairy, dogs, are, playful}, and add it to each
> >>>> of the documents as a separate field, the query "my hairy cat"
> >>>> scores well against document a because of the match on cat, but
> >>>> also because of the match on both cat and hairy in the collection
> >>>> field. Documents about the linux command 'cat' do not have the
> >>>> word "hairy" in their collection field (because they're part of a
> >>>> different collection), and so would not get this benefit. It's
> >>>> essentially a smoothing technique, since it allows query words
> >>>> that aren't in the document to still have some effect.
> >>>>
> >>>> The problem, of course, is that storing these collection term
> >>>> vectors for each document greatly increases the size of the index
> >>>> and the indexing time. It would be a lot faster if I could somehow
> >>>> use a second index to store the collections as documents, so I
> >>>> would only have to store one term vector per collection. (This
> >>>> isn't my own idea, btw; I'm trying to replicate the results from
> >>>> some other research that used this method.)
> >>>>
> >>>> I hope this is more clear,
> >>>> Peter
> >>>>
> >>>> Erick Erickson wrote:
> >>>>> This seems kind of kludgy, but that may just mean I don't
> >>>>> understand your problem very well.
> >>>>>
> >>>>> What is it that you're trying to accomplish? Searching
> >>>>> constrained by topic or groups?
> >>>>>
> >>>>> If you're trying to search by groups, search the archive for the
> >>>>> words "facet" or "faceted search".
> >>>>>
> >>>>> Otherwise, could you describe what behavior you're after and
> >>>>> maybe there'd be more ideas....
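As a side note, the "my hairy cat" effect Peter describes above can be reduced to a toy calculation, with plain term overlap standing in for Lucene's real scoring. The class name and the weight s are made up for illustration only.

```java
import java.util.Set;

// Toy model of the smoothing effect: a query term missing from the document
// itself can still contribute via the collection field, at reduced weight s.
class ToyScore {
    static float score(Set<String> query, Set<String> docTerms,
                       Set<String> collectionTerms, float s) {
        float total = 0f;
        for (String t : query) {
            if (docTerms.contains(t)) {
                total += 1f;  // direct match in the document
            } else if (collectionTerms.contains(t)) {
                total += s;   // smoothed match via the collection field
            }
        }
        return total;
    }
}
```

With s = 0.5, document a {look, a, cat} in the cat collection scores 1 + 0.5 + 0.5 = 2.0 for "my hairy cat" (cat directly, my and hairy via the collection), while a document about the linux command cat, whose collection contains neither "my" nor "hairy", scores only 1.0.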
> >>>>> Best
> >>>>> Erick
> >>>>>
> >>>>> On 5/19/07, Peter Bloem <[EMAIL PROTECTED]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have the following problem. I'm indexing documents that belong
> >>>>>> to some collection (i.e. the dataset is divided into
> >>>>>> collections, which are divided into documents). These documents
> >>>>>> become my Lucene documents, with some relatively small string
> >>>>>> that becomes the field I want to search. However, I would also
> >>>>>> like to add to document d the concatenation of all documents in
> >>>>>> d's collection as a field (mainly as a smoothing technique,
> >>>>>> because documents correspond roughly to topics). I'm currently
> >>>>>> doing just that, adding an extra field for the entire
> >>>>>> concatenated collection to each document in that collection. Of
> >>>>>> course this increases the index size and indexing time greatly
> >>>>>> (about five-fold).
> >>>>>>
> >>>>>> There must be a better way to do this. My idea was to create a
> >>>>>> second index where the collections are indexed as (Lucene)
> >>>>>> documents. This index would have the text as a field, and a list
> >>>>>> of document id's referring back to the main index. I could then
> >>>>>> retrieve the term vector for each collection from this second
> >>>>>> index for each search result from the original index.
> >>>>>>
> >>>>>> My question is if this is a smart approach. And if it is, which
> >>>>>> of Lucene's classes should I use for this. The best I could find
> >>>>>> was the FilterIndexReader. If extending the FilterIndexReader is
> >>>>>> really the best way to go, could I simply override the
> >>>>>> document(int, FieldSelector) method, or is there more to it?
> >>>>>> I doubt I'm the first person that's ever wanted a many-to-one
> >>>>>> relation between fields and documents, so I hope there's a
> >>>>>> simpler way about this.
> >>>>>>
> >>>>>> Thank you,
> >>>>>> Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]