On Sunday 20 May 2007 19:52, Peter Bloem wrote:

> Thanks for your reply. This is getting me much deeper into the uncharted
> territories of Lucene, especially the area of FieldCaches, but it's also
> piqued my curiosity. Most of what I've been able to find are discussions
> by people who are already using FieldCache, rather than explanations of
> what they actually are. From what I understand, a FieldCache caches
> certain values, and has methods that retrieve the information from the
> cache, or from a provided IndexReader if the cache doesn't have the
> requested value. My main question is where to get a FieldCache, and how
> to add things to it. The only publicly available one in the API seems
> to be FieldCache.DEFAULT, but you speak of multiple FieldCaches.
You can use that. I was not very precise; I meant using the IndexReaders
of the two indexes with this single cache implementation.

> I could of course borrow FieldCacheImpl by copying the file to my own
> package, but that would probably affect the license of my code (not
> that I care much about that, but it does feel like another can of
> worms).

I would not expect you to need to borrow from FieldCacheImpl.

> Do the scores for the collection id's get stored in FieldCache.DEFAULT
> by the searcher, or should I see to that myself?

You'll have to create another map to store score values. You may want to
use that map only for a single query.

> And what exactly does the String field parameter in the getters do? Is
> this a Lucene field, or simply a key with which to retrieve the cached
> values?

It is the name of the Lucene Field.

> I'm sorry to be asking this many questions. Normally I would dig into
> the source code and try to figure this out myself, but I have a
> deadline that is approaching at a rather frightening speed. And in any
> case, it doesn't hurt to have these issues explained somewhere on the
> internet, in case somebody finds himself in the same situation.

But it's the weekend, and I thought deadlines only happen in other
timeframes. :)

> All this business about FieldCaches has led me to think that I might be
> better off caching the collection scores for each query myself. The
> process would then look like this:
> * query the collection index with the user query, and calculate the
>   scores per collection (possibly using only the top n collections, if
>   I get too many)
> * store the collection scores in a (weak) HashMap<String, Float> (or
>   maybe a TreeMap) mapping collection id's (which are Strings) to the
>   collection scores (which are Floats)

So far so good.

> * retrieve all documents in all collections (and perhaps any documents
>   that fit the query by themselves, if I ignored any collections)

Ouch.
Why retrieve all docs when you only need the highest scoring ones? See
also below.

> * during the scoring process, score the document normally, retrieve the
>   collection id from the document, retrieve the collection score from
>   the hashmap, and add it to the original score (possibly multiplying
>   it by a scalar 0 < s < 1, to diminish the effect of the collection).
>   As far as I can tell, this returns the same score as it would when
>   the collection is just another field in the document (boosted by s).

Sounds correct, but if you have many matching docs, you may prefer using
a TopDocs, and for that you will have to hook your adding/multiplication
into the IndexSearcher method that returns a TopDocs. I've never done
that myself, so if there is no API, have a look at the code.

> If I understand you correctly, the FieldCache would take the place of
> the HashMap. Does this approach have any significant problems compared
> to using a FieldCache?

No idea. I don't know the internals of FieldCacheImpl.

> Another point I'm unclear about is where exactly to implement the last
> step. IndexSearcher seems to call Scorer.score(HitCollector) for the
> whole set of documents, which looks like it has a score() method for a
> single document/query combination. I guess I could extend Scorer to
> wrap around the regular scorer used, but this would also require me to
> extend Weight. I'm hoping there's an easier way to accomplish all this.

Have a look at the IndexSearcher code, as I suggested above.

> The two performance penalties here that I can see are retrieving all
> documents from all returned collections (as pointed out by Erick),
> since it requires a whole bunch of OR clauses (for collection id), and
> populating the hash map with the collection id's. The effect of both
> depends on the number of collections.
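For what it's worth, the smooth-while-collecting idea above can be sketched in plain Java, with no Lucene classes. Everything here is illustrative, not Lucene API: the id-per-doc array stands in for what a FieldCache lookup would give you, the score map comes from the first query on the collection index, and s is the damping factor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hedged sketch of a collector in the spirit of HitCollector.collect(doc,
// score): it smooths each document score with its collection's score and
// keeps only the top n results, like a TopDocs would.
class SmoothingCollector {
    private final String[] collectionIdByDoc;          // FieldCache stand-in
    private final Map<String, Float> collectionScores; // from the first query
    private final float s;                             // damping, 0 < s < 1
    private final int n;
    private final PriorityQueue<float[]> topN =        // entries: {score, docId}
        new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[0]));

    SmoothingCollector(String[] collectionIdByDoc,
                       Map<String, Float> collectionScores,
                       float s, int n) {
        this.collectionIdByDoc = collectionIdByDoc;
        this.collectionScores = collectionScores;
        this.s = s;
        this.n = n;
    }

    // Called once per matching document with its normal score.
    void collect(int doc, float score) {
        Float c = collectionScores.get(collectionIdByDoc[doc]);
        if (c != null) {
            score += s * c;  // add the diminished collection score
        }
        topN.add(new float[] { score, doc });
        if (topN.size() > n) {
            topN.poll();     // drop the current lowest score
        }
    }

    // The kept documents, best score first.
    List<float[]> top() {
        List<float[]> out = new ArrayList<>(topN);
        out.sort((a, b) -> Float.compare(b[0], a[0]));
        return out;
    }
}
```

In a real IndexSearcher hook you would do the same additions inside whatever collector the searcher calls; the min-heap is just one way to keep the best n without holding every hit.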
> Unfortunately, a closer look at the data tells me that the number of
> collections is around a hundred thousand. On the other hand, any
> reasonable query should return only as many collections as it would
> from a set of medium-sized documents. I guess the only way to find out
> how bad the performance will be is to implement it.

A FieldCache will retrieve the necessary field values only once, and
therefore you can avoid retrieving many documents yourself.

Regards,
Paul Elschot.

> regards,
> Peter
>
> Paul Elschot wrote:
> > On Sunday 20 May 2007 02:49, Peter Bloem wrote:
> >
> >> Ah, now we're getting somewhere. So I run the first query on the
> >> collection index, and get a set of collection id's from that. But
> >> how do I use them in the second query on the document index? It
> >> should be easy enough to retrieve all documents in the returned
> >> collections (which is what I'm after), but then I want to rank them
> >> as if they had the collection's term vector as a field. Is there
> >> some way to modify a document just prior to processing?
> >
> > One way is indeed to index the docs and collections separately.
> > The trick is to use FieldCaches for your collection id's.
> > The price is that these FieldCaches must be loaded initially.
> >
> > First query the collections, using a HitCollector to keep all their
> > scores by your collection id, using a FieldCache. By the time your
> > indexes get really large, you may want to keep only a maximum number
> > of the best scoring collections here.
> >
> > Then query the docs, and smooth their scores during the search using
> > the collection scores, with a FieldCache for the collection id per
> > doc. For this, have a look at the IndexSearcher code for how to hook
> > in your own smoothing HitCollector to return, for example, a TopDocs.
> >
> >> I have several thousand collections, but the number of collections
> >> matching a query should remain quite small.
> >> The collections contain about as much text as a small webpage, so
> >> the chance that one query matches huge numbers of collections is
> >> small. If this does become a problem, I could still store the
> >> document id's. My data won't change, so there's no danger of the
> >> document id's changing. The end result of the project has to look
> >> like a production system, but it doesn't have to be one. :)
> >
> > This does not sound like something that will run out of RAM for a
> > few FieldCaches.
> >
> >> I can see why using Lucene like a database is worrying. There's
> >> already the problem of referential integrity (what if you update
> >> document/collection id's), which databases do well, and Lucene
> >> doesn't do at all (as there doesn't seem to be a standard mechanism
> >> for this sort of thing).
> >
> > Relational databases also use caches for various relational keys.
> > With Lucene you just have to explicitly choose them and program
> > their use.
> >
> >> On the other hand, I don't think this technique is very new. I
> >> think it's a common smoothing method in XML element retrieval
> >> (smoothing an element with the contents of its ancestor elements).
> >> So surely this sort of thing gets done a lot. I guess there are
> >> bound to be some limits to the inverted index that require less
> >> pretty tricks like these.
> >
> > For small texts like link anchors it is easier to add them to each
> > page as a "relational attribute". A collection text looks more like
> > a sum text than like a relational attribute, so treating it as a
> > separate Lucene doc (a Lucene "entity") feels just about right.
> >
> > Regards,
> > Paul Elschot
> >
> >> regards,
> >> Peter
> >>
> >> Erick Erickson wrote:
> >>> You're right, your index will bloat considerably. In fact, I'm
> >>> surprised it's only a factor of 5....
> >>>
> >>> The only thing that comes to mind is really a variant on your
> >>> approach from your first e-mail.
> >>> But I wouldn't use document ids, because document IDs can change.
> >>> So using doc IDs is... er... fraught.
> >>>
> >>> So here's the variant. Go ahead and index your "collection
> >>> vector", but index it with a second field that is your "collection
> >>> ID". Then, add that collection ID to each document in your original
> >>> index. So, you have something like
> >>>
> >>> a: text:{look, a, cat} collectionID:32
> >>> b: text:{my, chimpanzee, is, hairy} collectionID:32
> >>> c: text:{dogs, are, playful} collectionID:32
> >>>
> >>> Your other index has
> >>> collectionID:32 collectionVector:{look, a, cat, my, chimpanzee,
> >>> is, hairy, dogs, are, playful}
> >>>
> >>> Now, you essentially make two queries: one to get a set of
> >>> collection IDs from your second index (that is, querying your terms
> >>> against collectionVector), and then using that set of collectionIDs
> >>> in a query against your first index.
> >>>
> >>> You might be able to do some interesting things with boosts to
> >>> score either query more to your liking.
> >>>
> >>> This will come close to doubling the size of your index, but your
> >>> first approach could bloat it by an arbitrary factor depending upon
> >>> how many documents were in your largest collection.....
> >>>
> >>> One thing to note, however, is that there is no need to have two
> >>> separate physical indexes. Lucene does not require that all
> >>> documents have the same fields. So this could all be in one big
> >>> happy index. As long as the fields are different in the two sets of
> >>> documents, the queries won't interfere with each other. In that
> >>> case, you'd have to name the "foreign key" field differently for
> >>> the sets of documents, say collectionID1 and collectionID2.
> >>>
> >>> All that said, this approach bothers me because it's mixing some
> >>> database ideas with a Lucene index.
> >>> I suppose in a controlled situation where you won't be trying to
> >>> do arbitrary joins, it's probably a misplaced unease. But I'm leery
> >>> of trying to make Lucene act like a database. But that may just be
> >>> a personal problem <G>
> >>>
> >>> The only other consideration is "how many collections do you
> >>> have?" The reason I ask is that in the worst-case scenario, you'll
> >>> have an OR clause for every collection ID you have. Lucene can
> >>> easily handle many thousands of terms in an OR, but your search
> >>> time will suffer. And you'll have to take special action (really,
> >>> just set MaxBooleanClauses) if this is over 1024, or you'll get a
> >>> TooManyClauses exception.
> >>>
> >>> Best
> >>> Erick
> >>>
> >>> On 5/19/07, Peter Bloem <[EMAIL PROTECTED]> wrote:
> >>>> I'm sorry, I should have explained the intended behavior more
> >>>> clearly.
> >>>>
> >>>> The basic idea (without the collection fields) is that there are
> >>>> very simple documents in the index, with one content field each.
> >>>> All I do with this index is a standard search in this text field.
> >>>> To improve the search results, I want to also add the
> >>>> concatenation of all documents in a collection as a field to every
> >>>> single document. I then search the index using both fields,
> >>>> diminishing the effect of the collection field. This should
> >>>> improve the search results.
> >>>>
> >>>> As an example, say I have the documents a: "look a cat",
> >>>> b: "my chimpanzee is hairy", c: "dogs are playful", and many
> >>>> others. These three documents are grouped into one collection
> >>>> (of many).
> >>>> The term vectors for the documents would then be
> >>>> a: {look, a, cat}
> >>>> b: {my, chimpanzee, is, hairy}
> >>>> c: {dogs, are, playful}
> >>>> If I create a term vector for the whole collection, {look, a, cat,
> >>>> my, chimpanzee, is, hairy, dogs, are, playful}, and add it to each
> >>>> of the documents as a separate field, the query "my hairy cat"
> >>>> scores well against document a because of the match on cat, but
> >>>> also because of the match on both cat and hairy in the collection
> >>>> field. Documents about the linux command 'cat' do not have the
> >>>> word "hairy" in their collection field (because they're part of a
> >>>> different collection), and so would not get this benefit. It's
> >>>> essentially a smoothing technique, since it allows query words
> >>>> that aren't in the document to still have some effect.
> >>>>
> >>>> The problem, of course, is that storing these collection term
> >>>> vectors for each document greatly increases the size of the index
> >>>> and the indexing time. It would be a lot faster if I could somehow
> >>>> use a second index to store the collections as documents, so I
> >>>> would only have to store one term vector per collection. (This
> >>>> isn't my own idea, btw; I'm trying to replicate the results from
> >>>> some other research that used this method.)
> >>>>
> >>>> I hope this is more clear,
> >>>> Peter
> >>>>
> >>>> Erick Erickson wrote:
> >>>>> This seems kind of kludgy, but that may just mean I don't
> >>>>> understand your problem very well.
> >>>>>
> >>>>> What is it that you're trying to accomplish? Searching
> >>>>> constrained by topic or groups?
> >>>>>
> >>>>> If you're trying to search by groups, search the archive for the
> >>>>> words "facet" or "faceted search".
> >>>>>
> >>>>> Otherwise, could you describe what behavior you're after and
> >>>>> maybe there'd be more ideas....
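As a side note, the "my hairy cat" effect Peter describes above can be reduced to a toy calculation, with plain term overlap standing in for Lucene's real scoring. The class name and the weight s are made up for illustration only.

```java
import java.util.Set;

// Toy model of the smoothing effect: a query term missing from the document
// itself can still contribute via the collection field, at reduced weight s.
class ToyScore {
    static float score(Set<String> query, Set<String> docTerms,
                       Set<String> collectionTerms, float s) {
        float total = 0f;
        for (String t : query) {
            if (docTerms.contains(t)) {
                total += 1f;  // direct match in the document
            } else if (collectionTerms.contains(t)) {
                total += s;   // smoothed match via the collection field
            }
        }
        return total;
    }
}
```

With s = 0.5, document a {look, a, cat} in the cat collection scores 1 + 0.5 + 0.5 = 2.0 for "my hairy cat" (cat directly, my and hairy via the collection), while a document about the linux command cat, whose collection contains neither "my" nor "hairy", scores only 1.0.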
> >>>>> Best
> >>>>> Erick
> >>>>>
> >>>>> On 5/19/07, Peter Bloem <[EMAIL PROTECTED]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have the following problem. I'm indexing documents that belong
> >>>>>> to some collection (i.e. the dataset is divided into
> >>>>>> collections, which are divided into documents). These documents
> >>>>>> become my Lucene documents, with some relatively small string
> >>>>>> that becomes the field I want to search. However, I would also
> >>>>>> like to add to document d the concatenation of all documents in
> >>>>>> d's collection as a field (mainly as a smoothing technique,
> >>>>>> because documents correspond roughly to topics). I'm currently
> >>>>>> doing just that, adding an extra field for the entire
> >>>>>> concatenated collection to each document in that collection. Of
> >>>>>> course this increases the index size and indexing time greatly
> >>>>>> (about five-fold).
> >>>>>>
> >>>>>> There must be a better way to do this. My idea was to create a
> >>>>>> second index where the collections are indexed as (Lucene)
> >>>>>> documents. This index would have the text as a field, and a list
> >>>>>> of document id's referring back to the main index. I could then
> >>>>>> retrieve the term vector for each collection from this second
> >>>>>> index for each search result from the original index.
> >>>>>>
> >>>>>> My question is if this is a smart approach. And if it is, which
> >>>>>> of Lucene's classes should I use for this. The best I could find
> >>>>>> was the FilterIndexReader. If extending the FilterIndexReader is
> >>>>>> really the best way to go, could I simply override the
> >>>>>> document(int, FieldSelector) method, or is there more to it?
> >>>>>> I doubt I'm the first person that's ever wanted a many-to-one
> >>>>>> relation between fields and documents, so I hope there's a
> >>>>>> simpler way about this.
> >>>>>>
> >>>>>> Thank you,
> >>>>>> Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]