Thanks for the reply. I thought of using boosting, for example "((userId:14 AND title:have)^10 OR (title:have))" or "((userId:14^10 AND title:have) OR (title:have))" or something like that. However, there would still be duplicates (all 3 docs for "To Have and To Have Not" would be included whereas I would only want the one I own to be there). This also requires using the scoring for sorting so I can't apply other sorting (I would want to sort the results secondarily by title for example). I might be able to go this route, but it seems like some combination of custom filtering and sorting would work better.
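For concreteness, the target behavior (one doc per title, preferring the searcher's own copy, then owned-first ordering with title as the secondary sort) can be sketched in plain Java. This is not Lucene code and not a proposed implementation: BookDoc, OwnedFirstSearch, matches and search are hypothetical names, and the whole-word match only stands in for a real title query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for an indexed document: one (owner, title) pair. */
final class BookDoc {
    final int userId;
    final String title;

    BookDoc(int userId, String title) {
        this.userId = userId;
        this.title = title;
    }
}

final class OwnedFirstSearch {
    /** Naive substitute for a real title query: case-insensitive whole-word match. */
    static boolean matches(String title, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        for (String word : title.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Returns one doc per matching title, owned docs first, then by title. */
    static List<BookDoc> search(List<BookDoc> docs, int myUserId, String term) {
        // One surviving doc per title. Only titles that match the term are stored.
        Map<String, BookDoc> byTitle = new LinkedHashMap<>();
        for (BookDoc doc : docs) {
            if (!matches(doc.title, term)) {
                continue;
            }
            BookDoc kept = byTitle.get(doc.title);
            // Keep the first copy seen, unless a later copy belongs to me.
            if (kept == null || (kept.userId != myUserId && doc.userId == myUserId)) {
                byTitle.put(doc.title, doc);
            }
        }
        List<BookDoc> results = new ArrayList<>(byTitle.values());
        // Primary key: owned by me (false sorts before true). Secondary key: title.
        results.sort(Comparator
                .comparing((BookDoc d) -> d.userId != myUserId)
                .thenComparing(d -> d.title));
        return results;
    }
}
```

With the sample data quoted below, search(docs, 14, "have") would keep only user 14's copy of "To Have and To Have Not" and list it ahead of "Have a Little Faith". Because the text match runs before the duplicate elimination, the map holds only matching titles rather than every title in the index.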
I thought of somehow doing an empty query to fetch all docs, sorting them to put docs with the userId first, and then running a DuplicateFilter on title with KM_USE_FIRST_OCCURRENCE. This is the duplicate elimination behavior I want. Then do a text search on the remainder. But this seems very expensive.

Randy

On Jul 22, 2012, at 11:33 AM, Erick Erickson wrote:

> Hmmm, what about simply boosting very high on owner, and probably
> grouping on title?
>
> If you boosted on owner, you wouldn't even have to index the title
> separately for each user, your "owner" field could be multivalued and
> contain _all_ the owner IDs. In that case you wouldn't have to group
> at all......
>
> Best
> Erick
>
> On Sun, Jul 22, 2012 at 11:06 AM, Uncle <unclelongha...@gmail.com> wrote:
>> I also posted this to StackOverflow, apologies if you see this twice.
>>
>> I have a data set whereby documents are associated to a user id. Say that
>> the documents represent books, and each book can have one or more owners. I
>> am indexing the titles with Lucene. When searching, I want all results owned
>> by me to be sorted at the top of the results before results that are not
>> owned by me. So the data might look like:
>>
>> Owner ID  Book Title
>> --------  ----------
>> 13        To Have and To Have Not
>> 14        To Have and To Have Not
>> 19        To Have and To Have Not
>> 18        Have a Little Faith
>> 15        Snow Crash
>> 17        Snow Crash
>> 18        Cryptonomicon
>> 14        Of Mice And Men
>> 17        Flash Crash
>>
>> Say that my user id is 14 and I search on "have", I want to match on both
>> "To Have and To Have Not" and "Have a Little Faith", but "To Have and To
>> Have Not" should show up higher in my search results, because I own it.
>> Similarly, if I am user id 15 and search for "Crash", I will match both
>> "Snow Crash" and "Flash Crash", but "Snow Crash" should show up first
>> because I own it. If I am user id 14 and I search for "crash", I would
>> still get a match for "Snow Crash" even though I don't own it. If I did a
>> fuzzy match for "a" which would match almost all of these titles, I would
>> see those that I own before I see the others.
>>
>> I am a little stuck on whether this is a query, filter, custom sort, or some
>> combination, and how to get the best performance. For example, if I could
>> write a filter that eliminates all duplicate titles, giving preference to
>> those owned by me, I could then just perform a search on the remainder
>> (assuming that filters are applied before searches). Then, a custom sort
>> based on whether or not I own the doc would be straightforward.
>>
>> But I am not sure how to implement the filter. It is not a simple
>> DuplicateFilter because it operates on two fields. It is similar to the
>> security filter example in section 5.6.7 of Lucene in Action, except that I
>> still want to be able to see documents that I don't own, if I don't own a
>> book with the same title. The custom filter in section 6.4 is also close,
>> but my problem is more complex because it depends on two fields.
>>
>> While iterating over the documents, the filter would have to remember which
>> titles have been seen, and then keep the ones that I own. For example if it
>> iterated over the values above in order, it would see the title "To Have and
>> To Have Not", not owned by me; and then see the same title again, owned by
>> me, and have to know that it should drop the first doc and keep the second.
>> I can't think of how to do this without using a lot of memory, essentially
>> keeping all titles in memory while iterating, which seems very expensive. It
>> isn't a simple "match" function because whether or not I match depends on
>> the other documents in the set.
>>
>> Thanks much for any guidance or info.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org