Re: Matching on "owned" docs -- filter or query? Or sort?

Uncle Sun, 22 Jul 2012 10:33:37 -0700

Thanks for the reply.  I thought of using boosting, for example "((userId:14 
AND title:have)^10 OR (title:have))" or "((userId:14^10 AND title:have) OR 
(title:have))" or something like that.  However, there would still be 
duplicates (all 3 docs for "To Have and To Have Not" would be included, whereas 
I only want the one I own to be there).  This also means sorting by score, so 
I can't apply any other sort order (for example, I would want to sort the 
results secondarily by title).  I might be able to go this route, but it seems 
like some combination of custom filtering and sorting would work better.
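
As a rough illustration of the ordering described above (owned docs first, 
then by title), here is a minimal sketch in plain Java.  It does not use the 
Lucene API: Book, ownedFirst and sortHits are hypothetical names, and it 
assumes the matching docs are already loaded in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OwnedFirstSort {
    // Hypothetical stand-in for an indexed document: one (owner, title) pair.
    static final class Book {
        final int ownerId;
        final String title;

        Book(int ownerId, String title) {
            this.ownerId = ownerId;
            this.title = title;
        }
    }

    // Primary key: owned by userId (false sorts before true, so owned docs
    // come first). Secondary key: title, ignoring case.
    static Comparator<Book> ownedFirst(int userId) {
        return Comparator
            .comparing((Book b) -> b.ownerId != userId)
            .thenComparing((Book b) -> b.title, String.CASE_INSENSITIVE_ORDER);
    }

    // Returns a sorted copy and leaves the input list untouched.
    static List<Book> sortHits(List<Book> hits, int userId) {
        List<Book> sorted = new ArrayList<>(hits);
        sorted.sort(ownedFirst(userId));
        return sorted;
    }
}
```

This only covers the ordering; it does nothing about the duplicate titles.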


I thought of somehow doing an empty query to fetch all docs, sorting them to 
put docs with the userId first, and then running a DuplicateFilter on title 
with KM_USE_FIRST_OCCURRENCE.  This is the duplicate elimination behavior I 
want.  Then do a text search on the remainder.  But this seems very expensive.
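
The duplicate elimination described above could be sketched like this in 
plain Java.  Again this is not Lucene code: Book and dedupe are hypothetical 
names, and the docs are assumed to be in memory.  It keeps one map entry per 
distinct title, which is the memory cost mentioned in the original post.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OwnedDedup {
    // Hypothetical stand-in for an indexed document: one (owner, title) pair.
    static final class Book {
        final int ownerId;
        final String title;

        Book(int ownerId, String title) {
            this.ownerId = ownerId;
            this.title = title;
        }
    }

    // Keeps one Book per title. The copy owned by userId wins if there is
    // one; otherwise the first copy seen is kept.
    static List<Book> dedupe(List<Book> docs, int userId) {
        // LinkedHashMap preserves the order in which titles were first seen.
        Map<String, Book> byTitle = new LinkedHashMap<>();
        for (Book doc : docs) {
            Book kept = byTitle.get(doc.title);
            boolean replaceWithOwned =
                kept != null && doc.ownerId == userId && kept.ownerId != userId;
            if (kept == null || replaceWithOwned) {
                byTitle.put(doc.title, doc);
            }
        }
        return new ArrayList<>(byTitle.values());
    }
}
```

Unlike the sort-then-DuplicateFilter idea, this needs no up-front sort, 
because an owned copy replaces an unowned one whenever it turns up.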

Randy

On Jul 22, 2012, at 11:33 AM, Erick Erickson wrote:

> Hmmm, what about simply boosting very high on owner, and probably
> grouping on title?
> 
> If you boosted on owner, you wouldn't even have to index the title
> separately for each user, your "owner" field could be multivalued and
> contain _all_ the owner IDs. In that case you wouldn't have to group
> at all......
> 
> Best
> Erick
> 
> On Sun, Jul 22, 2012 at 11:06 AM, Uncle <unclelongha...@gmail.com> wrote:
>> I also posted this to StackOverflow, apologies if you see this twice.
>> 
>> I have a data set in which documents are associated with a user id. Say that 
>> the documents represent books, and each book can have one or more owners. I 
>> am indexing the titles with Lucene. When searching, I want all results owned 
>> by me to be sorted at the top of the results before results that are not 
>> owned by me. So the data might look like:
>> 
>> Owner ID             Book Title
>> --------             ----------
>> 13                   To Have and To Have Not
>> 14                   To Have and To Have Not
>> 19                   To Have and To Have Not
>> 18                   Have a Little Faith
>> 15                   Snow Crash
>> 17                   Snow Crash
>> 18                   Cryptonomicon
>> 14                   Of Mice And Men
>> 17                   Flash Crash
>> 
>> Say that my user id is 14 and I search on "have", I want to match on both 
>> "To Have and To Have Not" and "Have a Little Faith", but "To Have and To 
>> Have Not" should show up higher in my search results, because I own it.  
>> Similarly, if I am user id 15 and search for "Crash", I will match both 
>> "Snow Crash" and "Flash Crash", but "Snow Crash" should show up first 
>> because I own it.  If I am user id 14 and I search for "crash", I would 
>> still get a match for "Snow Crash" even though I don't own it.  If I did a 
>> fuzzy match for "a" which would match almost all of these titles, I would 
>> see those that I own before I see the others.
>> 
>> I am a little stuck on whether this is a query, filter, custom sort, or some 
>> combination, and how to get the best performance.  For example, if I could 
>> write a filter that eliminates all duplicate titles, giving preference to 
>> those owned by me, I could then just perform a search on the remainder 
>> (assuming that filters are applied before searches). Then, a custom sort 
>> based on whether or not I own the doc would be straightforward.
>> 
>> But I am not sure how to implement the filter. It is not a simple 
>> DuplicateFilter because it operates on two fields. It is similar to the 
>> security filter example in section 5.6.7 of Lucene in Action, except that I 
>> still want to be able to see documents that I don't own, if I don't own a 
>> book with the same title. The custom filter in section 6.4 is also close, 
>> but my problem is more complex because it depends on two fields.
>> 
>> While iterating over the documents, the filter would have to remember which 
>> titles have been seen, and then keep the ones that I own. For example if it 
>> iterated over the values above in order, it would see the title "To Have and 
>> To Have Not", not owned by me; and then see the same title again, owned by 
>> me, and have to know that it should drop the first doc and keep the second. 
>> I can't think of how to do this without using a lot of memory, essentially 
>> keeping all titles in memory while iterating, which seems very expensive. It 
>> isn't a simple "match" function because whether or not I match depends on 
>> the other documents in the set.
>> 
>> Thanks much for any guidance or info.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 

