[jira] Updated: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)

Hoss Man (JIRA) Thu, 30 Apr 2009 16:43:55 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hoss Man updated LUCENE-1494:
-----------------------------

    Attachment: LUCENE-1494-masking.patch

some things looked like they wouldn't work with the masking patch, so i wrote 
some test cases to convince myself they were broken (and because new code 
should always have test cases).  In particular i was worried about the lack of 
equals/hashCode methods, and the broken rewrite method.

one interesting thing I discovered was that the code worked in many cases even 
though rewrite was constantly just returning the masked inner query -- digging 
into it i realized the reason was because none of the other SpanQuery classes 
pay any attention to what their nested clauses return when they recursively 
rewrite, so a SpanNearQuery, whose constructor freaks out if the fields of all 
the clauses don't match, happily generates spans if one of those clauses 
returns a completely different SpanQuery on rewrite.

I also removed the proxying of getBoost and setBoost ... it was causing 
problems with some stock testing framework code that expects a 
q1.equals(q1.clone().setBoost(newBoost)) to be false (this was evaluating to 
true because it's a shallow clone and setBoost was proxying and modifying the 
original inner query's boost value) ... this means that FieldMaskingSpanQuery 
is consistent with how other SpanQueries deal with boost (they ignore the 
boosts of their nested clauses)
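
to illustrate the clone/boost problem outside of Lucene, here is a minimal 
sketch. The class names (InnerQuery, MaskingQuery) and the proxyBoost flag are 
hypothetical stand-ins, not the classes in the patch; the only assumption is 
that clone() is shallow, so the clone shares the inner query with the original.

```java
class ShallowCloneBoostSketch {
    // Hypothetical stand-in for the wrapped (masked) query.
    static class InnerQuery {
        float boost = 1.0f;
    }

    // Hypothetical stand-in for a wrapper query; not the class from the patch.
    static class MaskingQuery implements Cloneable {
        final InnerQuery inner;
        final boolean proxyBoost; // true: forward boost calls to the inner query
        float ownBoost = 1.0f;

        MaskingQuery(InnerQuery inner, boolean proxyBoost) {
            this.inner = inner;
            this.proxyBoost = proxyBoost;
        }

        float getBoost() {
            return proxyBoost ? inner.boost : ownBoost;
        }

        void setBoost(float b) {
            if (proxyBoost) {
                inner.boost = b; // also changes every shallow clone's view
            } else {
                ownBoost = b;
            }
        }

        @Override
        public MaskingQuery clone() {
            try {
                // Shallow copy: the clone points at the same InnerQuery.
                return (MaskingQuery) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof MaskingQuery)) {
                return false;
            }
            MaskingQuery other = (MaskingQuery) o;
            return inner == other.inner && getBoost() == other.getBoost();
        }

        @Override
        public int hashCode() {
            return 31 * System.identityHashCode(inner) + Float.floatToIntBits(getBoost());
        }
    }

    // Mirrors the framework check: clone, change the clone's boost, compare.
    static boolean equalsAfterCloneAndSetBoost(boolean proxyBoost) {
        MaskingQuery q1 = new MaskingQuery(new InnerQuery(), proxyBoost);
        MaskingQuery q2 = q1.clone();
        q2.setBoost(2.0f);
        return q1.equals(q2);
    }
}
```

with proxying, setting the boost on the clone also changes what the original 
reports, so the two still compare equal; with a boost stored on the wrapper 
itself, they differ as the testing framework expects.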

new patch (with tests) attached ... i'd like to have some more tests before 
committing (spans is deep voodoo, we're doing funky stuff with spans, all the 
more reason to test thoroughly)

> Additional features for searching for value across multiple fields 
> (many-to-one style)
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1494-masking.patch, LUCENE-1494-masking.patch, 
> LUCENE-1494-multifield.patch, LUCENE-1494-positionincrement.patch
>
>
> This issue is to cover the changes required to do a search across multiple 
> fields with the same name in a fashion similar to a many-to-one database. 
> Below is my post on java-dev on the topic, which details the changes we need:
> ---
> We have an interesting situation where we are effectively indexing two 
> 'entities' in our system, which share a one-to-many relationship (imagine 
> 'User' and 'Delivery Address' for demonstration purposes). At the moment, we 
> index one Lucene Document per 'many' end, duplicating the 'one' end data, 
> like so:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     userid: 1
>     userfirstname: fred
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> (note: 2 Documents indexed for user 1). This is somewhat annoying for us, 
> because when we search in Lucene the results we want back (conceptually) are 
> at the 'user' level, so we have to collapse the results by distinct user id, 
> etc. etc (let alone that it blows out the size of our index enormously). So 
> why do we do it? It would make more sense to use multiple fields:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> But imagine the search "+addresscountry:au +addressphone:5678". We'd like 
> this to match ONLY Mary, but of course it matches Fred also because he 
> matches both those terms (just for different addresses).
> There are two aspects to the approach we've (more or less) got working but 
> I'd like to run them past the group and see if they're worth trying to get 
> them into Lucene proper (if so, I'll create a JIRA issue for them)
> 1) Use a modified SpanNearQuery. If we assume that country + phone will 
> always be one token, we can rely on the fact that the positions of 'au' and 
> '5678' in Fred's document will be different.
>    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
>    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
>    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
> the slop of 0 means that we'll only return those where the two terms are in 
> the same position in their respective fields. This works brilliantly, BUT 
> requires a change to SpanNearQuery's constructor (which checks that all the 
> clauses are against the same field). Are people amenable to perhaps adding 
> another constructor to SNQ which doesn't do the check, or subclassing it to 
> do the same (give it a protected non-checking constructor for the subclass to 
> call)?
> 2) It gets slightly more complicated in the case of variable-length terms. 
> For example, imagine if we had an 'address' field ('123 Smith St') which will 
> result in (1 to n) tokens; slop 0 in a SpanNearQuery won't work here, of 
> course. One thing we've toyed with is the idea of using 
> getPositionIncrementGap -- if we knew that 'address' would be, at most, 20 
> tokens, we might use a position increment gap of 100, and make the slop 
> factor 50; this works fine for the simple case (yay!), but with a great many 
> addresses-per-user starts to get more complicated, as the gap counts from the 
> last term (so the position sequence for a single value field might be 0, 100, 
> 200, but for the address field it might be 0, 1, 2, 3, 103, 104, 105, 106, 
> 206, 207... so it's going to get out of sync). The simplest option here seems 
> to be changing (or supplementing)
>    public int getPositionIncrementGap(String fieldname)
> to
>    public int getPositionIncrementGap(String fieldname, int currentPos)
> so that we can override that to round up to the nearest 100 (or whatever) 
> based on currentPos. The default implementation could just delegate to 
> getPositionIncrementGap().
> ---
> Patches (x2) to follow shortly
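
The position drift described in point 2 of the quoted post, and the proposed 
round-up fix, can be sketched with plain arithmetic. This is a simplified model 
that follows the example numbers in the post (next value starts at last 
position + gap), not Lucene's exact position accounting. The names 
fixedGapStarts, roundedGap and roundedStarts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PositionGapSketch {
    // Start position of each value when a fixed gap is added after the
    // last token of the previous value (the model used in the post).
    static List<Integer> fixedGapStarts(int[] tokenCounts, int gap) {
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        for (int count : tokenCounts) {
            starts.add(start);
            int lastPos = start + count - 1;
            start = lastPos + gap;
        }
        return starts;
    }

    // Hypothetical position-aware gap: the distance from currentPos to the
    // next multiple of block, so every value starts on a block boundary.
    static int roundedGap(int currentPos, int block) {
        int nextBoundary = (currentPos / block + 1) * block;
        return nextBoundary - currentPos;
    }

    // Start position of each value when the gap is computed from currentPos.
    static List<Integer> roundedStarts(int[] tokenCounts, int block) {
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        for (int count : tokenCounts) {
            starts.add(start);
            int lastPos = start + count - 1;
            start = lastPos + roundedGap(lastPos, block);
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] singleToken = {1, 1, 1}; // e.g. one country token per address
        int[] multiToken = {4, 4, 2};  // e.g. address values of varying length

        System.out.println(fixedGapStarts(singleToken, 100)); // [0, 100, 200]
        System.out.println(fixedGapStarts(multiToken, 100));  // [0, 103, 206]
        System.out.println(roundedStarts(multiToken, 100));   // [0, 100, 200]
    }
}
```

With the fixed gap, the multi-token field drifts away from the single-token 
field (103 vs 100, 206 vs 200); with the rounded gap, both fields start their 
n-th value at the same position, as long as no value exceeds the block size.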

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.