[jira] Commented: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)

Hoss Man (JIRA) Wed, 21 Jan 2009 12:34:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665961#action_12665961
 ]


Hoss Man commented on LUCENE-1494:
----------------------------------



bq. I don't disagree that an inverted inheritance hierarchy would make more 
sense, but the problem with that is that getField (which I think is the only 
thing on SpanNearQuery that doesn't really make sense for a MultiField one) is 
mandated by the abstract method declaration of same in SpanQuery.

Ah, right right ... of course.  I was thinking getField was just a 
SpanNearQuery concept, but it's actually central to the whole concept of 
SpanQuery.

This actually raises some interesting questions about the behavior of all of 
this...

Beyond just the explain methods, SpanQuery.getField plays two important roles:
# it determines what norms get used by SpanScorer
# it dictates what other SpanQueries this query can be nested in -- so far 
we've really only discussed directly executing a MultiFieldSpanNearQuery, but 
we also have to consider what it means to combine a MultiFieldSpanNearQuery in 
another SpanQuery

At the moment, your patch treats the first element of the SpanQuery[] used to 
construct the MultiFieldSpanNearQuery as "special" -- it specifies the field 
which determines the norms used and what other SpanQueries it can be combined 
with.  At a minimum that special case behavior needs to be documented, but we 
may also want to consider tweaking the API to make it more explicit (i.e. 
perhaps when constructing a MultiFieldSpanNearQuery you should be required to 
explicitly state the field name you want to use).  It may also be worth 
considering whether or not MultiFieldSpanNearQuery should use a custom Scorer 
that takes into account the norms of all the fields (averaging them maybe?)


(FWIW: this highlights one of the reasons why a multi-field PhraseQuery would 
be much simpler to implement than a multi-field SpanNearQuery ... the 
superclass of PhraseQuery (Query) has no inherent concept of a field, so it 
would be easy to inject a new superclass in the middle there)


The more I think about this, the more I wonder if a simpler solution would be a 
SpanQuery that wraps another SpanQuery and proxies all of the methods except 
for the getField() method, i.e. ...

{code}
public class MaskFieldSpanQuery extends SpanQuery {
  SpanQuery inner;  // wrapped query that does the actual matching
  String field;     // field name reported to enclosing SpanQueries
  public MaskFieldSpanQuery(String field, SpanQuery inner) { ... }
  public String getField() { return field; }
  public Spans getSpans(IndexReader reader) { return inner.getSpans(reader); }
  public PayloadSpans getPayloadSpans(IndexReader reader) { ...
  ...
}
{code}

I haven't tested this out, but it seems that wrapping a bunch of SpanQueries in 
something like this and then building up a SpanNearQuery should be functionally 
equivalent to the existing MultiFieldSpanNearQuery in the patch, but would also 
allow for other interesting things (like a SpanNotQuery where you want to find 
all docs that match on rating_year:2004 but *not* if rating_score:POOR matches 
in the same position).
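
A minimal, Lucene-free sketch of the wrapping idea follows, using the Fred/Mary data from the issue description. Every `Toy*` type here is a hypothetical stand-in invented for illustration, not a Lucene class, and the "same position" rule simply mirrors how the issue description characterises a slop of 0. It is meant only to show how masking the reported field lets clauses on different fields pass a same-field check, not how the real SpanNearQuery computes spans.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy stand-in for SpanQuery: a reported field plus the positions matched in one document.
interface ToySpanQuery {
    String getField();
    List<Integer> positions(Map<String, List<String>> doc);
}

// Matches every position in `field` whose token equals `text` (loosely, a SpanTermQuery).
class ToyTermQuery implements ToySpanQuery {
    private final String field;
    private final String text;
    ToyTermQuery(String field, String text) { this.field = field; this.text = text; }
    public String getField() { return field; }
    public List<Integer> positions(Map<String, List<String>> doc) {
        List<Integer> hits = new ArrayList<>();
        List<String> tokens = doc.getOrDefault(field, List.of());
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(text)) hits.add(i);
        }
        return hits;
    }
}

// The proposed wrapper: matching is delegated to `inner`, only getField() is overridden.
class ToyFieldMask implements ToySpanQuery {
    private final String maskedField;
    private final ToySpanQuery inner;
    ToyFieldMask(String maskedField, ToySpanQuery inner) {
        this.maskedField = maskedField;
        this.inner = inner;
    }
    public String getField() { return maskedField; }
    public List<Integer> positions(Map<String, List<String>> doc) { return inner.positions(doc); }
}

// Toy "near with slop 0": clauses must report one field and share a matching position.
class ToySamePositionQuery implements ToySpanQuery {
    private final List<ToySpanQuery> clauses;
    ToySamePositionQuery(ToySpanQuery... clauses) {
        if (clauses.length == 0) throw new IllegalArgumentException("need at least one clause");
        for (ToySpanQuery c : clauses) {
            if (!c.getField().equals(clauses[0].getField())) {
                throw new IllegalArgumentException("clauses must have the same field");
            }
        }
        this.clauses = List.of(clauses);
    }
    public String getField() { return clauses.get(0).getField(); }
    public List<Integer> positions(Map<String, List<String>> doc) {
        List<Integer> shared = new ArrayList<>(clauses.get(0).positions(doc));
        for (ToySpanQuery c : clauses) shared.retainAll(c.positions(doc));
        return shared;
    }
}

class FieldMaskDemo {
    // One token per address value, so position i in each field belongs to address i.
    static final Map<String, List<String>> FRED = Map.of(
        "addresscountry", List.of("au", "nz"),
        "addressphone", List.of("1234", "5678"));
    static final Map<String, List<String>> MARY = Map.of(
        "addresscountry", List.of("au"),
        "addressphone", List.of("5678"));

    // addresscountry:au and addressphone:5678 at the same position, phone masked as country.
    static ToySpanQuery auWith5678() {
        ToySpanQuery country = new ToyTermQuery("addresscountry", "au");
        ToySpanQuery phone =
            new ToyFieldMask("addresscountry", new ToyTermQuery("addressphone", "5678"));
        return new ToySamePositionQuery(country, phone);
    }

    public static void main(String[] args) {
        System.out.println("fred matches: " + !auWith5678().positions(FRED).isEmpty());
        System.out.println("mary matches: " + !auWith5678().positions(MARY).isEmpty());
    }
}
```

In this toy model the unmasked pair of term queries is rejected by the same-field check, while the masked pair is accepted and matches only Mary, whose `au` and `5678` share a position.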


What do people think?




> Additional features for searching for value across multiple fields 
> (many-to-one style)
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1494-multifield.patch, 
> LUCENE-1494-positionincrement.patch
>
>
> This issue is to cover the changes required to do a search across multiple 
> fields with the same name in a fashion similar to a many-to-one database. 
> Below is my post on java-dev on the topic, which details the changes we need:
> ---
> We have an interesting situation where we are effectively indexing two 
> 'entities' in our system, which share a one-to-many relationship (imagine 
> 'User' and 'Delivery Address' for demonstration purposes). At the moment, we 
> index one Lucene Document per 'many' end, duplicating the 'one' end data, 
> like so:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     userid: 1
>     userfirstname: fred
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> (note: 2 Documents indexed for user 1). This is somewhat annoying for us, 
> because when we search in Lucene the results we want back (conceptually) are 
> at the 'user' level, so we have to collapse the results by distinct user id, 
> etc. etc (let alone that it blows out the size of our index enormously). So 
> why do we do it? It would make more sense to use multiple fields:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> But imagine the search "+addresscountry:au +addressphone:5678". We'd like 
> this to match ONLY Mary, but of course it matches Fred also because he 
> matches both those terms (just for different addresses).
> There are two aspects to the approach we've (more or less) got working but 
> I'd like to run them past the group and see if they're worth trying to get 
> them into Lucene proper (if so, I'll create a JIRA issue for them)
> 1) Use a modified SpanNearQuery. If we assume that country + phone will 
> always be one token, we can rely on the fact that the positions of 'au' and 
> '5678' in Fred's document will be different.
>    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
>    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
>    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
> the slop of 0 means that we'll only return those where the two terms are in 
> the same position in their respective fields. This works brilliantly, BUT 
> requires a change to SpanNearQuery's constructor (which checks that all the 
> clauses are against the same field). Are people amenable to perhaps adding 
> another constructor to SNQ which doesn't do the check, or subclassing it to 
> do the same (give it a protected non-checking constructor for the subclass to 
> call)?
> 2) It gets slightly more complicated in the case of variable-length terms. 
> For example, imagine if we had an 'address' field ('123 Smith St') which will 
> result in (1 to n) tokens; slop 0 in a SpanNearQuery won't work here, of 
> course. One thing we've toyed with is the idea of using 
> getPositionIncrementGap -- if we knew that 'address' would be, at most, 20 
> tokens, we might use a position increment gap of 100, and make the slop 
> factor 50; this works fine for the simple case (yay!), but with a great many 
> addresses-per-user starts to get more complicated, as the gap counts from the 
> last term (so the position sequence for a single value field might be 0, 100, 
> 200, but for the address field it might be 0, 1, 2, 3, 103, 104, 105, 106, 
> 206, 207... so it's going to get out of sync). The simplest option here seems 
> to be changing (or supplementing)
>    public int getPositionIncrementGap(String fieldname)
> to
>    public int getPositionIncrementGap(String fieldname, int currentPos)
> so that we can override that to round up to the nearest 100 (or whatever) 
> based on currentPos. The default implementation could just delegate to 
> getPositionIncrementGap().
> ---
> Patches (x2) to follow shortly
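
The rounding described in point 2 of the quoted description can be sketched as plain arithmetic. This is only an illustration under stated assumptions: `alignedGap` and `startPositions` are hypothetical helpers, not Lucene methods; `currentPos` is assumed to be the position of the last token already indexed for the field; the next value's first token is assumed to land at `currentPos + gap`, as in the 0, 1, 2, 3, 103 example; and every value is assumed to have between 1 and `blockSize` tokens.

```java
// Hypothetical sketch of "round up to the nearest 100"; not part of Lucene's Analyzer API.
class AlignedGapSketch {

    // Gap that moves the next value's first token to the next multiple of blockSize.
    // Assumes currentPos >= 0 is the position of the last token indexed so far.
    static int alignedGap(int currentPos, int blockSize) {
        return blockSize - (currentPos % blockSize);
    }

    // First-token position of each value, given how many tokens each value produces.
    // Assumes 1 <= tokensPerValue[i] <= blockSize for every value.
    static int[] startPositions(int[] tokensPerValue, int blockSize) {
        int[] starts = new int[tokensPerValue.length];
        int pos = 0;
        for (int i = 0; i < tokensPerValue.length; i++) {
            starts[i] = pos;
            int last = pos + tokensPerValue[i] - 1;     // position of this value's last token
            pos = last + alignedGap(last, blockSize);   // jump to the next block boundary
        }
        return starts;
    }

    public static void main(String[] args) {
        // Multi-token address values and single-token values should start in step.
        System.out.println(java.util.Arrays.toString(startPositions(new int[] {4, 4, 2}, 100)));
        System.out.println(java.util.Arrays.toString(startPositions(new int[] {1, 1, 1}, 100)));
    }
}
```

Under these assumptions, a field with several tokens per value and a field with one token per value both start their n-th value at n times the block size, which is the alignment the slop-based query relies on.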

