Searching in same position across multiple fields

Paul Cowan Mon, 15 Dec 2008 19:15:03 -0800

Hi all,

(All examples below are using Lucene 2.2; if things have changed in later versions please adjust accordingly, though a quick check of the classes involved shows no major changes in trunk.)

We have an interesting situation where we are effectively indexing two 'entities' in our system, which share a one-to-many relationship (imagine 'User' and 'Delivery Address' for demonstration purposes). At the moment, we index one Lucene Document per 'many' end, duplicating the 'one' end data, like so:


        userid: 1
        userfirstname: fred
        addresscountry: au
        addressphone: 1234

        userid: 1
        userfirstname: fred
        addresscountry: nz
        addressphone: 5678

        userid: 2
        userfirstname: mary
        addresscountry: au
        addressphone: 5678

(Note: two Documents are indexed for user 1.) This is somewhat annoying for us, because when we search in Lucene the results we want back (conceptually) are at the 'user' level, so we have to collapse the results by distinct user id, and so on (let alone that it blows out the size of our index enormously). So why do we do it? It would make more sense to index one Document per user and add the address fields multiple times:
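For illustration only, the collapsing step might look something like the sketch below. It is plain Java with no Lucene calls; the class and method names are made up for this example, and it assumes the user id of each hit has already been read from the stored 'userid' field, in rank order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class CollapseHitsDemo {
    // Keeps the first occurrence of each user id and preserves hit order.
    static List<Integer> collapseByUserId(List<Integer> userIdsInHitOrder) {
        return new ArrayList<>(new LinkedHashSet<>(userIdsInHitOrder));
    }

    public static void main(String[] args) {
        // Hits for the three per-address Documents shown earlier: users 1, 1, 2.
        System.out.println(collapseByUserId(Arrays.asList(1, 1, 2)));
    }
}
```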

        userid: 1
        userfirstname: fred
        addresscountry: au
        addressphone: 1234
        addresscountry: nz
        addressphone: 5678

        userid: 2
        userfirstname: mary
        addresscountry: au
        addressphone: 5678

But imagine the search "+addresscountry:au +addressphone:5678". We'd like this to match ONLY Mary, but of course it matches Fred also because he matches both those terms (just for different addresses).
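The false match can be reproduced without Lucene at all. The sketch below is a plain-Java stand-in (a Map per document; all names are hypothetical) for a query where every term is required but positions are ignored:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a document with multi-valued fields.
class FlattenedFieldsDemo {
    // Builds a document from alternating field-name / value arguments.
    static Map<String, List<String>> doc(String... fieldValuePairs) {
        Map<String, List<String>> d = new LinkedHashMap<>();
        for (int i = 0; i < fieldValuePairs.length; i += 2) {
            d.computeIfAbsent(fieldValuePairs[i], k -> new ArrayList<>())
             .add(fieldValuePairs[i + 1]);
        }
        return d;
    }

    // Mimics "+f1:v1 +f2:v2": each value must occur somewhere in its field.
    static boolean matchesAll(Map<String, List<String>> d, String... required) {
        for (int i = 0; i < required.length; i += 2) {
            List<String> values = d.getOrDefault(required[i], Collections.<String>emptyList());
            if (!values.contains(required[i + 1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fred = doc(
            "addresscountry", "au", "addressphone", "1234",
            "addresscountry", "nz", "addressphone", "5678");
        Map<String, List<String>> mary = doc(
            "addresscountry", "au", "addressphone", "5678");
        // Both lines report true, although only Mary has an 'au' address with phone 5678.
        System.out.println("fred matches: " + matchesAll(fred, "addresscountry", "au", "addressphone", "5678"));
        System.out.println("mary matches: " + matchesAll(mary, "addresscountry", "au", "addressphone", "5678"));
    }
}
```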

There are two aspects to the approach we've (more or less) got working, but I'd like to run them past the group and see if they're worth trying to get into Lucene proper (if so, I'll create a JIRA issue for them).

1) Use a modified SpanNearQuery. If we assume that country and phone will each always be a single token, we can rely on the fact that the positions of 'au' and '5678' in Fred's document will be different.


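   // One span clause per field; note that the two clauses target DIFFERENT fields.
   SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
   SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
   // Slop 0, unordered. The stock constructor rejects mixed-field clauses, hence the modification.
   SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);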

The slop of 0 means that we'll only return those where the two terms are in the same position in their respective fields. This works brilliantly, BUT requires a change to SpanNearQuery's constructor (which checks that all the clauses are against the same field). Are people amenable to perhaps adding another constructor to SNQ which doesn't do the check, or subclassing it to do the same (give it a protected non-checking constructor for the subclass to call)?
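As a rough illustration of the matching rule itself (not of SpanNearQuery's internals), here is a plain-Java sketch. It assumes one token per value, so that a value's array index plays the role of its position in the field; the class and method names are invented for this example.

```java
// Hypothetical model: index i in each array is the token position of the i-th value.
class SamePositionDemo {
    // True only if some position holds the wanted country AND the wanted phone.
    static boolean matchesAtSamePosition(String[] countries, String country,
                                         String[] phones, String phone) {
        int n = Math.min(countries.length, phones.length);
        for (int pos = 0; pos < n; pos++) {
            if (countries[pos].equals(country) && phones[pos].equals(phone)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] fredCountries = {"au", "nz"};
        String[] fredPhones = {"1234", "5678"};
        String[] maryCountries = {"au"};
        String[] maryPhones = {"5678"};
        // Fred: 'au' is at position 0 but '5678' is at position 1, so no match.
        System.out.println("fred: " + matchesAtSamePosition(fredCountries, "au", fredPhones, "5678"));
        // Mary: both terms are at position 0, so she matches.
        System.out.println("mary: " + matchesAtSamePosition(maryCountries, "au", maryPhones, "5678"));
    }
}
```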

2) It gets slightly more complicated in the case of variable-length terms. For example, imagine if we had an 'address' field ('123 Smith St') which will result in (1 to n) tokens; slop 0 in a SpanNearQuery won't work here, of course. One thing we've toyed with is the idea of using getPositionIncrementGap -- if we knew that 'address' would be, at most, 20 tokens, we might use a position increment gap of 100, and make the slop factor 50; this works fine for the simple case (yay!), but with a great many addresses-per-user starts to get more complicated, as the gap counts from the last term (so the position sequence for a single value field might be 0, 100, 200, but for the address field it might be 0, 1, 2, 3, 103, 104, 105, 106, 206, 207... so it's going to get out of sync). The simplest option here seems to be changing (or supplementing)

   public int getPositionIncrementGap(String fieldname)
to
   public int getPositionIncrementGap(String fieldname, int currentPos)

so that we can override that to round up to the nearest 100 (or whatever) based on currentPos. The default implementation could just delegate to the existing single-argument getPositionIncrementGap(fieldname).
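To show the arithmetic, here is a plain-Java sketch of the drift and of the proposed rounding. It uses a simplified model that follows the numbers above (the gap is added to the position of the last token of the previous value) and is not Lucene's actual indexing code. BLOCK, roundingGap and startPositions are names made up for this example, and each value is assumed to have between 1 and BLOCK - 1 tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of where each value of a multi-valued field starts.
class AlignedGapDemo {
    static final int BLOCK = 100;

    // What a two-argument gap hook could return: just enough to reach
    // the next multiple of BLOCK after the last position used.
    static int roundingGap(int currentPos) {
        return BLOCK - (currentPos % BLOCK);
    }

    // Start position of each value, given the token count of each value.
    static List<Integer> startPositions(int[] tokensPerValue, boolean rounding) {
        List<Integer> starts = new ArrayList<>();
        int last = 0; // position of the last token written so far
        for (int i = 0; i < tokensPerValue.length; i++) {
            int start = 0;
            if (i > 0) {
                start = last + (rounding ? roundingGap(last) : BLOCK);
            }
            starts.add(start);
            last = start + tokensPerValue[i] - 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] fourTokenAddresses = {4, 4, 4};
        int[] singleTokenValues = {1, 1, 1};
        System.out.println(startPositions(fourTokenAddresses, false)); // fixed gap drifts: [0, 103, 206]
        System.out.println(startPositions(fourTokenAddresses, true));  // rounded: [0, 100, 200]
        System.out.println(startPositions(singleTokenValues, true));   // rounded: [0, 100, 200]
    }
}
```

With the rounded gap, the n-th value of every field starts at n * 100 in this model, so two positions belong to the same address exactly when they fall in the same block of 100.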

What do people think? Is this ugly, or worth pursuing? Does anyone have any other, better ideas? I was curious as to whether Hibernate Search deals with this problem, in terms of many-to-one relationships. However, it's actually not clear from the documentation whether it actually DOES or not, so if anyone has insight into that, that would be great.


Thanks in advance,

Paul
