[ https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658318#action_12658318 ]
Hoss Man commented on LUCENE-1494:
----------------------------------

i've only looked at LUCENE-1494-multifield.patch ... one problem i see is that SpanNearQuery stores and utilizes the field name in ways that don't make sense for the new MultiFieldSpanNearQuery subclass (ie: getField, ...).

I would suggest that instead you invert the inheritance: move the guts of SpanNearQuery into MultiFieldSpanNearQuery and make it a superclass of SpanNearQuery. This also eliminates the need for the mustBeSameField param...

{code}
public class SpanNearQuery extends MultiFieldSpanNearQuery {
  private final String field;

  public String getField() { return field; }

  public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) {
    super(clauses, slop, inOrder, true);
    // track the field in a local so the final member is assigned exactly once
    String f = null;
    for (int i = 0; i < clauses.length; i++) {
      SpanQuery clause = clauses[i];
      if (i == 0) {
        f = clause.getField();
      } else if (!clause.getField().equals(f)) {
        throw new IllegalArgumentException("Clauses must have same field.");
      }
    }
    field = f;
  }
  // :TODO: need to override equals from super ... maybe hashCode too
}
{code}

> Additional features for searching for value across multiple fields (many-to-one style)
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1494-multifield.patch, LUCENE-1494-positionincrement.patch
>
>
> This issue is to cover the changes required to do a search across multiple
> fields with the same name in a fashion similar to a many-to-one database.
> Below is my post on java-dev on the topic, which details the changes we need:
> ---
> We have an interesting situation where we are effectively indexing two
> 'entities' in our system, which share a one-to-many relationship (imagine
> 'User' and 'Delivery Address' for demonstration purposes). At the moment, we
> index one Lucene Document per 'many' end, duplicating the 'one' end data,
> like so:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>
>     userid: 1
>     userfirstname: fred
>     addresscountry: nz
>     addressphone: 5678
>
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> (note: 2 Documents indexed for user 1). This is somewhat annoying for us,
> because when we search in Lucene the results we want back (conceptually) are
> at the 'user' level, so we have to collapse the results by distinct user id,
> etc. etc (let alone that it blows out the size of our index enormously). So
> why do we do it? It would make more sense to use multiple fields:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     addresscountry: nz
>     addressphone: 5678
>
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> But imagine the search "+addresscountry:au +addressphone:5678". We'd like
> this to match ONLY Mary, but of course it matches Fred also because he
> matches both those terms (just for different addresses).
> There are two aspects to the approach we've (more or less) got working but
> I'd like to run them past the group and see if they're worth trying to get
> them into Lucene proper (if so, I'll create a JIRA issue for them)
> 1) Use a modified SpanNearQuery. If we assume that country + phone will
> always be one token, we can rely on the fact that the positions of 'au' and
> '5678' in Fred's document will be different.
>     SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
>     SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
>     SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
> the slop of 0 means that we'll only return those where the two terms are in
> the same position in their respective fields. This works brilliantly, BUT
> requires a change to SpanNearQuery's constructor (which checks that all the
> clauses are against the same field). Are people amenable to perhaps adding
> another constructor to SNQ which doesn't do the check, or subclassing it to
> do the same (give it a protected non-checking constructor for the subclass to
> call)?
> 2) It gets slightly more complicated in the case of variable-length terms.
> For example, imagine if we had an 'address' field ('123 Smith St') which will
> result in (1 to n) tokens; slop 0 in a SpanNearQuery won't work here, of
> course. One thing we've toyed with is the idea of using
> getPositionIncrementGap -- if we knew that 'address' would be, at most, 20
> tokens, we might use a position increment gap of 100, and make the slop
> factor 50; this works fine for the simple case (yay!), but with a great many
> addresses-per-user starts to get more complicated, as the gap counts from the
> last term (so the position sequence for a single value field might be 0, 100,
> 200, but for the address field it might be 0, 1, 2, 3, 103, 104, 105, 106,
> 206, 207... so it's going to get out of sync). The simplest option here seems
> to be changing (or supplementing)
>     public int getPositionIncrementGap(String fieldname)
> to
>     public int getPositionIncrementGap(String fieldname, int currentPos)
> so that we can override that to round up to the nearest 100 (or whatever)
> based on currentPos. The default implementation could just delegate to
> getPositionIncrementGap().
> ---
> Patches (x2) to follow shortly

--
This message is automatically generated by JIRA.
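To make the drift described in point 2) of the quoted post concrete, here is a rough, Lucene-free sketch of the position arithmetic. It is only an illustration under stated assumptions: the class PositionGapSketch, the BLOCK constant and the helper startPositions are hypothetical names made up for this example, and roundingGap just mimics the proposed two-argument getPositionIncrementGap(fieldname, currentPos), which is not an existing Lucene method. It also assumes, as the numbers in the post suggest, that the next value starts at (position of last token + gap).

```java
import java.util.Arrays;

public class PositionGapSketch {
    // Assumed block size, taken from the "round up to the nearest 100" idea in the post.
    static final int BLOCK = 100;

    // Constant gap, counted from the last token of the previous value.
    static int fixedGap(String fieldName) {
        return BLOCK;
    }

    // Hypothetical two-argument variant: pick the gap so that the next value
    // starts on the next multiple of BLOCK after currentPos.
    static int roundingGap(String fieldName, int currentPos) {
        return BLOCK - (currentPos % BLOCK);
    }

    // Start position of each value, given how many tokens each value produces.
    static int[] startPositions(int[] tokensPerValue, boolean rounding) {
        int[] starts = new int[tokensPerValue.length];
        int start = 0;
        for (int i = 0; i < tokensPerValue.length; i++) {
            starts[i] = start;
            int lastPos = start + tokensPerValue[i] - 1; // position of the value's last token
            int gap = rounding ? roundingGap("address", lastPos) : fixedGap("address");
            start = lastPos + gap;
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] phone = {1, 1, 1};   // one token per value
        int[] address = {4, 4, 4}; // four tokens per value
        System.out.println(Arrays.toString(startPositions(phone, false)));   // [0, 100, 200]
        System.out.println(Arrays.toString(startPositions(address, false))); // [0, 103, 206]
        System.out.println(Arrays.toString(startPositions(address, true)));  // [0, 100, 200]
    }
}
```

With the constant gap the multi-token field drifts (0, 103, 206) relative to the single-token field (0, 100, 200), which matches the sequence in the post; with the rounding variant both fields start each value on the same multiple of 100, so a slop smaller than the block size could pair up the n-th value of one field with the n-th value of the other.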