Hi Tim.
Thanks for your help. A friend gave me some code (some snippets below) that
dumps the spans the query supposedly matches, which provided some more
insight. Perhaps some of my findings will help someone fix the bug.
So, I added my 2 documents
public static String [] DOCS = {
"bauthors bauthor blname mcbeath elname slname bfname darin william " +
"efname sfname eauthor sauthor bauthor blname fulford elname slname bfname " +
"darby efname sfname eauthor sauthor bauthor blname mcbeath elname slname " +
"bfname darby efname sfname eauthor sauthor eauthors sauthors",
"bauthors bauthor blname mcbeath elname slname bfname darin efname sfname " +
"eauthor sauthor bauthor blname fulford elname slname bfname darin efname " +
"sfname eauthor sauthor eauthors sauthors",
};
I then coded the following SpanQuery.
// Simple query for fname:darin and lname:fulford
ArrayList<SpanQuery> innerSpans = new ArrayList<SpanQuery>();
// Construct the last name span
ArrayList<SpanQuery> spansln = new ArrayList<SpanQuery>();
spansln.add(new SpanTermQuery(new Term("content", "blname")));
spansln.add(new SpanTermQuery(new Term("content", "fulford")));
spansln.add(new SpanTermQuery(new Term("content", "elname")));
SpanNearQuery lnInnerIncludeQuery = new SpanNearQuery(spansln.toArray(new
SpanQuery[spansln.size()]), Integer.MAX_VALUE, true);
// Add the sep marker to the not clause
SpanQuery lnInnerExcludeQuery = new SpanTermQuery(new Term("content",
"slname"));
innerSpans.add(new SpanNotQuery(lnInnerIncludeQuery,lnInnerExcludeQuery));
// Construct the first name span
ArrayList<SpanQuery> spansfn = new ArrayList<SpanQuery>();
spansfn.add(new SpanTermQuery(new Term("content", "bfname")));
spansfn.add(new SpanTermQuery(new Term("content", "darin")));
spansfn.add(new SpanTermQuery(new Term("content", "efname")));
SpanNearQuery fnInnerIncludeQuery = new SpanNearQuery(spansfn.toArray(new
SpanQuery[spansfn.size()]), Integer.MAX_VALUE, true);
// Add the sep marker to the not clause
SpanQuery fnInnerExcludeQuery = new SpanTermQuery(new Term("content",
"sfname"));
innerSpans.add(new SpanNotQuery(fnInnerIncludeQuery,fnInnerExcludeQuery));
// Make the first/last name spans unordered
SpanNearQuery innerSpanQuery = new SpanNearQuery(innerSpans.toArray(new
SpanQuery[innerSpans.size()]), Integer.MAX_VALUE, false);
ArrayList<SpanQuery> outerSpanQuery = new ArrayList<SpanQuery>();
outerSpanQuery.add(new SpanTermQuery(new Term("content", "bauthor")));
outerSpanQuery.add(innerSpanQuery);
outerSpanQuery.add(new SpanTermQuery(new Term("content", "eauthor")));
SpanNearQuery includeQuery = new SpanNearQuery(outerSpanQuery.toArray(new
SpanQuery[outerSpanQuery.size()]), Integer.MAX_VALUE, true);
// Add the sep marker to the not clause
SpanQuery excludeQuery = new SpanTermQuery(new Term("content", "sauthor"));
SpanNotQuery finalQuery = new SpanNotQuery(includeQuery,excludeQuery);
doSpanQuery(finalQuery, searcher, "fname:darin AND lname:fulford");
I noticed that this incorrectly matches doc 4 (results are below).
BEGIN QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor,
spanNear([spanNot(spanNear([content:blname, content:fulford, content:elname],
2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)],
2147483647, false), content:eauthor], 2147483647, true), content:sauthor, 0, 0)
Score Doc: doc=5 score=1.0829407 shardIndex=-1
'bauthors bauthor blname mcbeath elname slname bfname darin efname sfname
eauthor sauthor bauthor blname fulford elname slname bfname darin efname
sfname eauthor sauthor eauthors sauthors'
Score Doc: doc=4 score=0.610962 shardIndex=-1
'bauthors bauthor blname mcbeath elname slname bfname darin william efname
sfname eauthor sauthor bauthor blname fulford elname slname bfname darby
efname sfname eauthor sauthor bauthor blname mcbeath elname slname bfname
darby efname sfname eauthor sauthor eauthors sauthors'
Doc: 4 Start: 1 End: 12
Doc: 5 Start: 1 End: 11
Doc: 5 Start: 12 End: 22
END QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor,
spanNear([spanNot(spanNear([content:blname, content:fulford, content:elname],
2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)],
2147483647, false), content:eauthor], 2147483647, true), content:sauthor, 0, 0)
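To see which tokens the dumped spans actually cover, a small plain-Java helper along these lines can be used. This is only a sketch and not Lucene code: the class and method names are made up for illustration, and it assumes whitespace tokenization with one position per token, and that the reported End is exclusive.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): maps a dumped span back to tokens.
final class SpanTokens {
    static final String DOC4 =
        "bauthors bauthor blname mcbeath elname slname bfname darin william "
        + "efname sfname eauthor sauthor bauthor blname fulford elname slname bfname "
        + "darby efname sfname eauthor sauthor bauthor blname mcbeath elname slname "
        + "bfname darby efname sfname eauthor sauthor eauthors sauthors";
    static final String DOC5 =
        "bauthors bauthor blname mcbeath elname slname bfname darin efname sfname "
        + "eauthor sauthor bauthor blname fulford elname slname bfname darin efname "
        + "sfname eauthor sauthor eauthors sauthors";

    // Returns the tokens at positions start (inclusive) to end (exclusive).
    // Assumption: whitespace tokenization, position increment of 1 per token.
    static List<String> tokensInSpan(String doc, int start, int end) {
        String[] tokens = doc.trim().split("\\s+");
        return Arrays.asList(tokens).subList(start, end);
    }

    public static void main(String[] args) {
        System.out.println("Doc 4 [1,12):  " + tokensInSpan(DOC4, 1, 12));
        System.out.println("Doc 5 [1,11):  " + tokensInSpan(DOC5, 1, 11));
        System.out.println("Doc 5 [12,22): " + tokensInSpan(DOC5, 12, 22));
    }
}
```

Under those assumptions, the doc 4 span 1-12 covers only the first author (mcbeath, darin william) and never reaches 'fulford'. The doc 5 span 1-11 likewise covers only the first author, while 12-22 is the author that really has lname fulford and fname darin.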
I then made one small change (made this SpanNearQuery 'ordered')
// Make the first/last name spans ordered
SpanNearQuery innerSpanQuery = new SpanNearQuery(innerSpans.toArray(new
SpanQuery[innerSpans.size()]), Integer.MAX_VALUE, true);
And I get the correct results.
BEGIN QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor,
spanNear([spanNot(spanNear([content:blname, content:fulford, content:elname],
2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)],
2147483647, true), content:eauthor], 2147483647, true), content:sauthor, 0, 0)
Score Doc: doc=5 score=0.76575476 shardIndex=-1
'bauthors bauthor blname mcbeath elname slname bfname darin efname sfname
eauthor sauthor bauthor blname fulford elname slname bfname darin efname
sfname eauthor sauthor eauthors sauthors'
Doc: 5 Start: 12 End: 22
END QUERY (fname:darin AND lname:fulford): spanNot(spanNear([content:bauthor,
spanNear([spanNot(spanNear([content:blname, content:fulford, content:elname],
2147483647, true), content:slname, 0, 0), spanNot(spanNear([content:bfname,
content:darin, content:efname], 2147483647, true), content:sfname, 0, 0)],
2147483647, true), content:eauthor], 2147483647, true), content:sauthor, 0, 0)
I'm not sure why switching from 'unordered' to 'ordered' makes it work
correctly, but it certainly sounds like a bug in Lucene.
If you have any thoughts for a workaround, I would be interested.
Thanks again.
Darin.
----- Original Message -----
From: "Allison, Timothy B." <[email protected]>
To: Darin McBeath <[email protected]>; "[email protected]"
<[email protected]>
Cc:
Sent: Monday, June 9, 2014 2:10 PM
Subject: RE: SpanQuery not working as expected
Darin,
I confirmed the behavior you reported. This is probably the same bug that
was reported in LUCENE-5331. The trigger there seems to be multiple examples of
the same token (which you have plenty of). I tested with just this:
[[darin fulford]~100 sauthor]!~0,0
darin fulford (non-directional) but no intervening sauthor
And that works correctly.
I also tested:
[[darin fulford]~100 (bauthor sauthor)]!~0,0
Same as above but with a SpanOr for bauthor|sauthor. And that works correctly,
too.
So, yes, I think what you've found is a bug, unfortunately a known one that
hasn't been fixed. There's also a chance that something else is going
on...when I took your query and removed b[lf]name and e[fl]name, the query
still brought back both docs. So, if you want to go this route, I'd recommend
flattening the markup as much as possible, but it still just might not be
possible.
I'm not sure that I understand all of your use cases, but, in general, the more
you can do with adding non-hierarchical meta-fields and the less you have to
hack markup, the better. That said, it sounds like your problem is what the
child/parent block join queries were built for, and given your response, it
sounds like you've already gone that route and you've found performance not to
be sufficient.
I'm sorry that I can't be of more help.
Best,
Tim
-----Original Message-----
From: Darin McBeath [mailto:[email protected]]
Sent: Friday, June 06, 2014 1:03 PM
To: Allison, Timothy B.; [email protected]
Subject: Re: SpanQuery not working as expected
Thanks Tim.
I have thought about this for the author field and, as you suggest, it would
probably work. I was actually going to experiment with this later today.
But I have another field that has a bit more nesting (and it contains authors).
For example, within a given document, I have the following:
References [one or more]
    Authors [one or more]
        First Name
        Last Name
So, I would need to search for a specific author (matching first name and last
name) within a specific reference for a document. With this double level of
nesting, I don't think the multivalued field approach would work (please
correct me if I'm wrong). That's why I decided to use span queries. My index
has more than 100 fields, but I only have 2 or 3 fields that require this
structure search capability. There are also many documents (100M) so I didn't
really want to get into a parent-child type approach.
Plus, there are also many other fields (both within an author and within an
individual reference) that need to be scoped. For example, there is a 'source
title' and an 'article title', both at the reference level. I would need to
search within a given reference where the 'source title' contains some words,
where the 'article title' contains some words, and where a specific author
within this reference has 'john' as the first name and 'smith' as the last
name.
I guess I'm curious if what I was doing with the SpanQuery should have worked,
whether I misunderstood something, or if this is a bug.
Darin.
________________________________
From: "Allison, Timothy B." <[email protected]>
To: "[email protected]" <[email protected]>; Darin McBeath
<[email protected]>
Sent: Friday, June 6, 2014 10:12 AM
Subject: RE: SpanQuery not working as expected
Hi Darin,
Have you thought about using multivalued fields? If you set the
positionIncrementGap to something kind of big (well > 1, say :) ), and you know
that your data is always authorfirst, authorlast, you could just search for
"darin fulford".
The positionIncrementGap will prevent matching on Doc2 below.
Doc1
Authorsfield:
Darin fulford
Doc2
Authorsfield:
Matilda darin
Fulford alexandria
Don't get me wrong, I love the capabilities of SpanQuery, but will this simple
solution meet your needs?
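To illustrate why the gap keeps the phrase from matching across values, here is a toy model in plain Java. It is only a sketch and not Lucene code: the class and method names are made up, and it assumes each token advances the position by 1 and that the gap is simply added to the position between values, which is my understanding of the behavior rather than a copy of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model of a multivalued field (not the Lucene implementation).
final class GapDemo {
    // True if 'second' occurs at the position directly after 'first'
    // (an exact two-term phrase) once all values are laid out with the given gap.
    static boolean phraseMatches(String[] values, int gap, String first, String second) {
        List<String> terms = new ArrayList<String>();
        List<Integer> positions = new ArrayList<Integer>();
        int position = -1;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                position += gap; // assumed effect of the position increment gap
            }
            for (String token : values[v].toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                position++;
                terms.add(token);
                positions.add(position);
            }
        }
        for (int i = 0; i < terms.size(); i++) {
            if (!terms.get(i).equals(first)) {
                continue;
            }
            int firstPos = positions.get(i);
            for (int j = 0; j < terms.size(); j++) {
                int secondPos = positions.get(j);
                if (terms.get(j).equals(second) && secondPos == firstPos + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] doc1 = {"Darin fulford"};
        String[] doc2 = {"Matilda darin", "Fulford alexandria"};
        System.out.println("doc1, gap 100: " + phraseMatches(doc1, 100, "darin", "fulford"));
        System.out.println("doc2, gap 0:   " + phraseMatches(doc2, 0, "darin", "fulford"));
        System.out.println("doc2, gap 100: " + phraseMatches(doc2, 100, "darin", "fulford"));
    }
}
```

In this model, Doc2 matches "darin fulford" only when the gap is 0, because 'darin' ends one value and 'fulford' starts the next. With a gap of 100 the two terms are no longer adjacent, while Doc1 still matches.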
-----Original Message-----
From: Darin McBeath [mailto:[email protected]]
Sent: Thursday, June 05, 2014 7:17 PM
To: [email protected]
Subject: SpanQuery not working as expected
I read through http://searchhub.org/2009/07/18/the-spanquery/, which
provided a good overview of how one can construct fairly complex span queries.
I was particularly interested in the ability to construct nested span queries.
I'm trying to apply this concept to search a field that contains some
structure (as below). I have a couple of other fields that will have a bit
more nesting, but this should give the general idea.
authors
    author [one or more]
        first name
        last name
Prior to indexing the content with Lucene, I added some 'markers' around the
various bits I might want to search. For example 'bauthor' implies beginning
author, 'eauthor' implies ending author, and 'sauthor' implies a separator
between individual authors (that would be used as part of the exclude clause in
a not span query). I do similar things for 'first name' and 'last name'.
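For reference, the marker step could look roughly like the sketch below. It is not my actual indexing code: the class and method names are made up for illustration, and the layout (last name before first name, each element followed by its separator marker) simply follows the sample documents later in this message.

```java
// Hypothetical sketch of the pre-indexing marker step (names are illustrative).
final class AuthorMarkup {
    // Each author is given as {lastName, firstName}.
    static String mark(String[][] authors) {
        StringBuilder sb = new StringBuilder("bauthors");
        for (String[] author : authors) {
            sb.append(" bauthor")
              .append(" blname ").append(author[0]).append(" elname slname")
              .append(" bfname ").append(author[1]).append(" efname sfname")
              .append(" eauthor sauthor");
        }
        // Close the outer authors element and add its separator marker.
        return sb.append(" eauthors sauthors").toString();
    }

    public static void main(String[] args) {
        String[][] authors = {{"mcbeath", "darin"}, {"fulford", "darin"}};
        System.out.println(mark(authors));
    }
}
```

Calling mark with mcbeath/darin and fulford/darin produces the same token sequence as the document with id:2 (without the leading and trailing spaces).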
My constructed query (as interpreted by Lucene) is included below. This was
extracted from the 'parsed string' returned from the query when I set
debug=true. Within a given 'authscope' field, I'm trying to find a situation
where the author first name is 'darin' and the last name is 'fulford' within a
given 'author'.
spanNot(
  spanNear(
    [authscope:bauthor,
     spanNear(
       [spanNot(
          spanNear(
            [authscope:bfname,
             authscope:darin,
             authscope:efname],
            2147483647, true),
          authscope:sfname, 0, 0),
        spanNot(
          spanNear(
            [authscope:blname,
             authscope:fulford,
             authscope:elname],
            2147483647, true),
          authscope:slname, 0, 0)],
       2147483647, false),
     authscope:eauthor],
    2147483647, true),
  authscope:sauthor, 0, 0)
I have loaded the following 2 documents into my index.
[
{"id":"1", "authscope":" bauthors bauthor blname mcbeath elname slname
bfname darin efname sfname eauthor sauthor bauthor blname fulford elname
slname bfname darby efname sfname eauthor sauthor bauthor blname mcbeath
elname slname bfname darby efname sfname eauthor sauthor eauthors sauthors
"},
{"id":"2", "authscope":" bauthors bauthor blname mcbeath elname slname
bfname darin efname sfname eauthor sauthor bauthor blname fulford elname
slname bfname darin efname sfname eauthor sauthor eauthors sauthors "}
]
What I can't figure out is why the above query would match on both documents.
It should only match the document with id:2.
Any insights would be appreciated. I'm using Lucene 4.7.2.
Darin.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]