Re: Payloads and PhraseQuery
I guess this also ties in with 'getPositionIncrementGap', which is relevant to fields with multiple occurrences.

Peter

On 7/27/07, Peter Keegan wrote:

> I have a question about the way fields are analyzed and inverted by the index writer. Currently, if a field has multiple occurrences in a document, each occurrence is analyzed separately (see DocumentsWriter.processField). Is it safe to assume that this behavior won't change in the future?
>
> The reason I ask is that my custom analyzer's 'tokenStream' method creates a custom filter which produces a payload based on the existence of each field occurrence. However, if DocumentsWriter were changed to combine all the occurrences before inversion, my scheme wouldn't work. Since payloads are created by filters/tokenizers, it helps to keep things flexible.
>
> Thanks,
> Peter

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
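A minimal sketch of the scheme described above, written as a toy model rather than against the Lucene API. The names ToyToken and OccurrencePayloadSketch and the gap value of 100 are assumptions made for illustration. The only behavior taken from the message is that each occurrence of a field is analyzed separately, so a filter can stamp the occurrence it belongs to into the payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model, not Lucene code: a token with its position and a one-byte payload. */
final class ToyToken {
    final String term;
    final int position;
    final byte payload;

    ToyToken(String term, int position, byte payload) {
        this.term = term;
        this.position = position;
        this.payload = payload;
    }
}

final class OccurrencePayloadSketch {
    /** Hypothetical stand-in for the value returned by getPositionIncrementGap. */
    static final int POSITION_INCREMENT_GAP = 100;

    /**
     * Analyzes each occurrence of a multi-valued field on its own and stores
     * the occurrence index as the payload. If all occurrences were joined
     * before analysis, the occurrence index would no longer be known here.
     */
    static List<ToyToken> invert(List<String> occurrences) {
        List<ToyToken> tokens = new ArrayList<>();
        int position = -1; // last position assigned so far
        for (int occ = 0; occ < occurrences.size(); occ++) {
            if (occ > 0) {
                position += POSITION_INCREMENT_GAP; // gap between occurrences
            }
            for (String term : occurrences.get(occ).trim().split("\\s+")) {
                position++;
                tokens.add(new ToyToken(term, position, (byte) occ));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (ToyToken t : invert(Arrays.asList("red apple", "green pear"))) {
            System.out.println(t.term + " pos=" + t.position + " payload=" + t.payload);
        }
    }
}
```

In this model the gap keeps positions from different occurrences far apart, which is the connection to 'getPositionIncrementGap' mentioned in the follow-up.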
Re: Payloads and PhraseQuery
I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but neither NearSpansOrdered nor NearSpansUnordered (which are the Spans provided by SpanNearQuery) provides this method, and it's not clear to me how to add it.

Peter

On 7/11/07, Chris Hostetter wrote:

> : I'm now looking at using payloads with SpanNearQuery but I don't see any
> : clear way of getting the payload(s) from the matching span terms. The term
> : positions for the payloads seem to be buried beneath SpanCells in the
>
> Aren't Spans.start() and Spans.end() what you are looking for?
>
> -Hoss
Re: Payloads and PhraseQuery
That is off of the TermSpans class. BTQ (BoostingTermQuery) is implemented to extend SpanQuery, thus SpanNearQuery isn't, w/o modification, going to have access to these things. However, if you look at SpanTermQuery, you will see that its implementation of Spans is indeed the TermSpans class. So, I think you could cast to it or handle it through instanceof.

I am not completely sure here, but it seems like we may need an efficient way to access the TermPositions for each document. That is, the Spans class doesn't provide this and maybe it should somehow. Again, I am just thinking out loud here.

Thus, if we modified Spans to have the following methods:

    byte[] getPayload(byte[] data, int offset)
    boolean isPayloadAvailable()

I think this would be useful. Perhaps this should be discussed on dev.

Cheers,
Grant

On Jul 12, 2007, at 8:20 AM, Peter Keegan wrote:

> I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but neither NearSpansOrdered nor NearSpansUnordered (which are the Spans provided by SpanNearQuery) provide this method and it's not clear to me how to add it.
>
> Peter

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
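The instanceof workaround suggested above could look roughly like the following. This is a sketch with toy stand-in types, not the real Lucene classes: SketchSpans, SketchTermSpans, SketchNearSpans, and the payload() accessor are hypothetical names chosen for illustration.

```java
/** Toy stand-in for Spans: only the match boundaries are visible. */
interface SketchSpans {
    int start();
    int end();
}

/** Toy stand-in for TermSpans: one term position, so it can expose a payload. */
final class SketchTermSpans implements SketchSpans {
    private final int position;
    private final byte[] payload; // null when the position carries no payload

    SketchTermSpans(int position, byte[] payload) {
        this.position = position;
        this.payload = payload;
    }

    public int start() { return position; }
    public int end() { return position + 1; }

    /** Hypothetical accessor standing in for "get the positions, then the payload". */
    byte[] payload() { return payload; }
}

/** Toy stand-in for the Spans of a SpanNearQuery: boundaries only. */
final class SketchNearSpans implements SketchSpans {
    private final int start;
    private final int end;

    SketchNearSpans(int start, int end) {
        this.start = start;
        this.end = end;
    }

    public int start() { return start; }
    public int end() { return end; }
}

final class InstanceofSketch {
    /** Returns the payload when the spans come from a single term, else null. */
    static byte[] payloadOrNull(SketchSpans spans) {
        if (spans instanceof SketchTermSpans) {
            return ((SketchTermSpans) spans).payload();
        }
        return null; // composite spans give no route to the term positions
    }
}
```

The null branch is the gap the message points at: for anything other than a single-term span, the caller sees boundaries but has no way down to the payload.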
Re: Payloads and PhraseQuery
On Thursday 12 July 2007 14:50, Grant Ingersoll wrote:

> That is off of the TermSpans class. BTQ (BoostingTermQuery) is implemented to extend SpanQuery, thus SpanNearQuery isn't, w/o modification, going to have access to these things. However, if you look at SpanTermQuery, you will see that its implementation of Spans is indeed the TermSpans class. So, I think you could cast to it or handle it through instanceof.
>
> I am not completely sure here, but it seems like we may need an efficient way to access the TermPositions for each document. That is, the Spans class doesn't provide this and maybe it should somehow. Again, I am just thinking out loud here.

SpanQueries can be nested, so the relationship between a span and a term position can also be one to many, not only one to one. For example, a matching span in the Spans of a SpanNearQuery can be based on two matching (near enough to match) term positions.

> Thus, if we modified Spans to have the following methods:
>
>     byte[] getPayload(byte[] data, int offset)
>     boolean isPayloadAvailable()
>
> I think this would be useful. Perhaps this should be discussed on dev.

And the same holds for the payloads: there may be more than one for a single Span.

Regards,
Paul Elschot
Re: Payloads and PhraseQuery
Yep, totally agree. One way to handle this, initially at least, is to have isPayloadAvailable() only return true for the SpanTermQuery. The other option is to come up with some modification of the suggested methods below to return all the payloads in a span.

I have a basic implementation for just the SpanTermQuery (i.e. via TermSpans) in the works. I will take a crack at fleshing out the rest at some point soon.

-Grant

On Jul 12, 2007, at 1:22 PM, Paul Elschot wrote:

> On Thursday 12 July 2007 14:50, Grant Ingersoll wrote:
>
> > That is off of the TermSpans class. BTQ (BoostingTermQuery) is implemented to extend SpanQuery, thus SpanNearQuery isn't, w/o modification, going to have access to these things. However, if you look at SpanTermQuery, you will see that its implementation of Spans is indeed the TermSpans class. So, I think you could cast to it or handle it through instanceof.
> >
> > I am not completely sure here, but it seems like we may need an efficient way to access the TermPositions for each document. That is, the Spans class doesn't provide this and maybe it should somehow. Again, I am just thinking out loud here.
>
> SpanQueries can be nested, so the relationship between a span and a term position can also be one to many, not only one to one. For example, a matching span in the Spans of a SpanNearQuery can be based on two matching (near enough to match) term positions.
>
> > Thus, if we modified Spans to have the following methods:
> >
> >     byte[] getPayload(byte[] data, int offset)
> >     boolean isPayloadAvailable()
> >
> > I think this would be useful. Perhaps this should be discussed on dev.
>
> And the same holds for the payloads: there may be more than one for a single Span.
>
> Regards,
> Paul Elschot

--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
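The two options above can be put side by side in a small model. This is only a sketch under stated assumptions: PayloadSpans, TermPayloadSpans, NearPayloadSpans, and getPayloads() are hypothetical names, and nothing here reflects the implementation Grant mentions having in the works. isPayloadAvailable() is true only for a single-term span (the first option), while getPayloads() collates every payload under a possibly nested span (the second option).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy Spans carrying both options from the thread; not the Lucene API. */
abstract class PayloadSpans {
    abstract int start();
    abstract int end();
    /** First option: true only for spans backed by one term position. */
    abstract boolean isPayloadAvailable();
    /** Second option: every payload under this span, in sub-span order. */
    abstract List<byte[]> getPayloads();
}

final class TermPayloadSpans extends PayloadSpans {
    private final int position;
    private final byte[] payload; // null when the position has no payload

    TermPayloadSpans(int position, byte[] payload) {
        this.position = position;
        this.payload = payload;
    }

    int start() { return position; }
    int end() { return position + 1; }
    boolean isPayloadAvailable() { return payload != null; }

    List<byte[]> getPayloads() {
        if (payload == null) {
            return Collections.emptyList();
        }
        return Collections.singletonList(payload);
    }
}

final class NearPayloadSpans extends PayloadSpans {
    private final List<PayloadSpans> subSpans;

    NearPayloadSpans(List<PayloadSpans> subSpans) {
        this.subSpans = subSpans;
    }

    int start() {
        int min = Integer.MAX_VALUE;
        for (PayloadSpans s : subSpans) {
            min = Math.min(min, s.start());
        }
        return min;
    }

    int end() {
        int max = Integer.MIN_VALUE;
        for (PayloadSpans s : subSpans) {
            max = Math.max(max, s.end());
        }
        return max;
    }

    /** One span, many term positions: no single payload to hand back. */
    boolean isPayloadAvailable() { return false; }

    /** Recurses through nested spans, so the one-to-many case is covered. */
    List<byte[]> getPayloads() {
        List<byte[]> all = new ArrayList<>();
        for (PayloadSpans s : subSpans) {
            all.addAll(s.getPayloads());
        }
        return all;
    }
}
```

Returning a list is one way to address Paul's point that a span may rest on several term positions; whether a real API should allocate a list per match is a separate performance question that this sketch does not try to answer.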
Re: Payloads and PhraseQuery
Grant,

If/when you have an implementation for SpanNearQuery, I'd be happy to test it.

Peter

On 7/12/07, Grant Ingersoll wrote:

> Yep, totally agree. One way to handle this, initially at least, is to have isPayloadAvailable() only return true for the SpanTermQuery. The other option is to come up with some modification of the suggested methods below to return all the payloads in a span.
>
> I have a basic implementation for just the SpanTermQuery (i.e. via TermSpans) in the works. I will take a crack at fleshing out the rest at some point soon.
>
> -Grant
Re: Payloads and PhraseQuery
: That is off of the TermSpans class. BTQ (BoostingTermQuery) is
...
: I am not completely sure here, but it seems like we may need an
: efficient way to access the TermPositions for each document. That
: is, the Spans class doesn't provide this and maybe it should
...
: I'm looking for Spans.getPositions(), as shown in
...
: : I'm now looking at using payloads with SpanNearQuery but I don't see any
: : clear way of getting the payload(s) from the matching span terms. The

Hmm... okay, so the issue is that in order to get the payload data, you have to have a TermPositions instance.

Instead of adding getPayload methods to the Spans class (which, as Paul points out, can have nesting issues), perhaps more general solutions would be:

a) a more high-level getPayload API that lets you get a payload arbitrarily for a doc/position (perhaps as part of the TermDocs API?) ... then for Spans you could use this new API with Spans.start() and Spans.end() (and all the positions in between)

b) add a variation of the TermPositions class to allow people to iterate through the terms of a TermDoc in position order (TermPosition first iterates over the Terms and then over the positions) ... then you could seek(span.start()) to get the Payload data

c) add methods to the Spans API to get the subspans (if any) ... this would be the Spans corollary to getTerms() and would always return TermSpans, which would have TermPositions for getting payload data.

-Hoss
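Option (a) can be pictured as a payload lookup keyed by document and position. The sketch below is purely hypothetical: PositionPayloads is an invented in-memory structure, it stores at most one payload per position, and it assumes the span's end is exclusive. It only shows how a caller could combine such a lookup with the start and end of a span.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical position-keyed payload store; an illustration, not a Lucene class. */
final class PositionPayloads {
    // doc id -> (position -> payload), with positions kept sorted
    private final Map<Integer, TreeMap<Integer, byte[]>> byDoc = new HashMap<>();

    void put(int doc, int position, byte[] payload) {
        byDoc.computeIfAbsent(doc, d -> new TreeMap<>()).put(position, payload);
    }

    /** Payloads at positions in [start, end) for one document, in position order. */
    List<byte[]> between(int doc, int start, int end) {
        TreeMap<Integer, byte[]> positions = byDoc.get(doc);
        if (positions == null) {
            return new ArrayList<>();
        }
        // subMap is inclusive of start and exclusive of end
        return new ArrayList<>(positions.subMap(start, end).values());
    }
}
```

A caller holding a matching span would ask for between(doc, span.start(), span.end()). Note that this returns payloads for every position inside the span, including positions that did not take part in the match.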
Re: Payloads and PhraseQuery
On Jul 12, 2007, at 6:12 PM, Chris Hostetter wrote:

> Hmm... okay, so the issue is that in order to get the payload data, you have to have a TermPositions instance.
>
> Instead of adding getPayload methods to the Spans class (which, as Paul points out, can have nesting issues), perhaps more general solutions would be:
>
> a) a more high-level getPayload API that lets you get a payload arbitrarily for a doc/position (perhaps as part of the TermDocs API?) ... then for Spans you could use this new API with Spans.start() and Spans.end() (and all the positions in between)

Not sure I follow this. I don't see the fit w/ TermDocs.

> b) add a variation of the TermPositions class to allow people to iterate through the terms of a TermDoc in position order (TermPosition first iterates over the Terms and then over the positions) ... then you could seek(span.start()) to get the Payload data
>
> c) add methods to the Spans API to get the subspans (if any) ... this would be the Spans corollary to getTerms() and would always return TermSpans, which would have TermPositions for getting payload data.

This could be a good alternative. When we first talked about payloads we wondered if we could just make all Queries into SpanQueries by passing TermPositions instead of term docs, but in the end decided not to do it because of performance issues (some of which are lessened by lazy loading of TermPositions).

The thing is, I think, that the Spans is already moving you along in the term positions, so it just seems like a natural fit to have it there, even if there is nesting. It doesn't seem like it would be that hard to then return the nesting stuff, b/c you are just collating the results from the underlying SpanTermQuery. Having said that, I haven't looked into the actual code, so take that w/ a grain of salt.

I will try to do some more investigation, as others are welcome to do. Perhaps we should move this to dev?

Cheers,
Grant
Re: Payloads and PhraseQuery
I'm now looking at using payloads with SpanNearQuery but I don't see any clear way of getting the payload(s) from the matching span terms. The term positions for the payloads seem to be buried beneath SpanCells in the NearSpansOrdered and NearSpansUnordered classes, which are not public. I'd be content to be able to get the payload from just the first term of the span.

Can anyone suggest an approach for making payloads work with SpanNearQuery?

Peter

On 6/27/07, Grant Ingersoll wrote:

> Could you get what you need combining the BoostingTermQuery with a SpanNearQuery to produce a score? Just guessing here...
>
> At some point, I would like to see more Query classes around the payload stuff, so please submit patches/feedback if and when you get a solution.
Re: Payloads and PhraseQuery
: I'm now looking at using payloads with SpanNearQuery but I don't see any
: clear way of getting the payload(s) from the matching span terms. The term
: positions for the payloads seem to be buried beneath SpanCells in the

Aren't Spans.start() and Spans.end() what you are looking for?

-Hoss
Re: Payloads and PhraseQuery
I tried to subclass PhraseScorer, but discovered that it's an abstract class and its subclasses (ExactPhraseScorer and SloppyPhraseScorer) are final classes. So instead, I extended Scorer with my custom scorer and extended PhraseWeight (after making it public).

My scorer's constructor is passed the instance of PhraseScorer created by PhraseQuery.scorer(). My scorer's 'next' and 'skipTo' methods call the PhraseScorer's methods first, and if the result is 'true', the payload is loaded and used to determine whether or not the PhraseScorer's doc is a hit. If not, PhraseScorer.next() or skipTo() is called again.

In order to get the payload, I modified PhraseQuery to save the TermPositions array it creates for its scorers and added a 'get' method. The diff is included below. This is probably not the best solution, but it is at least a starting point for further discussion.

Here's the diff:

Index: PhraseQuery.java
===================================================================
--- PhraseQuery.java (revision 551992)
+++ PhraseQuery.java (working copy)
@@ -36,7 +36,8 @@
   private Vector terms = new Vector();
   private Vector positions = new Vector();
   private int slop = 0;
-
+  private TermPositions[] tps;
+
   /** Constructs an empty phrase query. */
   public PhraseQuery() {}
@@ -104,7 +105,7 @@
     return result;
   }
-  private class PhraseWeight implements Weight {
+  public class PhraseWeight implements Weight {
     private Similarity similarity;
     private float value;
     private float idf;
@@ -138,7 +139,7 @@
       if (terms.size() == 0) // optimize zero-term case
         return null;
-      TermPositions[] tps = new TermPositions[terms.size()];
+      tps = new TermPositions[terms.size()];
       for (int i = 0; i < terms.size(); i++) {
         TermPositions p = reader.termPositions((Term)terms.elementAt(i));
         if (p == null)
@@ -155,7 +156,9 @@
       reader.norms(field));
     }
-
+public TermPositions[] getTermPositions() {
+return tps;
+}
     public Explanation explain(IndexReader reader, int doc)
       throws IOException {

On 6/27/07, Mark Miller wrote:

> You cannot do it because TermPositions is read in the PhraseWeight.scorer(IndexReader) method (or MultiPhraseWeight) and loaded into an array which is passed to PhraseScorer. Extend the Weight as well and pass the payload to the Scorer as well is a possibility.
>
> - Mark
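The wrapping scorer described above boils down to a filtering loop around next() and skipTo(). The sketch below models that control flow with toy types; DocScorer, PayloadCheck, PayloadFilteringScorer, and ArrayScorer are hypothetical names, and the payload test is reduced to a predicate on the document number because the real payload access depends on the patched PhraseQuery.

```java
/** Toy scorer interface with the next()/skipTo()/doc() shape used in the message. */
interface DocScorer {
    boolean next();
    boolean skipTo(int target);
    int doc();
}

/** Hypothetical hook: loads the payload for a match and decides whether it is a hit. */
interface PayloadCheck {
    boolean accept(int doc);
}

/** Delegates to the wrapped scorer and skips matches the payload check rejects. */
final class PayloadFilteringScorer implements DocScorer {
    private final DocScorer phraseScorer;
    private final PayloadCheck check;

    PayloadFilteringScorer(DocScorer phraseScorer, PayloadCheck check) {
        this.phraseScorer = phraseScorer;
        this.check = check;
    }

    public boolean next() {
        // Keep advancing the wrapped scorer until a match passes the check.
        while (phraseScorer.next()) {
            if (check.accept(phraseScorer.doc())) {
                return true;
            }
        }
        return false;
    }

    public boolean skipTo(int target) {
        if (!phraseScorer.skipTo(target)) {
            return false;
        }
        // The first match at or after target may be rejected; fall back to next().
        return check.accept(phraseScorer.doc()) || next();
    }

    public int doc() {
        return phraseScorer.doc();
    }
}

/** Array-backed scorer used only to exercise the wrapper. */
final class ArrayScorer implements DocScorer {
    private final int[] docs; // sorted document numbers
    private int i = -1;

    ArrayScorer(int[] docs) {
        this.docs = docs;
    }

    public boolean next() {
        return ++i < docs.length;
    }

    public boolean skipTo(int target) {
        // Always advances at least once, then stops at the first doc >= target.
        do {
            if (!next()) {
                return false;
            }
        } while (docs[i] < target);
        return true;
    }

    public int doc() {
        return docs[i];
    }
}
```

Wrapping keeps the final ExactPhraseScorer and SloppyPhraseScorer classes untouched, at the cost of needing some side channel to the TermPositions, which is what the diff above adds.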
Re: Payloads and PhraseQuery
You cannot do it because TermPositions is read in the PhraseWeight.scorer(IndexReader) method (or MultiPhraseWeight) and loaded into an array which is passed to PhraseScorer. Extending the Weight as well and passing the payload to the Scorer is a possibility.

- Mark

Peter Keegan wrote:

> I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase. I would like to create a custom query class that extends PhraseQuery and uses its PhraseScorer to find matching documents. The custom query class then reads the payload from the first term of the matching query and uses it to produce a new score. However, I don't see how to get the payload from the PhraseScorer's TermPositions. Is this possible?
>
> Peter
Re: Payloads and PhraseQuery
Could you get what you need combining the BoostingTermQuery with a SpanNearQuery to produce a score? Just guessing here...

At some point, I would like to see more Query classes around the payload stuff, so please submit patches/feedback if and when you get a solution.

On Jun 27, 2007, at 10:45 AM, Peter Keegan wrote:

> I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase. I would like to create a custom query class that extends PhraseQuery and uses its PhraseScorer to find matching documents. The custom query class then reads the payload from the first term of the matching query and uses it to produce a new score. However, I don't see how to get the payload from the PhraseScorer's TermPositions. Is this possible?
>
> Peter

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ