[jira] [Commented] (LUCENE-5181) Passage knows its own docID
[ https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13750086#comment-13750086 ]

Jon Stewart commented on LUCENE-5181:
-------------------------------------

{quote}
How does highlighting fit into this?
{quote}

1. It's convenient to have the Passage snippet, for showing context around the term matches.
2. It's very unclear how to actually retrieve term matches per doc given a TopDocs result set. This was the subject of recent [mailing list traffic|http://search-lucene.com/m/XhEx62ewjeQ/hits+coordinatessubj=How+to+get+hits+coordinates+in+Lucene+4+4+0], and the PostingsHighlighter was suggested as the efficient way to do this, which led to this JIRA being opened.

Passage knows its own docID
---------------------------

Key: LUCENE-5181
URL: https://issues.apache.org/jira/browse/LUCENE-5181
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.4
Reporter: Jon Stewart
Priority: Minor

The new PostingsHighlight package allows for retrieval of term matches from a query if one creates a class that extends PassageFormatter and overrides format(). However, class Passage does not have a docID field, nor is this provided via PassageFormatter.format(). Therefore, it's very difficult to know which Document contains a given Passage.

It would suffice for PassageFormatter.format() to be passed the docID as a parameter. From the code in PostingsHighlight, this seems like it would be easy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5181) Passage knows its own docID
[ https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749363#comment-13749363 ]

Jon Stewart commented on LUCENE-5181:
-------------------------------------

Sure. I'm working in a high recall/low precision domain, where a large portion of the source documents are irrelevant junk. For their review, users are often presented with a match-oriented table view rather than a document-oriented table view, i.e., each row in the table represents a term match, generally with some context, and is joined with some document metadata. I can use the PassageFormatter to get access to the Passages in a result set, but it is hard to generate this table view without knowing which Document goes with the Passage.

Additionally, a research problem I'm working on is using a combination of match properties and Document properties to score the individual matches (including metadata, like file type, created dates, etc.). The properties get normalized and fed into liblinear and out comes a score for us to sort on. This, too, is difficult without having the Document.

Happy to contribute a patch if there's consensus. Passing in the docID via PassageFormatter.format is what I did, but that breaks backwards compatibility. It'd be easy enough to set on Passage as a field.
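The match-oriented table described in the comment above can be illustrated with a minimal sketch. It assumes the docID is handed to the formatter together with the passages, which is one of the two options discussed in the issue. The types below (Passage, DocAwareFormatter, MatchRow, RowCollector) are simplified, hypothetical stand-ins written for this illustration; they are not Lucene's classes and do not reflect the signature of the real PassageFormatter.format().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MatchTableSketch {

    /** Simplified stand-in for a highlighter passage; NOT Lucene's Passage class. */
    static final class Passage {
        final int startOffset;
        final int endOffset;

        Passage(int startOffset, int endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Hypothetical formatter shape: the docID is passed in alongside the passages. */
    interface DocAwareFormatter {
        String format(int docID, List<Passage> passages, String content);
    }

    /** One row of the match-oriented table: which document, and the matching snippet. */
    static final class MatchRow {
        final int docID;
        final String snippet;

        MatchRow(int docID, String snippet) {
            this.docID = docID;
            this.snippet = snippet;
        }

        @Override
        public String toString() {
            return "doc=" + docID + " snippet=" + snippet;
        }
    }

    /** Records one row per passage instead of building a highlighted string. */
    static final class RowCollector implements DocAwareFormatter {
        final List<MatchRow> rows = new ArrayList<>();

        @Override
        public String format(int docID, List<Passage> passages, String content) {
            for (Passage p : passages) {
                rows.add(new MatchRow(docID, content.substring(p.startOffset, p.endOffset)));
            }
            return ""; // the formatted string is not needed for the table view
        }
    }

    public static void main(String[] args) {
        RowCollector collector = new RowCollector();
        // Made-up documents and offsets, purely for illustration.
        collector.format(7, Arrays.asList(new Passage(0, 5)), "hello world");
        collector.format(9, Arrays.asList(new Passage(4, 9), new Passage(10, 15)),
                "the quick brown fox");
        for (MatchRow row : collector.rows) {
            System.out.println(row);
        }
    }
}
```

Each row carries only the integer docID, so document metadata can be joined afterwards, once per document rather than once per match. The alternative mentioned in the comment, storing the docID as a field on Passage, would drop the extra parameter and so avoid changing the format() signature.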
[jira] [Commented] (LUCENE-5181) Passage knows its own docID
[ https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749414#comment-13749414 ]

Robert Muir commented on LUCENE-5181:
-------------------------------------

{quote}
For their review, users are often presented with a match-oriented table view rather than a document-oriented table view, i.e., each row in the table represents a term match, generally with some context, and is joined with some document metadata.
{quote}

How does highlighting fit into this?

My general concern is that passing docid/encouraging the use of o.a.l.document.Document within passage-processing will mean that people are retrieving from the stored fields for every single match, and this would be very slow.

Are you using highlighting to rank the most relevant sentences or do you really want to enumerate term matches? In the latter case Query.extractTerms() + TermsEnum.docsAndPositionsEnum(FLAG_OFFSETS) would be much more efficient.
[jira] [Commented] (LUCENE-5181) Passage knows its own docID
[ https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747493#comment-13747493 ]

Luca Cavanna commented on LUCENE-5181:
--------------------------------------

True, having the doc id would be useful there. Why not add it directly to the Passage, to be able to know which document the Passage comes from?
[jira] [Commented] (LUCENE-5181) Passage knows its own docID
[ https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747499#comment-13747499 ]

Robert Muir commented on LUCENE-5181:
-------------------------------------

Can you give a concrete example where docid is actually useful?
[jira] [Commented] (LUCENE-5181) Passage knows its own docID
[ https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13744139#comment-13744139 ]

Michael McCandless commented on LUCENE-5181:
--------------------------------------------

+1, I think this (pass docID as a parameter to PassageFormatter.format) is reasonable?