[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-26 Thread Jon Stewart (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750086#comment-13750086
 ] 

Jon Stewart commented on LUCENE-5181:
-

{quote}
How does highlighting fit into this?
{quote}

1. It's convenient to have the Passage snippet, for showing context around the 
term matches.

2. It's very unclear how to actually retrieve term matches per doc given a 
TopDocs result set. This was the subject of recent [mailing list 
traffic|http://search-lucene.com/m/XhEx62ewjeQ/hits+coordinates&subj=How+to+get+hits+coordinates+in+Lucene+4+4+0],
 and the PostingsHighlighter was suggested as the efficient way to do this, 
which led to this JIRA being opened.


 Passage knows its own docID
 ---

 Key: LUCENE-5181
 URL: https://issues.apache.org/jira/browse/LUCENE-5181
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.4
Reporter: Jon Stewart
Priority: Minor

 The new PostingsHighlight package allows for retrieval of term matches from a 
 query if one creates a class that extends PassageFormatter and overrides 
 format(). However, class Passage does not have a docID field, nor is this 
 provided via PassageFormatter.format(). Therefore, it's very difficult to 
 know which Document contains a given Passage.
 It would suffice for PassageFormatter.format() to be passed the docID as a 
 parameter. From the code in PostingsHighlight, this seems like it would be 
 easy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-24 Thread Jon Stewart (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749363#comment-13749363
 ] 

Jon Stewart commented on LUCENE-5181:
-

Sure. I'm working in a high recall/low precision domain, where a large portion 
of the source documents are irrelevant junk. For their review, users are often 
presented with a match-oriented table view rather than a document-oriented 
table view, i.e., each row in the table represents a term match, generally with 
some context, and is joined with some document metadata.

I can use the PassageFormatter to get access to the Passages in a result set, 
but it is hard to generate this table view without knowing which Document goes 
with the Passage. Additionally, a research problem I'm working on is using a 
combination of match properties and Document properties to score the individual 
matches (including metadata, like file type, created dates, etc.). The 
properties get normalized and fed into liblinear and out comes a score for us 
to sort on. This, too, is difficult without having the Document.

Happy to contribute a patch if there's consensus. Passing in the docID via 
PassageFormatter.format is what I did, but that breaks backwards compatibility. 
It'd be easy enough to set on Passage as a field.
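
To make the compatibility trade-off concrete, here is a minimal sketch, not the actual patch. SketchPassage, SketchFormatter, and MatchRowFormatter are hypothetical stand-ins and are not Lucene classes. The sketch assumes a docID-aware overload could sit beside the existing two-argument method and delegate to it by default, so formatters written against the old signature would keep working.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a highlighter passage; NOT Lucene's Passage class.
final class SketchPassage {
    final int startOffset;
    final int endOffset;

    SketchPassage(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

// Hypothetical formatter base class, loosely modeled on the idea in this issue.
abstract class SketchFormatter {
    // Existing-style entry point: no docID available.
    abstract String format(SketchPassage[] passages, String content);

    // Assumed new overload: receives the docID. The default delegates to the
    // old method, so subclasses that only know the old signature still work.
    String format(int docID, SketchPassage[] passages, String content) {
        return format(passages, content);
    }
}

// Emits one "docID<TAB>start<TAB>end<TAB>snippet" row per passage.
final class MatchRowFormatter extends SketchFormatter {
    final List<String> rows = new ArrayList<String>();

    @Override
    String format(SketchPassage[] passages, String content) {
        return format(-1, passages, content); // docID unknown on the old path
    }

    @Override
    String format(int docID, SketchPassage[] passages, String content) {
        StringBuilder sb = new StringBuilder();
        for (SketchPassage p : passages) {
            String row = docID + "\t" + p.startOffset + "\t" + p.endOffset + "\t"
                    + content.substring(p.startOffset, p.endOffset);
            rows.add(row);
            sb.append(row).append('\n');
        }
        return sb.toString();
    }
}

final class MatchRowDemo {
    public static void main(String[] args) {
        MatchRowFormatter formatter = new MatchRowFormatter();
        SketchPassage[] passages = { new SketchPassage(0, 5), new SketchPassage(6, 11) };
        System.out.print(formatter.format(42, passages, "hello world"));
    }
}
```

Each row carries only the docID and offsets, so joining with document metadata can be deferred to whatever builds the table, rather than done inside the formatter.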




[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749414#comment-13749414
 ] 

Robert Muir commented on LUCENE-5181:
-

{quote}
For their review, users are often presented with a match-oriented table view 
rather than a document-oriented table view, i.e., each row in the table 
represents a term match, generally with some context, and is joined with some 
document metadata.
{quote}

How does highlighting fit into this?

My general concern is that passing the docid, and so encouraging the use of 
o.a.l.document.Document within passage-processing, will mean that people are 
retrieving from the stored fields for every single match, and this would be 
very slow.

Are you using highlighting to rank the most relevant sentences or do you really 
want to enumerate term matches? In the latter case Query.extractTerms() + 
TermsEnum.docsAndPositionsEnum(FLAG_OFFSETS) would be much more efficient.
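
As a rough illustration of the second option (enumerating term matches straight from postings instead of going through the highlighter), here is a toy model. It does not call Lucene at all: ToyPostings and TermMatch are hypothetical names, and the in-memory map only stands in for postings indexed with offsets. The point it tries to show is that every match row (docID, term, start, end) can be produced without loading any stored fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// One term occurrence: which doc, which term, and its character offsets.
final class TermMatch {
    final int docID;
    final String term;
    final int start;
    final int end;

    TermMatch(int docID, String term, int start, int end) {
        this.docID = docID;
        this.term = term;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return docID + ":" + term + "[" + start + "," + end + ")";
    }
}

// Toy in-memory postings: term -> (docID -> list of {start, end} offsets).
// This only models postings with offsets; it is NOT Lucene's API.
final class ToyPostings {
    private final Map<String, TreeMap<Integer, List<int[]>>> postings =
            new TreeMap<String, TreeMap<Integer, List<int[]>>>();

    void add(String term, int docID, int start, int end) {
        TreeMap<Integer, List<int[]>> docs = postings.get(term);
        if (docs == null) {
            docs = new TreeMap<Integer, List<int[]>>();
            postings.put(term, docs);
        }
        List<int[]> offsets = docs.get(docID);
        if (offsets == null) {
            offsets = new ArrayList<int[]>();
            docs.put(docID, offsets);
        }
        offsets.add(new int[] { start, end });
    }

    // Enumerate every occurrence of the query terms, restricted to the hit docs.
    // Terms are visited in sorted order, and docs in docID order within a term.
    List<TermMatch> matches(Set<String> queryTerms, Set<Integer> hitDocs) {
        List<TermMatch> out = new ArrayList<TermMatch>();
        for (String term : new TreeSet<String>(queryTerms)) {
            TreeMap<Integer, List<int[]>> docs = postings.get(term);
            if (docs == null) {
                continue; // term does not occur in the index
            }
            for (Map.Entry<Integer, List<int[]>> e : docs.entrySet()) {
                if (!hitDocs.contains(e.getKey())) {
                    continue; // doc is not in the result set
                }
                for (int[] off : e.getValue()) {
                    out.add(new TermMatch(e.getKey(), term, off[0], off[1]));
                }
            }
        }
        return out;
    }
}

final class ToyPostingsDemo {
    public static void main(String[] args) {
        ToyPostings index = new ToyPostings();
        index.add("lucene", 0, 0, 6);
        index.add("passage", 0, 7, 14);
        Set<String> terms = new TreeSet<String>();
        terms.add("lucene");
        terms.add("passage");
        Set<Integer> hits = new TreeSet<Integer>();
        hits.add(0);
        System.out.println(index.matches(terms, hits));
    }
}
```

In a real index the query terms would come from the query and the offsets from the postings themselves; the sketch makes no claim about how those calls are spelled in any particular Lucene version.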





[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-22 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747493#comment-13747493
 ] 

Luca Cavanna commented on LUCENE-5181:
--

True, having the doc id would be useful there. Why not add it directly to 
the Passage, to be able to know which document the Passage comes from?




[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747499#comment-13747499
 ] 

Robert Muir commented on LUCENE-5181:
-

Can you give a concrete example where docid is actually useful?





[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13744139#comment-13744139
 ] 

Michael McCandless commented on LUCENE-5181:


+1, I think this (pass docID as a parameter to PassageFormatter.format) is 
reasonable?
