RE: [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
Hello Grant, Lance and Joern,

I have been developing a 'similarity' component for OpenNLP that can be plugged into Solr. This component performs relevance assessment by matching the parse tree of a query against the parse trees of candidate answers. The idea of this component is that a search engineer does not need to be familiar with linguistics: he or she just plugs in SyntGenRequestHandler for longer queries or longer texts, and checks whether it improves relevance. There are many other applications of the OpenNLP similarity component besides search, which live as JUnit tests, such as semantic filtering for speech recognition, content generation, and automatic code generation from natural language.

This component is about to be released, hopefully, and is currently here: https://issues.apache.org/jira/browse/OPENNLP-497 It sounds like it is complementary to LUCENE-2899.

Regards,
Boris

Date: Mon, 1 Oct 2012 00:35:07 +1100
From: j...@apache.org
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466478#comment-13466478 ]

mailformailingli...@yahoo.de commented on LUCENE-2899:

Could you please create a new patch for the current trunk? I had some problems applying it to my working copy. I am not entirely sure whether it's the trunk or your code, but it seems like your OpenNLP code only works for the first request. As far as I was able to debug, the create() method of the TokenFilterFactory is only called every now and again (are created TokenFilters reused for longer than one call in Solr?). If create() of your FilterFactory was called, everything works. However, if the TokenFilter is somehow reused, it fails. Is this a bug in Solr or in your patch?
Add OpenNLP Analysis capabilities as a module

Key: LUCENE-2899
URL: https://issues.apache.org/jira/browse/LUCENE-2899
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, opennlp_trunk.patch

Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:

* Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
* Named Entity Recognition as a TokenFilter

We are also planning a Tokenizer/TokenFilter that can put parts of speech either as payloads (PartOfSpeechAttribute?) on a token or at the same position. I'd propose it go under: modules/analysis/opennlp

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
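The "sentence detection as a Tokenizer" idea above can be illustrated with a stand-in detector: a real implementation would delegate to an OpenNLP sentence detector and subclass Lucene's Tokenizer, but the regex split and all names below are hypothetical placeholders, not the code from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceTokens {
    // A token carrying the character offsets a Lucene Tokenizer would record.
    static class Token {
        final String text; final int start; final int end;
        Token(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
    }

    // Emit one token per detected sentence, keeping offsets into the input.
    // Placeholder detector: a sentence runs up to '.', '!' or '?'.
    static List<Token> detect(String input) {
        List<Token> out = new ArrayList<>();
        Matcher m = Pattern.compile("[^.!?]+[.!?]?").matcher(input);
        while (m.find()) {
            String s = m.group().trim();
            if (!s.isEmpty()) out.add(new Token(s, m.start(), m.end()));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : detect("OpenNLP is an ASF project. It has a nice license.")) {
            System.out.println(t.start + "-" + t.end + ": " + t.text);
        }
    }
}
```

Buffering whole sentences as tokens like this is what makes the TokenFilter variant mentioned above awkward: a filter sees one term at a time, so it would have to buffer tokens before it could emit sentence boundaries.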
eliminating scoring for the sake of efficiency
Hello,

We don't need any scoring in our application domain, but efficiency is key, because we get tens of thousands of hits for span queries and need to collect all of them. Is there a simple way to turn scoring off during indexing, during search, and while delivering document IDs, to save time?

Best regards,
Boris
accelerate hits.id(i) function: eliminating scoring for the sake of efficiency
Yes, thanks Paul. We are already using getSpans() on the top-level SpanQuery, looping with next() on the Spans, and ignoring duplicate doc() values from the Spans in that loop. A counter in the loop also gives us the number of matching occurrences of the SpanQuery. I will look into whether NearSpansOrdered might be a bit faster than NearSpans.

However, what significantly slows us down is the hits.id(i) function. Can we accelerate it somehow by removing scoring from the Lucene code itself?

Best regards,
Boris

On Thursday 11 May 2006 22:42, Boris Galitsky wrote:
> Hello
> We don't need any scoring in our application domain, but efficiency is the key because we are getting tens of thousands of hits for span queries; all these hits are necessary to collect. Is there a simple way to turn scoring off while indexing, while searching, and while delivering document IDs to save time?

You could use getSpans() on the top-level SpanQuery, and use a loop calling next() on the Spans, ignoring duplicate doc() values from the Spans in that loop. A counter in the loop would also give you the number of matching occurrences of the SpanQuery. This way of using the Spans directly should be slightly more efficient than using a HitCollector, but don't hold your breath.

In case you have ordered SpanQuerys without overlaps, the NearSpansOrdered here might be a bit faster than the NearSpans currently in Lucene: http://issues.apache.org/jira/browse/LUCENE-413 (you'll also need the patch to SpanNearQuery).

Regards,
Paul Elschot
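The loop Paul describes can be sketched as follows. A minimal stand-in is used here in place of Lucene's Spans (the method names next() and doc() match the 2006-era API; the match stream in main() is hypothetical data):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SpansLoop {
    // Minimal stand-in for the 2006-era org.apache.lucene.search.spans.Spans API.
    interface Spans {
        boolean next();   // advance to the next match; false when exhausted
        int doc();        // internal document id of the current match
    }

    // Iterate all span matches, collecting distinct doc() values (the Set
    // ignores duplicates) and counting total matching occurrences.
    static int[] collect(Spans spans, Set<Integer> docs) {
        int occurrences = 0;
        while (spans.next()) {
            occurrences++;
            docs.add(spans.doc());
        }
        return new int[] { docs.size(), occurrences };
    }

    public static void main(String[] args) {
        // Hypothetical match stream: doc ids arrive in index order, with repeats
        // when a document contains more than one span match.
        final int[] ids = { 0, 0, 2, 5, 5, 5, 9 };
        Spans spans = new Spans() {
            int i = -1;
            public boolean next() { return ++i < ids.length; }
            public int doc() { return ids[i]; }
        };
        int[] result = collect(spans, new LinkedHashSet<Integer>());
        System.out.println(result[0] + " docs, " + result[1] + " occurrences");
        // → 4 docs, 7 occurrences
    }
}
```

Because the ids arrive in index order, duplicates are always adjacent, so comparing against the previous doc() value would work just as well as a Set.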
SRND query
Hello,

We need to construct nested span queries, and it seems like SrndQuery is a good way to do it. Are there examples available for SrndQuerys? How does one construct them (is it through a QueryParser?), where does one get the surround parser, and how are they run?

Best regards,
Boris
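For reference, the surround query language (Lucene's contrib/surround module) composes ordered (W) and unordered (N) proximity operators with optional distances, and the operators nest, which is what makes it suitable for nested span queries. The terms below are made up for illustration; this is a sketch of the syntax as documented, not output from a particular Lucene version:

```text
aa W bb               -- aa followed by bb, adjacent (ordered, distance 1)
aa 3W bb              -- aa before bb, within 3 positions (ordered)
aa N bb               -- aa near bb, in either order (unordered)
20N(3W(aa, bb), cc)   -- the ordered pair (aa, bb) within 20 positions of cc
aa AND (bb OR cc*)    -- boolean operators and truncation also compose
```

The surround parser turns such strings into SrndQuery objects, which are in turn translated into SpanQuerys for execution.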
how to match Documents from Hits with Documents from Query Spans?
Hello,

I am using span queries to get hits (Documents) and occurrences (positions) of the search terms within these documents. For some reason, there is a disagreement between the order in which Documents are returned in Hits and the order in which Documents are referenced (via an order number, starting from 0) in the Spans. The problem is depicted in the diagram below:

              Query
             /     \
        Hits        Spans
    (Documents)     (doc(), start(), end())

Lucene takes a Query and gives back Hits with the resultant Documents, while the occurrences of the search expression are obtained from the Query's Spans. Why is there such odd logic? Again, how does one match Documents from Hits with Documents from the Query's Spans?

Regards,
Boris
How to get Document (or filename) from Span
Thanks a lot Hoss.

The question is that when I get Spans, I get start/end positions and a Document order number (starting from 0), not the Document object itself from which I could get a filename. Since I believe there is no way to get a Document object from Spans, and there is no such thing as a Document ID in Lucene (right?), I attempt to keep the same order for Hits and for Spans (the indexing order) and retrieve the Document for each Spans match this way. I will try to prepare a test case. It works so far, but I am afraid it will be unstable.

Best regards,
Boris

On Tue, 18 Apr 2006 10:29:30 -0700 (PDT), Chris Hostetter wrote:

: For some reason, there is a disagreement between the order the
: Documents are returned in hits, and the Documents are referenced (via
: order number, starting from 0) in the Spans?

When dealing with a Hits instance, documents are iterated over in results order -- which may be by score, or may be by some other sort you've specified. When dealing with a Spans instance, I believe the matches are iterated over in index order. Besides the performance reasons why this may be true, you also have to keep in mind that the Spans instance has no idea what ordering you may have used when you executed your search -- even if it assumed you sorted by score, the SpanQuery may have been part of a much larger, more complicated query in which the final scores were vastly different.

If I've misunderstood your problem, could you please post a JUnit test case that builds a small index in a RAMDirectory, with some code that demonstrates what you expect to happen and how it fails?

-Hoss
Re: How to get Document (or filename) from Span
I fully understand now. Thanks a lot.

Boris

On Tue, 18 Apr 2006 11:10:20 -0700 (PDT), Chris Hostetter wrote:

: The question is when I get Spans, I get start/end positions and a
: Document order (starting from 0), not the Document object itself from

Are you sure about that? Spans.doc() should return the internal document identifier, which you can pass to IndexReader.doc(int).

: which I could get a filename. Since I believe there is no way to get a
: Document object from Spans, and there is no such thing as Document ID
: in Lucene (right?) I attempt to have the same order for
: Hits and for Spans (the indexing order) and retrieve Document for each
: Spans this way.

Documents do have Document IDs, assigned based on index order; that's what Hits.id() returns.

FYI: take a look at TestSpans.testSpanNearOrderedOverlap for an example of how the Spans class works. (It's what I'm using as the basis for my suggestion on how to use the class -- I've never used it myself.)

-Hoss
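The pattern Hoss describes — treat Spans.doc() as an index-order document id and resolve it through the reader, rather than relying on Hits and Spans sharing an iteration order — can be sketched with a minimal stand-in for the reader (the field data and the "filename" idea are hypothetical; in real Lucene you would call IndexReader.doc(int) and read a stored field):

```java
import java.util.Arrays;
import java.util.List;

public class SpanToFilename {
    // Stand-in for IndexReader: doc(int) resolves an internal document id
    // (assigned in indexing order) to that document's stored filename.
    static class Reader {
        private final List<String> filenames;  // stored "filename" field, indexed by doc id
        Reader(List<String> filenames) { this.filenames = filenames; }
        String doc(int id) { return filenames.get(id); }
    }

    // Given a span match's doc id (what Spans.doc() returns),
    // look up the filename of the containing document directly.
    static String filenameForSpan(Reader reader, int spanDocId) {
        return reader.doc(spanDocId);
    }

    public static void main(String[] args) {
        Reader reader = new Reader(Arrays.asList("a.txt", "b.txt", "c.txt"));
        // A span match reported in doc 2 resolves without any assumption
        // about the order in which Hits returned its documents.
        System.out.println(filenameForSpan(reader, 2));  // → c.txt
    }
}
```

This is why the "keep Hits and Spans in the same order" workaround from the earlier message is unnecessary: the doc id itself is the join key between the two APIs.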