RE: [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2012-09-30 Thread Boris Galitsky

Hello Grant, Lance and Joern,
   I have been developing a 'similarity' component for OpenNLP that can be 
plugged into Solr. This component does relevance assessment by matching the 
parse tree of the query against the parse trees of candidate answers. The idea 
of this component is that a search engineer does not need to be familiar with 
linguistics: he just plugs in SyntGenRequestHandler for longer queries or 
longer texts and checks whether it improves relevance.
   There are many other applications of the OpenNLP similarity component 
besides search, which live as JUnit tests, such as semantic filtering for 
speech recognition, content generation, and automatic code generation from 
natural language. This component is about to be released, hopefully, and 
currently lives here:
https://issues.apache.org/jira/browse/OPENNLP-497
It sounds like it is complementary to LUCENE-2899.

Regards
Boris 





 Date: Mon, 1 Oct 2012 00:35:07 +1100
 From: j...@apache.org
 To: dev@lucene.apache.org
 Subject: [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities 
 as a module
 
 
 [ 
 https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466478#comment-13466478
  ] 
 
 mailformailingli...@yahoo.de commented on LUCENE-2899:
 --
 
 Could you please create a new patch for the current trunk? I had some 
 problems applying it to my working copy...
 
 I am not entirely sure whether it's the trunk or your code, but it seems like 
 your OpenNLP code only works for the first request.
 
 As far as I was able to debug, the create() method of the TokenFilterFactory 
 is only called every now and then (are created TokenFilters reused for 
 longer than one call in Solr?).
 
 If create() of your FilterFactory is called, everything works. However, if 
 the TokenFilter is somehow reused, it fails. 
 
 Is this a bug of Solr or of your Patch?
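
 A guess at the cause, based on how Solr reuses analysis chains: token 
 streams are cached and reused, so create() runs only when a fresh chain is 
 built, while reset() runs before each new input. A filter that sets up its 
 per-document state in the constructor will then work only on the first 
 request. A minimal sketch of the safe pattern (the class name and the 
 per-stream flag are hypothetical, not from the patch):

 public final class SentenceBoundaryFilter extends TokenFilter {
   private boolean atSentenceStart;   // per-document state

   protected SentenceBoundaryFilter(TokenStream input) {
     super(input);
     // do NOT initialize per-document state here: the constructor runs
     // only when create() is called, which may happen just once
   }

   @Override
   public void reset() throws IOException {
     super.reset();            // always reset the upstream chain too
     atSentenceStart = true;   // reinitialize per-document state here
   }

   @Override
   public boolean incrementToken() throws IOException {
     if (!input.incrementToken()) {
       return false;
     }
     atSentenceStart = false;
     return true;
   }
 }

 If the OpenNLP filters buffer the whole input in their constructors instead 
 of in reset(), that would match the symptom of working only on the first 
 request.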
 
  Add OpenNLP Analysis capabilities as a module
  -
 
  Key: LUCENE-2899
  URL: https://issues.apache.org/jira/browse/LUCENE-2899
  Project: Lucene - Core
   Issue Type: New Feature
   Components: modules/analysis
 Reporter: Grant Ingersoll
 Assignee: Grant Ingersoll
 Priority: Minor
  Attachments: LUCENE-2899.patch, LUCENE-2899.patch, 
  LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
  opennlp_trunk.patch
 
 
  Now that OpenNLP is an ASF project and has a nice license, it would be nice 
  to have a submodule (under analysis) that exposed capabilities for it. Drew 
  Farris, Tom Morton and I have code that does:
  * Sentence Detection as a Tokenizer (could also be a TokenFilter, although 
  it would have to change slightly to buffer tokens)
  * NamedEntity recognition as a TokenFilter
  We are also planning a Tokenizer/TokenFilter that can put parts of speech 
  as either payloads (PartOfSpeechAttribute?) on a token or at the same 
  position.
  I'd propose it go under:
  modules/analysis/opennlp
 
 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
  

eliminating scoring for the sake of efficiency

2006-05-11 Thread Boris Galitsky

Hello

   We don't need any scoring in our application domain, but 
efficiency is key because we are getting tens of thousands of hits for 
span queries, and all these hits need to be collected.
   Is there a simple way to turn scoring off while indexing, while 
searching, and while delivering document IDs, to save time?


Best regards
Boris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



accelerate hits.id(i) function: eliminating scoring for the sake of efficiency

2006-05-11 Thread Boris Galitsky

Yes, thanks Paul.

We are already using getSpans() on the top-level SpanQuery, with a loop 
calling next() on the Spans, ignoring duplicate doc() values from the 
Spans in that loop. A counter in the loop also gives us the number of 
matching occurrences of the SpanQuery.

I will look into whether NearSpansOrdered might be a bit faster than the 
NearSpans currently in Lucene.

However, what significantly slows us down is the hits.id(i) function.
Can we accelerate it somehow by stripping the scoring out of the Lucene 
code itself?


Best regards
Boris




On Thursday 11 May 2006 22:42, Boris Galitsky wrote:

Hello

We don't need any scoring in our application domain, but 
efficiency is key because we are getting tens of thousands of hits 
for span queries, and all these hits need to be collected.
Is there a simple way to turn scoring off while indexing, while 
searching, and while delivering document IDs, to save time?


You could use getSpans() on the top-level SpanQuery, with a loop 
calling next() on the Spans, ignoring duplicate doc() values from 
the Spans in that loop. A counter in the loop would also give you 
the number of matching occurrences of the SpanQuery.

This way of using the Spans directly should be slightly more 
efficient than using a HitCollector, but don't hold your breath.

In case you have ordered SpanQuerys without overlaps, the 
NearSpansOrdered here might be a bit faster than the NearSpans 
currently in Lucene:
http://issues.apache.org/jira/browse/LUCENE-413
(you'll also need the patch to SpanNearQuery).
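
The loop described above might look roughly like this (a sketch against 
the span API of that era; method names are from memory, so treat them as 
assumptions to check against your version):

Spans spans = spanQuery.getSpans(indexReader);
int lastDoc = -1;
int docCount = 0;     // distinct matching documents
int matchCount = 0;   // total occurrences of the SpanQuery
while (spans.next()) {
  matchCount++;
  if (spans.doc() != lastDoc) {
    // Spans advance in document order, so duplicate doc() values
    // are consecutive and a single "last seen" check suffices
    lastDoc = spans.doc();
    docCount++;
  }
}

No Scorer is ever created this way, which is the point of the suggestion.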

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







SRND query

2006-04-25 Thread Boris Galitsky

Hello


We need to construct nested span queries, and it seems like SrndQuery 
is a good way to do it.


Are there examples available for SrndQueries? How do we construct them 
(is it via a QueryParser?)

Where can we get the surround parser?

How do we run them?
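
For what it's worth, a sketch of how the surround parser is typically 
driven (package and method names are from memory of the contrib/surround 
area, so treat them as assumptions to verify against your checkout):

import org.apache.lucene.queryParser.surround.parser.QueryParser;
import org.apache.lucene.queryParser.surround.query.BasicQueryFactory;
import org.apache.lucene.queryParser.surround.query.SrndQuery;

// "3W(a, b)": a before b within distance 3; "N" would allow either order
SrndQuery srnd = QueryParser.parse("3W(lucene, span)");

// Translate the surround tree into a (nested) SpanQuery for the given
// field; the factory caps how many basic queries may be generated
org.apache.lucene.search.Query query =
    srnd.makeLuceneQueryField("contents", new BasicQueryFactory(1024));
// query can now be handed to searcher.search(query)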

Best regards
Boris


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



how to match Documents from Hits with Documents from Query Spans?

2006-04-18 Thread Boris Galitsky

Hello

 

I am using span queries to get hits (Documents) and occurrences 
(positions) of search terms within these documents.


For some reason, there is a disagreement between the order in which the 
Documents are returned in Hits and the order in which the Documents are 
referenced (via an order number, starting from 0) in the Spans.


 


The problem is depicted in the diagram below:

  Query ==> Lucene ==> Hits ==> Documents
    |                              ^
    +--> Spans: doc(), start(), end()
         (how to match these back to the Documents above?)



 

Lucene gets a Query and gives back Hits with the resultant Documents, 
while the occurrences of the search expression are obtained from the 
Query via Spans. Why is the logic split this way? Again, how do I match 
Documents from Hits with Documents from Query Spans?


 


Regards

Boris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How to get Document (or filename) from Span

2006-04-18 Thread Boris Galitsky

Thanks a lot Hoss

The question is that when I get Spans, I get start/end positions and a 
document order number (starting from 0), not the Document object itself 
from which I could get a filename. Since I believe there is no way to get 
a Document object from Spans, and there is no such thing as a document ID 
in Lucene (right?), I attempt to keep the same order for
Hits and for Spans (the indexing order) and retrieve the Document for 
each Spans match this way.


I will try to prepare a test case. It works so far but I am afraid it 
will be unstable.


Best regards
Boris



On Tue, 18 Apr 2006 10:29:30 -0700 (PDT)
 Chris Hostetter [EMAIL PROTECTED] wrote:


: For some reason, there is a disagreement between the order the
: Documents are returned in hits, and the Documents are referenced (via
: order number, starting from 0) in the Spans?

When dealing with a Hits instance, documents are iterated over in 
results order -- which may be by score, or may be by some other sort 
you've specified.

When dealing with a Spans instance, I believe the matches are iterated 
over in index order.  Besides the performance reasons why this may
be true, you also have to keep in mind that the Spans instance has no
idea what ordering you may have used when you executed your search -- even
if it assumed you sorted by score, the SpanQuery may have been part of a
much larger, more complicated query in which the final scores were vastly
different.

If I've misunderstood your problem, could you please post a JUnit 
test case that builds a small index in a RAMDirectory, with some code 
that demonstrates what you expect to happen, and how it fails?



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: How to get Document (or filename) from Span

2006-04-18 Thread Boris Galitsky

I fully understand now. Thanks a lot
Boris

On Tue, 18 Apr 2006 11:10:20 -0700 (PDT)
 Chris Hostetter [EMAIL PROTECTED] wrote:


: The question is when I get Spans, I get start/end positions and a
: Document order (starting from 0), not the Document object itself from
: which I could get a filename.

Are you sure about that?  Spans.doc() should return the internal
document identifier, which you can pass to IndexReader.doc(int).

: Since I believe there is no way to get a
: Document object from Spans, and there is no such thing as Document ID
: in Lucene (right?) I attempt to have the same order for
: Hits and for Spans (the indexing order) and retrieve Document for each
: Spans this way.

Documents do have document IDs, assigned based on index order.  That's
what Hits.id() returns.

FYI: take a look at the TestSpans.testSpanNearOrderedOverlap class 
for an example of how the Spans class works.  (It's what I'm using as 
a basis for my suggestion on how to use the class -- I've never used 
it myself.)
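
Putting the two points above together, the lookup Boris wants might be 
sketched like this (a rough sketch; the "filename" field is an 
assumption -- use whatever stored field holds the file name):

Spans spans = spanQuery.getSpans(reader);
while (spans.next()) {
  // spans.doc() is the internal document ID -- the same ID space
  // that Hits.id(i) reports -- so IndexReader.doc(int) maps it
  // back to the stored Document
  Document doc = reader.doc(spans.doc());
  String filename = doc.get("filename");
  System.out.println(filename + ": [" + spans.start() + ", " + spans.end() + ")");
}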





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



