[jira] [Commented] (SOLR-7136) Add an AutoPhrasing TokenFilter

2015-11-20 Thread Koorosh Vakhshoori (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018409#comment-15018409
 ] 

Koorosh Vakhshoori commented on SOLR-7136:
--

As for the memory leak issue, I looked at my version and identified a couple 
of areas that could cause problems: 1) in AutoPhrasingQParserPlugin, I 
updated the code so that all resources associated with an 
AutoPhrasingTokenFilter instance are released by calling end() and close(); 
2) for phraseSets, I make sure it is populated only once. In the earlier 
version, the constructor repopulated it every time the filter was 
instantiated. However, in some cases you do want that, so I have a version of 
the constructor that force-populates phraseSets. I call it in the 
ManagedAutophraseFilterFactory class.

As for performance/scaling, no, I have not done any formal evaluation. All I 
can say is that we have it running in our QA environment, and the people who 
have tested it are satisfied with the speed. However, our speed requirements 
are in seconds, not milliseconds. I would love to hear the results of your A/B 
testing.

On the acronym topic, you hit the nail on the head. This falls under 
personalized or contextual search. In our use case, our content is a 
collection of different corpora. This means that different users, depending on 
their specialty, want to see different results for the same query. This is a 
tough nut to crack. Glad to hear you will be looking into this issue.


> Add an AutoPhrasing TokenFilter
> ---
>
> Key: SOLR-7136
> URL: https://issues.apache.org/jira/browse/SOLR-7136
> Project: Solr
>  Issue Type: New Feature
>Reporter: Ted Sullivan
> Attachments: AutoPhaseFiniteStateDiagram.pdf, SOLR-7136.patch, 
> SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases 
> that represent a single entity to be tokenized in a singular fashion. Adds 
> support for ManagedResources and Query parser auto-phrasing support given 
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long standing 
> multi-term synonym problem in Lucene Solr has been documented online. 
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-7136) Add an AutoPhrasing TokenFilter

2015-11-19 Thread Koorosh Vakhshoori (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koorosh Vakhshoori updated SOLR-7136:
-
Attachment: SOLR-7136.patch
AutoPhaseFiniteStateDiagram.pdf

Here I am uploading a new implementation of AutoPhrasing, in coordination with 
Ted. This version adds a few features on top of the previous code:
- The phrase detection algorithm is refactored as a finite-state machine (FSM). 
The FSM takes a term as input for each transition. I am including the FSM 
diagram here.
- The new code correctly keeps track of the start and end offsets in all cases.
- The code now records the PositionLength attribute, since it will be handy for 
the highlighter once the highlighter is fixed (SOLR-3390).
- There is a new argument, ‘emitAmbiguousPhrases’. When it is set to false, the 
filter emits only the auto-phrase that matches the longest sequence of terms. 
For example, if ‘New York City’ and ‘New York’ are both in the autophrases.txt 
file and the text is ‘New York City is a great place to live’, only ‘New York 
City’ is emitted. My use case required it, and I am sure others may want it 
too.
- Rather than applying AutoPhrasing at index time, you can now detect phrases 
at query time by setting ‘quotePhrase’ to true. This is a major enhancement: 
there is no need to do anything special at index time, because the queryParser 
just double-quotes the detected phrase and runs the search as a phrase query. 
Another advantage is that you can update the autophrases.txt file on the fly, 
with no need to re-index.
- The queryParser is updated so that it does not touch any term inside a quoted 
string, since that would interfere with the user’s intent. For example, in the 
query ‘we are going to “New York airport”’, the phrase “New York airport” is 
left untouched.
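
To make the ‘emitAmbiguousPhrases’ = false behaviour concrete, here is a minimal sketch of longest-match phrase detection. It is a simple greedy scan, not the finite-state machine from the patch, and the names LongestPhraseSketch, autoPhrase and maxWords are made up for illustration. It assumes the phrase list is stored lower-cased and that a detected phrase comes out as one space-joined token; the real filter may represent the token differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LongestPhraseSketch {

    // Greedy longest-match scan: at each position, try every candidate of up to
    // maxWords terms and keep only the longest one found in the phrase set.
    // Assumption: phrases are lower-cased, with single spaces between words.
    static List<String> autoPhrase(List<String> tokens, Set<String> phrases, int maxWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matchedEnd = -1; // index of the last token of the longest match
            StringBuilder candidate = new StringBuilder();
            for (int j = i; j < tokens.size() && j - i < maxWords; j++) {
                if (j > i) {
                    candidate.append(' ');
                }
                candidate.append(tokens.get(j).toLowerCase(Locale.ROOT));
                if (phrases.contains(candidate.toString())) {
                    matchedEnd = j; // a longer match overwrites a shorter one
                }
            }
            if (matchedEnd >= 0) {
                // Emit the whole phrase as a single token and skip past it.
                out.add(String.join(" ", tokens.subList(i, matchedEnd + 1)));
                i = matchedEnd + 1;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<>(Arrays.asList("new york", "new york city"));
        List<String> tokens = Arrays.asList("New York City is a great place to live".split(" "));
        // prints [New York City, is, a, great, place, to, live]
        System.out.println(autoPhrase(tokens, phrases, 3));
    }
}
```

With only ‘New York’ in the phrase set, the same input would yield ‘New York’ followed by ‘City’ as a separate token.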
Side note: as for comparing the SOLR-4381 patch with this one, in my opinion 
they are complementary, not competing. I experimented with chaining 
AutoPhrasing and Query-time Synonym as a queryParser. They work well together: 
one detects the phrases and the other expands the query to its synonyms. 
However, one issue I found concerns acronyms in the synonym list. For example, 
DC stands for ‘Direct Current’. If the indexed text has DC in it, searching 
for ‘Current’ will not match DC, since the indexed document has not expanded 
the term to ‘Direct Current’.
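
Going back to the ‘quotePhrase’ option, the following is a rough sketch of the query-time quoting idea, written against plain strings rather than the actual queryParser. QuotePhraseSketch and quotePhrases are hypothetical names. The sketch assumes ASCII double quotes in the query, single spaces inside phrases, and that text after an unbalanced opening quote counts as quoted. Longer phrases are tried first so that ‘New York City’ wins over ‘New York’.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class QuotePhraseSketch {

    // Wraps known phrases in double quotes, but only outside of spans the user
    // has already quoted. Illustration only, not the patch's implementation.
    static String quotePhrases(String query, List<String> phrases) {
        if (phrases.isEmpty()) {
            return query;
        }
        // Longest phrases first, so the regex alternation prefers them.
        List<String> sorted = new ArrayList<>(phrases);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder alternation = new StringBuilder();
        for (String phrase : sorted) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(phrase));
        }
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        // Splitting on the quote character leaves even-indexed segments outside
        // user quotes and odd-indexed segments inside them.
        String[] segments = query.split("\"", -1);
        StringBuilder out = new StringBuilder();
        for (int s = 0; s < segments.length; s++) {
            if (s > 0) {
                out.append('"');
            }
            if (s % 2 == 1) {
                out.append(segments[s]); // user-quoted text is left untouched
            } else {
                out.append(pattern.matcher(segments[s]).replaceAll("\"$0\""));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("New York", "New York City");
        System.out.println(quotePhrases("New York City is a great place to live", phrases));
        System.out.println(quotePhrases("we are going to \"New York airport\"", phrases));
    }
}
```

Because the rewritten query only adds quotes, the phrase list can change between requests without touching the index, which is the advantage mentioned for updating autophrases.txt on the fly.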





Re: TestAllAnalyzersHaveFactories fails when looking for a new Factory class, is it class loader issue?

2015-11-17 Thread Koorosh Vakhshoori
Hi Alan,
  Thanks, that was the problem. I will definitely check out the SPI link.

Regards,

Koorosh

On Tue, Nov 17, 2015 at 1:16 AM, Alan Woodward <a...@flax.co.uk> wrote:
> Hi Koorosh,
>
> Lucene analyzers and token filters are discovered via Java SPI (see
> https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html). To
> make your TokenFilter discoverable, you need to add the fully qualified
> class name of your factory to the file
> resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
>
> Alan Woodward
> www.flax.co.uk
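
For this patch, the registration presumably amounts to one line, org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory, in META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory (package and class names taken from the original message). The sketch below illustrates why the test cannot find 'AutoPhrasing' until that line exists. It assumes the loader derives each registered name by stripping a "TokenFilterFactory" or "FilterFactory" suffix and lower-casing the rest, which is consistent with the lower-case names in the error message, and it mirrors the clazzName/simpleName pairs in the debug output. SpiNameSketch and its methods are hypothetical, not Lucene API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SpiNameSketch {

    // Name the test looks up, e.g. "AutoPhrasingTokenFilter" -> "AutoPhrasing".
    static String lookupName(String filterSimpleName) {
        for (String suffix : new String[] {"TokenFilter", "Filter"}) {
            if (filterSimpleName.endsWith(suffix)) {
                return filterSimpleName.substring(0, filterSimpleName.length() - suffix.length());
            }
        }
        throw new IllegalArgumentException("not a filter class: " + filterSimpleName);
    }

    // Assumed registration rule: strip the factory suffix and lower-case the rest.
    static String registeredName(String factoryClassName) {
        String simple = factoryClassName.substring(factoryClassName.lastIndexOf('.') + 1);
        for (String suffix : new String[] {"TokenFilterFactory", "FilterFactory"}) {
            if (simple.endsWith(suffix)) {
                return simple.substring(0, simple.length() - suffix.length())
                        .toLowerCase(Locale.ROOT);
            }
        }
        throw new IllegalArgumentException("not a factory class: " + factoryClassName);
    }

    // Builds a name -> factory class map from the lines of a services file,
    // skipping blank lines and '#' comments.
    static Map<String, String> register(List<String> servicesFileLines) {
        Map<String, String> registry = new LinkedHashMap<>();
        for (String line : servicesFileLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            registry.put(registeredName(trimmed), trimmed);
        }
        return registry;
    }

    // True if a factory is registered under the name derived from the filter class.
    static boolean isDiscoverable(Map<String, String> registry, String filterSimpleName) {
        return registry.containsKey(lookupName(filterSimpleName).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> before = register(Arrays.asList(
                "org.apache.lucene.analysis.bg.BulgarianStemFilterFactory"));
        Map<String, String> after = register(Arrays.asList(
                "org.apache.lucene.analysis.bg.BulgarianStemFilterFactory",
                "org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory"));
        System.out.println(isDiscoverable(before, "AutoPhrasingTokenFilter")); // prints false
        System.out.println(isDiscoverable(after, "AutoPhrasingTokenFilter"));  // prints true
    }
}
```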

TestAllAnalyzersHaveFactories fails when looking for a new Factory class, is it class loader issue?

2015-11-16 Thread Koorosh Vakhshoori
Hi all,
  I am in the process of creating a patch for Lucene. However, I can’t get
the JUnit test TestAllAnalyzersHaveFactories to pass. I hope this is the
right forum for help. If not, kindly direct me to the correct one.
Any help is greatly appreciated!

  First, some background. The patch builds on Ted Sullivan’s work,
SOLR-7136. It is an enhanced version of AutoPhrase, which I would like to
submit to the community. The code includes a new TokenFilter,
AutoPhrasingTokenFilter, with JUnit tests. I have created the following
package:

org.apache.lucene.analysis.autophrase

This package contains the following class files:

AutoPhraseDetector.java
AutoPhrasingTokenFilter.java
AutoPhrasingTokenFilterFactory.java
package-info.java

When running under ant, the test
TestAllAnalyzersHaveFactories fails with the following output (I have
added some print statements for debugging):

-test:
   [junit4]  says ! Master seed: 86F1C35C6CE11696
   [junit4] Your default console's encoding may not display certain
unicode glyphs: US-ASCII
   [junit4] Executing 1 suite with 1 JVM.
   [junit4]
   [junit4] Started J0 PID(15156@localhost).
   [junit4] Suite: org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories
   [junit4]   1> clazzName: IndicNormalizationFilter
   [junit4]   1> simpleName: IndicNormalization
   [junit4]   1> clazzName: HyphenationCompoundWordTokenFilter
   [junit4]   1> simpleName: HyphenationCompoundWord
   [junit4]   1> clazzName: DictionaryCompoundWordTokenFilter
   [junit4]   1> simpleName: DictionaryCompoundWord
   [junit4]   1> clazzName: BulgarianStemFilter
   [junit4]   1> simpleName: BulgarianStem
   [junit4]   1> clazzName: ShingleFilter
   [junit4]   1> simpleName: Shingle
   [junit4]   1> clazzName: ReverseStringFilter
   [junit4]   1> simpleName: ReverseString
   [junit4]   1> clazzName: GreekLowerCaseFilter
   [junit4]   1> simpleName: GreekLowerCase
   [junit4]   1> clazzName: GreekStemFilter
   [junit4]   1> simpleName: GreekStem
   [junit4]   1> clazzName: HungarianLightStemFilter
   [junit4]   1> simpleName: HungarianLightStem
   [junit4]   1> clazzName: GermanNormalizationFilter
   [junit4]   1> simpleName: GermanNormalization
   [junit4]   1> clazzName: GermanLightStemFilter
   [junit4]   1> simpleName: GermanLightStem
   [junit4]   1> clazzName: GermanMinimalStemFilter
   [junit4]   1> simpleName: GermanMinimalStem
   [junit4]   1> clazzName: GermanStemFilter
   [junit4]   1> simpleName: GermanStem
   [junit4]   1> clazzName: EnglishPossessiveFilter
   [junit4]   1> simpleName: EnglishPossessive
   [junit4]   1> clazzName: EnglishMinimalStemFilter
   [junit4]   1> simpleName: EnglishMinimalStem
   [junit4]   1> clazzName: PorterStemFilter
   [junit4]   1> simpleName: PorterStem
   [junit4]   1> clazzName: KStemFilter
   [junit4]   1> simpleName: KStem
   [junit4]   1> clazzName: ItalianLightStemFilter
   [junit4]   1> simpleName: ItalianLightStem
   [junit4]   1> clazzName: HindiStemFilter
   [junit4]   1> simpleName: HindiStem
   [junit4]   1> clazzName: HindiNormalizationFilter
   [junit4]   1> simpleName: HindiNormalization
   [junit4]   1> clazzName: RussianLightStemFilter
   [junit4]   1> simpleName: RussianLightStem
   [junit4]   1> clazzName: ClassicFilter
   [junit4]   1> simpleName: Classic
   [junit4]   1> clazzName: StandardFilter
   [junit4]   1> simpleName: Standard
   [junit4]   1> clazzName: CzechStemFilter
   [junit4]   1> simpleName: CzechStem
   [junit4]   1> clazzName: ElisionFilter
   [junit4]   1> simpleName: Elision
   [junit4]   1> clazzName: DelimitedPayloadTokenFilter
   [junit4]   1> simpleName: DelimitedPayload
   [junit4]   1> clazzName: TokenOffsetPayloadTokenFilter
   [junit4]   1> simpleName: TokenOffsetPayload
   [junit4]   1> clazzName: NumericPayloadTokenFilter
   [junit4]   1> simpleName: NumericPayload
   [junit4]   1> clazzName: TypeAsPayloadTokenFilter
   [junit4]   1> simpleName: TypeAsPayload
   [junit4]   1> clazzName: AutoPhrasingTokenFilter
   [junit4]   1> simpleName: AutoPhrasing
   [junit4]   2> NOTE: reproduce with: ant test
-Dtestcase=TestAllAnalyzersHaveFactories -Dtests.method=test
-Dtests.seed=86F1C35C6CE11696 -Dtests.slow=true -Dtests.locale=zh_CN
-Dtests.timezone=US/Samoa -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   2.94s | TestAllAnalyzersHaveFactories.test <<<
   [junit4]> Throwable #1: java.lang.IllegalArgumentException: A
SPI class of type org.apache.lucene.analysis.util.TokenFilterFactory
with name 'AutoPhrasing' does not exist. You need to add the
corresponding JAR file supporting this SPI to your classpath. The
current classpath supports the following names: [apostrophe,
arabicnormalization, arabicstem, bulgarianstem, brazilianstem,
cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams,
commongramsquery, dictionarycompoundword, hyphenationcompoundword,
decimaldigit, lowercase, stop, type, uppercase, czechstem,
germanlightstem,