[jira] [Comment Edited] (LUCENE-8727) IndexSearcher#search(Query,int) should operate on a shared priority queue when configured with an executor

2019-07-19 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889181#comment-16889181
 ] 

Mayya Sharipova edited comment on LUCENE-8727 at 7/19/19 9:28 PM:
--

Some comments about design option #1.

I think we should share only the min competitive score (it could be an AtomicLong 
or something similar) between collectors, and not the top hits. The reason for not 
sharing top hits is that collectors expect leaves in [sequential 
order|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L240-L242]. 
If we happen to process leaves with higher doc IDs first in the executor, we may 
populate the global priority queue with higher-ID docs and set the global min 
competitive score to the next float above the queue's minimum score. Later, when we 
process leaves with smaller doc IDs, because the global priority queue is already 
full and we use this updated global min competitive score, we would have to skip 
docs with smaller doc IDs even if they have the same scores as the higher-ID docs 
and should have been selected instead.

If every collector has its own priority queue, each one will first fill its queue 
to N hits and only after that start setting a min competitive score.
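
Below is a rough sketch of what sharing only the min competitive score could look 
like. This is just an illustration of the idea, not actual Lucene code: the class 
and method names are made up, and it assumes scores are non-negative so that their 
raw float bits can be max-accumulated in a single AtomicLong.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one instance is shared by all per-slice collectors.
final class SharedMinCompetitiveScore {
  // holds Float.floatToIntBits of the highest published score; 0 == score 0.0f
  private final AtomicLong maxMinScoreBits = new AtomicLong(0);

  // called by a slice collector only once its own queue already holds N hits
  void publish(float sliceMinCompetitiveScore) {
    long bits = Float.floatToIntBits(sliceMinCompetitiveScore) & 0xFFFFFFFFL;
    maxMinScoreBits.accumulateAndGet(bits, Math::max);
  }

  // read by a slice collector before it forwards a score to its scorer
  float get() {
    return Float.intBitsToFloat((int) maxMinScoreBits.get());
  }
}
{code}

Each collector would then pass the maximum of its own local min competitive score 
and the shared value to Scorable#setMinCompetitiveScore, which preserves the 
fill-to-N-first behaviour described above while still letting one slice's progress 
help the other slices skip non-competitive hits.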


was (Author: mayyas):
Some comments about design option #1.

I think we should share only the min competitive score (it could be an AtomicLong 
or something similar) between collectors, and not the top hits. The reason for not 
sharing top hits is that collectors expect leaves in [sequential 
order|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L240-L242]. 
If we happen to process leaves with higher doc IDs first in the executor, we may 
populate the global priority queue with higher-ID docs and set the global min 
competitive score to the next float above the queue's minimum score. Later, when we 
process leaves with smaller doc IDs, because the global priority queue is already 
full and we use this updated global min competitive score, we would have to skip 
docs with smaller doc IDs even if they have the same scores as the higher-ID docs 
and should have been selected instead.

If every collector has its own priority queue, each one will first fill its queue 
to N hits and only after that start setting a min competitive score.

> IndexSearcher#search(Query,int) should operate on a shared priority queue 
> when configured with an executor
> --
>
> Key: LUCENE-8727
> URL: https://issues.apache.org/jira/browse/LUCENE-8727
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> If IndexSearcher is configured with an executor, then the top docs for each 
> slice are computed separately before being merged once the top docs for all 
> slices are computed. With block-max WAND this is a bit of a waste of 
> resources: it would be better if an increase of the min competitive score 
> could help skip non-competitive hits on every slice and not just the current 
> one.






[jira] [Commented] (LUCENE-8727) IndexSearcher#search(Query,int) should operate on a shared priority queue when configured with an executor

2019-07-19 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889181#comment-16889181
 ] 

Mayya Sharipova commented on LUCENE-8727:
-

Some comments about design option #1.

I think we should share only the min competitive score (it could be an AtomicLong 
or something similar) between collectors, and not the top hits. The reason for not 
sharing top hits is that collectors expect leaves in [sequential 
order|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L240-L242]. 
If we happen to process leaves with higher doc IDs first in the executor, we may 
populate the global priority queue with higher-ID docs and set the global min 
competitive score to the next float above the queue's minimum score. Later, when we 
process leaves with smaller doc IDs, because the global priority queue is already 
full and we use this updated global min competitive score, we would have to skip 
docs with smaller doc IDs even if they have the same scores as the higher-ID docs 
and should have been selected instead.

If every collector has its own priority queue, each one will first fill its queue 
to N hits and only after that start setting a min competitive score.

> IndexSearcher#search(Query,int) should operate on a shared priority queue 
> when configured with an executor
> --
>
> Key: LUCENE-8727
> URL: https://issues.apache.org/jira/browse/LUCENE-8727
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> If IndexSearcher is configured with an executor, then the top docs for each 
> slice are computed separately before being merged once the top docs for all 
> slices are computed. With block-max WAND this is a bit of a waste of 
> resources: it would be better if an increase of the min competitive score 
> could help skip non-competitive hits on every slice and not just the current 
> one.






[jira] [Resolved] (LUCENE-8901) Load frequencies lazily for postings and impacts

2019-07-02 Thread Mayya Sharipova (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova resolved LUCENE-8901.
-
Resolution: Fixed

> Load frequencies lazily for postings and impacts
> 
>
> Key: LUCENE-8901
> URL: https://issues.apache.org/jira/browse/LUCENE-8901
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Minor
>
> Allow frequency blocks to be loaded lazily, so that they are not decoded when 
> they are not needed






[jira] [Commented] (LUCENE-8901) Load frequencies lazily for postings and impacts

2019-07-02 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877080#comment-16877080
 ] 

Mayya Sharipova commented on LUCENE-8901:
-

PR: [https://github.com/apache/lucene-solr/pull/595]

> Load frequencies lazily for postings and impacts
> 
>
> Key: LUCENE-8901
> URL: https://issues.apache.org/jira/browse/LUCENE-8901
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Minor
>
> Allow frequency blocks to be loaded lazily, so that they are not decoded when 
> they are not needed






[jira] [Created] (LUCENE-8901) Load frequencies lazily for postings and impacts

2019-07-02 Thread Mayya Sharipova (JIRA)
Mayya Sharipova created LUCENE-8901:
---

 Summary: Load frequencies lazily for postings and impacts
 Key: LUCENE-8901
 URL: https://issues.apache.org/jira/browse/LUCENE-8901
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mayya Sharipova


Allow frequency blocks to be loaded lazily, so that they are not decoded when 
they are not needed






[jira] [Commented] (LUCENE-6968) LSH Filter

2019-03-04 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783921#comment-16783921
 ] 

Mayya Sharipova commented on LUCENE-6968:
-

[~andyhind] Thanks very much for your answer, it made things clearer. I still have 
a couple of additional questions, if you don't mind:

1) With the default settings, the filter will produce 512 tokens per document, each 
16 bytes in size, which is approximately 8 KB (512 × 16 bytes = 8,192 bytes). Isn't 
8 KB too large to be a document's signature?

2) What is the recommended way to combine `min_hash` tokens into a query for 
similarity search? Do you have any examples? Is this work in progress?

Thanks again!

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a given 
> document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a Jaccard score of 0.6 with this 
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (LUCENE-6968) LSH Filter

2018-12-02 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706571#comment-16706571
 ] 

Mayya Sharipova commented on LUCENE-6968:
-

[~andyhind]  Hello Andy! I have several questions about the implementation of 
the *MinHashFilter*, and was wondering if you would be able to answer them. 
Thanks a lot in advance.

The implementation from the original 1st patch, where the minimum set is kept, is 
very clear to me and follows the classic idea of constructing a MinHash signature 
and running an LSH search on top of it. But I am having a hard time understanding 
the final implementation of MinHashFilter.

1) What constitutes the signature of a document? Is it all of the values stored in 
the hash table? Doesn't that make the signature too large? Could you please point 
me to the paper that describes this way of constructing MinHash signatures?

2) What is the purpose of the {{withRotation}} parameter, and what is the advantage 
of using {{withRotation=true}}? In the paper you cited, 
[http://www.auai.org/uai2014/proceedings/individuals/225.pdf], empty bins are 
filled with the "value of the closest non-empty bin in the clockwise direction 
(circular right hand side) added *with offset C*". In the {{MinHashFilter}} 
implementation, the values for empty buckets are just blindly copied from non-empty 
ones, so many buckets will have the same value. My reading of the paper's 
densification step is sketched below.
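
Here is a rough sketch of that densification step as I understand it from the paper 
(my own illustration, not the actual MinHashFilter code; the method name, the 
sentinel for empty bins and the per-hop offset are all my assumptions):

{code:java}
// Hypothetical illustration of the "rotation" densification from the cited paper:
// an empty bin takes the value of the closest non-empty bin in the clockwise
// direction, shifted by j * C where j is the distance to that bin.
static long[] densify(long[] bins, long emptySentinel, long offsetC) {
  int k = bins.length;
  long[] out = bins.clone();
  for (int i = 0; i < k; i++) {
    if (bins[i] != emptySentinel) {
      continue; // bin already holds a real minhash value
    }
    for (int j = 1; j < k; j++) {
      int src = (i + j) % k;            // walk clockwise, wrapping around
      if (bins[src] != emptySentinel) { // nearest originally non-empty bin
        out[i] = bins[src] + j * offsetC;
        break;
      }
    }
  }
  return out;
}
{code}

If I read the current {{MinHashFilter}} correctly, it effectively does the copy 
without adding any offset, which is why so many buckets end up with identical 
values.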

Hopefully the questions make sense. Thanks again in advance.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a given 
> document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a Jaccard score of 0.6 with this 
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (LUCENE-8529) Use the completion key to tiebreak completion suggestion

2018-10-17 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654125#comment-16654125
 ] 

Mayya Sharipova commented on LUCENE-8529:
-

Thanks [~jim.ferenczi]. I was wondering if there is any reason why we don't put 
all of this comparison logic (scores, keys, docIds) inside 
`SuggestScoreDoc::compareTo` and then just use `a.compareTo(b)`, where `a` and 
`b` are `SuggestScoreDoc` instances?
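
A rough sketch of what I mean (just to illustrate the idea; the field names and 
types are my assumptions, not the actual Lucene class):

{code:java}
// Hypothetical sketch: score descending, then the surface form (key), then doc id.
final class SuggestScoreDoc implements Comparable<SuggestScoreDoc> {
  final float score;
  final String key; // surface form of the suggestion
  final int doc;

  SuggestScoreDoc(float score, String key, int doc) {
    this.score = score;
    this.key = key;
    this.doc = doc;
  }

  @Override
  public int compareTo(SuggestScoreDoc other) {
    int cmp = Float.compare(other.score, this.score); // higher scores first
    if (cmp != 0) {
      return cmp;
    }
    cmp = this.key.compareTo(other.key); // then tiebreak on the surface form
    if (cmp != 0) {
      return cmp;
    }
    return Integer.compare(this.doc, other.doc); // doc id as the last resort
  }
}
{code}

The priority queue and the final sort of the hits could then both rely on the same 
`compareTo` instead of duplicating the comparison logic.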

> Use the completion key to tiebreak completion suggestion
> 
>
> Key: LUCENE-8529
> URL: https://issues.apache.org/jira/browse/LUCENE-8529
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8529.patch
>
>
> Today the completion suggester uses the document id to tiebreak completion 
> suggestions with the same score. It would improve the stability of the sort to 
> use the surface form of the suggestions as the first tiebreaker.






[jira] [Commented] (LUCENE-3475) ShingleFilter should handle positionIncrement of zero, e.g. synonyms

2018-01-31 Thread Mayya Sharipova (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347069#comment-16347069
 ] 

Mayya Sharipova commented on LUCENE-3475:
-

[~jpountz] Thanks so much for the suggestions.

[~romseygeek] is going to work on this issue. I will study his solution.

> ShingleFilter should handle positionIncrement of zero, e.g. synonyms
> 
>
> Key: LUCENE-3475
> URL: https://issues.apache.org/jira/browse/LUCENE-3475
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 3.4
>Reporter: Cameron
>Priority: Minor
>  Labels: newdev
>
> ShingleFilter is creating shingles for a single term that has been expanded 
> by synonyms when it shouldn't. The position increment is 0.
> As an example, I have an Analyzer with a SynonymFilter followed by a 
> ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces 
> two tokens at position 1: car, auto. The ShingleFilter then produces 3 
> tokens: car, car auto, auto, when there should only be two: car and auto. This 
> behavior seems incorrect.






[jira] [Comment Edited] (LUCENE-3475) ShingleFilter should handle positionIncrement of zero, e.g. synonyms

2018-01-29 Thread Mayya Sharipova (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344192#comment-16344192
 ] 

Mayya Sharipova edited comment on LUCENE-3475 at 1/29/18 11:33 PM:
---

[~jpountz] Hi Adrien!  I wonder what would be the best approach to handle 
positionIncrement=0?

I was thinking that in *ShingleFilter:getNextToken* we could do something like 
this:
{code:java}
if (input.incrementToken()) {
  while (posIncrAtt.getPositionIncrement() == 0) { // we may have multiple synonyms
    if (input.incrementToken()) { // go to the next token
      // store the synonym tokens and the following tokens somewhere and
      // create a new input TokenStream from them?
    }
  }
}
{code}
I guess I am wondering whether we have any other reference code that recreates a 
TokenStream from synonym tokens?
  


was (Author: mayyas):
[~jpountz] Hi Adrien!  I wonder what would be the best approach to handle 
positionIncrement=0?

I was thinking that in *ShingleFilter:getNextToken* we could do something like 
this:
{code:java}
if (input.incrementToken()) {
  while (posIncrAtt.getPositionIncrement() == 0) { // we may have multiple synonyms
    if (input.incrementToken()) { // go to the next token
      // store the synonym tokens and the following tokens somewhere and
      // create a new input TokenStream from them?
    }
  }
}
{code}

I guess I am wondering if we have any other sample code that already does this, 
which I could reference?
 

> ShingleFilter should handle positionIncrement of zero, e.g. synonyms
> 
>
> Key: LUCENE-3475
> URL: https://issues.apache.org/jira/browse/LUCENE-3475
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 3.4
>Reporter: Cameron
>Priority: Minor
>  Labels: newdev
>
> ShingleFilter is creating shingles for a single term that has been expanded 
> by synonyms when it shouldn't. The position increment is 0.
> As an example, I have an Analyzer with a SynonymFilter followed by a 
> ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces 
> two tokens at position 1: car, auto. The ShingleFilter then produces 3 
> tokens: car, car auto, auto, when there should only be two: car and auto. This 
> behavior seems incorrect.






[jira] [Commented] (LUCENE-3475) ShingleFilter should handle positionIncrement of zero, e.g. synonyms

2018-01-29 Thread Mayya Sharipova (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344192#comment-16344192
 ] 

Mayya Sharipova commented on LUCENE-3475:
-

[~jpountz] Hi Adrien!  I wonder what would be the best approach to handle 
positionIncrement=0?

I was thinking that in *ShingleFilter:getNextToken* we could do something like 
this:
{code:java}
if (input.incrementToken()) {
  while (posIncrAtt.getPositionIncrement() == 0) { // we may have multiple synonyms
    if (input.incrementToken()) { // go to the next token
      // store the synonym tokens and the following tokens somewhere and
      // create a new input TokenStream from them?
    }
  }
}
{code}

I guess I am wondering if we have any other sample code that already does this, 
which I could reference?
 

> ShingleFilter should handle positionIncrement of zero, e.g. synonyms
> 
>
> Key: LUCENE-3475
> URL: https://issues.apache.org/jira/browse/LUCENE-3475
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 3.4
>Reporter: Cameron
>Priority: Minor
>  Labels: newdev
>
> ShingleFilter is creating shingles for a single term that has been expanded 
> by synonyms when it shouldn't. The position increment is 0.
> As an example, I have an Analyzer with a SynonymFilter followed by a 
> ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces 
> two tokens at position 1: car, auto. The ShingleFilter then produces 3 
> tokens: car, car auto, auto, when there should only be two: car and auto. This 
> behavior seems incorrect.






[jira] [Resolved] (LUCENE-8100) Error on reindex using WordNet synonyms file

2017-12-15 Thread Mayya Sharipova (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova resolved LUCENE-8100.
-
Resolution: Won't Fix

Looks like it is an issue in Elasticsearch.

> Error on reindex using WordNet synonyms file
> 
>
> Key: LUCENE-8100
> URL: https://issues.apache.org/jira/browse/LUCENE-8100
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.0.1
>Reporter: Mayya Sharipova
>Priority: Minor
>
> Originally reported in the ES issues: 
> https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983
> but it looks like the issue was introduced in Lucene 7.0.x.
> Copying the user's issue here:
> --
> I'm encountering the following error on indexing when trying to use the 
> wn_s.pl synonyms file (which I've moved to /usr/local/etc/elasticsearch):
> {code:javascript}
> {
>   "error": {
>   "root_cause": [{
>   "type": "illegal_argument_exception",
>   "reason": "failed to build synonyms"
>   }],
>   "type": "illegal_argument_exception",
>   "reason": "failed to build synonyms",
>   "caused_by": {
>   "type": "parse_exception",
>   "reason": "Invalid synonym rule at line 2",
>   "caused_by": {
>   "type": "illegal_argument_exception",
>   "reason": "term: physical entity analyzed to a 
> token with posinc != 1"
>   }
>   }
>   }
> }
> {code}
> Here's the line it's objecting to:
> s(11930,1,'physical entity',n,1,0). 
> I'm using the WordNet Prolog synonyms file from 
> http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2
> --
> Looks like the error comes from Lucene's *WordnetSynonymParser* and *SynonymMap* 
> classes, and from changes introduced in Lucene 7.0.






[jira] [Updated] (LUCENE-8100) Error on reindex using WordNet synonyms file

2017-12-15 Thread Mayya Sharipova (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova updated LUCENE-8100:

Description: 
Originally reported in the ES issues: 
https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983

but it looks like the issue was introduced in Lucene 7.0.x.

Copying the user's issue here:

--

I'm encountering the following error on indexing when trying to use the wn_s.pl 
synonyms file (which I've moved to /usr/local/etc/elasticsearch):


{code:javascript}
{
"error": {
"root_cause": [{
"type": "illegal_argument_exception",
"reason": "failed to build synonyms"
}],
"type": "illegal_argument_exception",
"reason": "failed to build synonyms",
"caused_by": {
"type": "parse_exception",
"reason": "Invalid synonym rule at line 2",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "term: physical entity analyzed to a 
token with posinc != 1"
}
}
}
}
{code}

Here's the line it's objecting to:

s(11930,1,'physical entity',n,1,0). 
I'm using the WordNet Prolog synonyms file from 
http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2
--

Looks like the error comes from Lucene's *WordnetSynonymParser* and *SynonymMap* 
classes, and from changes introduced in Lucene 7.0.


  was:
Originally reported in the ES issues: 
https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983

but it looks like the issue was introduced in Lucene 7.0.x.

Copying the user's issue here:

--

I'm encountering the following error on indexing when trying to use the wn_s.pl 
synonyms file (which I've moved to /usr/local/etc/elasticsearch):


{code:javascript}
{
"error": {
"root_cause": [{
"type": "illegal_argument_exception",
"reason": "failed to build synonyms"
}],
"type": "illegal_argument_exception",
"reason": "failed to build synonyms",
"caused_by": {
"type": "parse_exception",
"reason": "Invalid synonym rule at line 2",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "term: physical entity analyzed to a 
token with posinc != 1"
}
}
}
}
{code}

Here's the line it's objecting to:

s(11930,1,'physical entity',n,1,0). 
I'm using the WordNet Prolog synonyms file from 
http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2
--

Looks like the error comes from Lucene's *WordnetSynonymParser* and *SynonymMap* 
classes, and from changes introduced in Lucene 7.0.



> Error on reindex using WordNet synonyms file
> 
>
> Key: LUCENE-8100
> URL: https://issues.apache.org/jira/browse/LUCENE-8100
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.0.1
>Reporter: Mayya Sharipova
>Priority: Minor
>
> Originally reported in the ES issues: 
> https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983
> but it looks like the issue was introduced in Lucene 7.0.x.
> Copying the user's issue here:
> --
> I'm encountering the following error on indexing when trying to use the 
> wn_s.pl synonyms file (which I've moved to /usr/local/etc/elasticsearch):
> {code:javascript}
> {
>   "error": {
>   "root_cause": [{
>   "type": "illegal_argument_exception",
>   "reason": "failed to build synonyms"
>   }],
>   "type": "illegal_argument_exception",
>   "reason": "failed to build synonyms",
>   "caused_by": {
>   "type": "parse_exception",
>   "reason": "Invalid synonym rule at line 2",
>   "caused_by": {
>   "type": "illegal_argument_exception",
>   "reason": "term: physical entity analyzed to a 
> token with posinc != 1"
>   }
>   }
>   }
> }
> {code}
> Here's the line it's objecting to:
> s(11930,1,'physical 

[jira] [Created] (LUCENE-8100) Error on reindex using WordNet synonyms file

2017-12-15 Thread Mayya Sharipova (JIRA)
Mayya Sharipova created LUCENE-8100:
---

 Summary: Error on reindex using WordNet synonyms file
 Key: LUCENE-8100
 URL: https://issues.apache.org/jira/browse/LUCENE-8100
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 7.0.1
Reporter: Mayya Sharipova
Priority: Minor


Originally reported in the ES issues: 
https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983

but it looks like the issue was introduced in Lucene 7.0.x.

Copying the user's issue here:

--

I'm encountering the following error on indexing when trying to use the wn_s.pl 
synonyms file (which I've moved to /usr/local/etc/elasticsearch):


{code:javascript}
{
"error": {
"root_cause": [{
"type": "illegal_argument_exception",
"reason": "failed to build synonyms"
}],
"type": "illegal_argument_exception",
"reason": "failed to build synonyms",
"caused_by": {
"type": "parse_exception",
"reason": "Invalid synonym rule at line 2",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "term: physical entity analyzed to a 
token with posinc != 1"
}
}
}
}
{code}

Here's the line it's objecting to:

s(11930,1,'physical entity',n,1,0). 
I'm using the WordNet Prolog synonyms file from 
http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2
--

Looks like the error comes from Lucene's *WordnetSynonymParser* and *SynonymMap* 
classes, and from changes introduced in Lucene 7.0.







[jira] [Commented] (LUCENE-8011) Improve similarity explanations

2017-11-29 Thread Mayya Sharipova (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270661#comment-16270661
 ] 

Mayya Sharipova commented on LUCENE-8011:
-

Thanks [~jpountz], I will work on the classes you suggested.

> Improve similarity explanations
> ---
>
> Key: LUCENE-8011
> URL: https://issues.apache.org/jira/browse/LUCENE-8011
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>  Labels: newdev
>
> LUCENE-7997 improves the BM25 and Classic explanations to better explain the score:
> {noformat}
> product of:
>   2.2 = scaling factor, k1 + 1
>   9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> 1.0 = n, number of documents containing term
> 17927.0 = N, total number of documents with field
>   0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) 
> from:
> 979.0 = freq, occurrences of term within document
> 1.2 = k1, term saturation parameter
> 0.75 = b, length normalization parameter
> 1.0 = dl, length of field
> 1.0 = avgdl, average length of field
> {noformat}
> Previously it was pretty cryptic and used confusing terminology like 
> docCount/docFreq without explanation: 
> {noformat}
> product of:
>   0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / 
> (docFreq + 0.5)) from:
> 449.0 = docFreq
> 456.0 = docCount
>   2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b 
> * fieldLength / avgFieldLength)) from:
> 113659.0 = freq=113658
> 1.2 = parameter k1
> 0.75 = parameter b
> 2300.5593 = avgFieldLength
> 1048600.0 = fieldLength
> {noformat}
> We should fix the other similarities in the same way too; their explanations 
> should be more practical.






[jira] [Commented] (LUCENE-8011) Improve similarity explanations

2017-11-26 Thread Mayya Sharipova (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266230#comment-16266230
 ] 

Mayya Sharipova commented on LUCENE-8011:
-

Hello! Which other specific similarity classes would we like to tackle here?

Would, for example, {{AfterEffect}}, {{AfterEffectB}}, and {{Normalization}} be 
good candidates?

> Improve similarity explanations
> ---
>
> Key: LUCENE-8011
> URL: https://issues.apache.org/jira/browse/LUCENE-8011
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>  Labels: newdev
>
> LUCENE-7997 improves the BM25 and Classic explanations to better explain the score:
> {noformat}
> product of:
>   2.2 = scaling factor, k1 + 1
>   9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
> 1.0 = n, number of documents containing term
> 17927.0 = N, total number of documents with field
>   0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) 
> from:
> 979.0 = freq, occurrences of term within document
> 1.2 = k1, term saturation parameter
> 0.75 = b, length normalization parameter
> 1.0 = dl, length of field
> 1.0 = avgdl, average length of field
> {noformat}
> Previously it was pretty cryptic and used confusing terminology like 
> docCount/docFreq without explanation: 
> {noformat}
> product of:
>   0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / 
> (docFreq + 0.5)) from:
> 449.0 = docFreq
> 456.0 = docCount
>   2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b 
> * fieldLength / avgFieldLength)) from:
> 113659.0 = freq=113658
> 1.2 = parameter k1
> 0.75 = parameter b
> 2300.5593 = avgFieldLength
> 1048600.0 = fieldLength
> {noformat}
> We should fix the other similarities in the same way too; their explanations 
> should be more practical.


