[jira] [Comment Edited] (LUCENE-8727) IndexSearcher#search(Query,int) should operate on a shared priority queue when configured with an executor
[ https://issues.apache.org/jira/browse/LUCENE-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889181#comment-16889181 ] Mayya Sharipova edited comment on LUCENE-8727 at 7/19/19 9:28 PM: -- Some comments about design option #1. I think we should just share the min competitive score (it could be an AtomicLong or something) between collectors, and not the top hits. The reason for not sharing top hits is that collectors expect leaves in [sequential order|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java#L240-L242]. If we happen to process leaves with higher doc IDs first in the executor, we may populate the global priority queue with higher-doc-ID hits and set the global min competitive score to the next float. Then, when we process leaves with smaller doc IDs, because the global priority queue is full and we use this updated global min competitive score, we will have to skip docs with smaller doc IDs even if they have the same scores as the higher-doc-ID docs and should be selected instead. If all collectors have their own priority queues, they will make sure to fill them to N first and only then set the min competitive score.
> IndexSearcher#search(Query,int) should operate on a shared priority queue > when configured with an executor > -- > > Key: LUCENE-8727 > URL: https://issues.apache.org/jira/browse/LUCENE-8727 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Priority: Minor > > If IndexSearcher is configured with an executor, then the top docs for each > slice are computed separately before being merged once the top docs for all > slices are computed. With block-max WAND this is a bit of a waste of > resources: it would be better if an increase of the min competitive score > could help skip non-competitive hits on every slice and not just the current > one. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
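To make design option 1 concrete, a minimal sketch of a shared minimum competitive score could look like the following. This is a hypothetical helper, not Lucene's actual API (the class and method names are invented for illustration); it encodes float scores in an AtomicLong so that concurrent slice collectors can raise the shared floor lock-free, and each collector would only publish after its own queue has filled to N, which addresses the leaf-ordering concern above:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one instance shared by all slice collectors.
// Lucene scores are non-negative floats, so their raw int bits sort in
// the same order as the float values; a CAS loop keeps the maximum.
class SharedMinScore {
    private static final long UNSET = -1L;
    private final AtomicLong bits = new AtomicLong(UNSET);

    // Called by a collector once its own priority queue holds N hits.
    void accumulate(float minCompetitiveScore) {
        long encoded = Float.floatToIntBits(minCompetitiveScore);
        long current = bits.get();
        while (encoded > current && !bits.compareAndSet(current, encoded)) {
            current = bits.get();
        }
    }

    // The highest floor published by any collector so far.
    float get() {
        long current = bits.get();
        return current == UNSET ? 0f : Float.intBitsToFloat((int) current);
    }
}
```

A collector would consult `get()` before scoring a block and call `accumulate()` whenever its local minimum competitive score rises.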
[jira] [Resolved] (LUCENE-8901) Load frequencies lazily for postings and impacts
[ https://issues.apache.org/jira/browse/LUCENE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova resolved LUCENE-8901. - Resolution: Fixed > Load frequencies lazily for postings and impacts > > > Key: LUCENE-8901 > URL: https://issues.apache.org/jira/browse/LUCENE-8901 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Mayya Sharipova > Priority: Minor > > Allow frequencies blocks to be loaded lazily when they are not needed
[jira] [Commented] (LUCENE-8901) Load frequencies lazily for postings and impacts
[ https://issues.apache.org/jira/browse/LUCENE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877080#comment-16877080 ] Mayya Sharipova commented on LUCENE-8901: - PR: [https://github.com/apache/lucene-solr/pull/595] > Load frequencies lazily for postings and impacts > > > Key: LUCENE-8901 > URL: https://issues.apache.org/jira/browse/LUCENE-8901 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Mayya Sharipova > Priority: Minor > > Allow frequencies blocks to be loaded lazily when they are not needed
[jira] [Created] (LUCENE-8901) Load frequencies lazily for postings and impacts
Mayya Sharipova created LUCENE-8901: --- Summary: Load frequencies lazily for postings and impacts Key: LUCENE-8901 URL: https://issues.apache.org/jira/browse/LUCENE-8901 Project: Lucene - Core Issue Type: Improvement Reporter: Mayya Sharipova Allow frequencies blocks to be loaded lazily when they are not needed
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783921#comment-16783921 ] Mayya Sharipova commented on LUCENE-6968: - [~andyhind] Thanks very much for your answer, it made things clearer. I still have a couple of additional questions, if you don't mind: 1) With the default settings, the filter will produce 512 tokens per document, each 16 bytes in size, which is approximately 8 KB. Isn't 8 KB too large for a document's signature? 2) What is the way to combine `min_hash` tokens into a query for similarity search? Do you have any examples? Is this a work in progress? Thanks again! > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis > Reporter: Cao Manh Dat > Assignee: Tommaso Teofili > Priority: Major > Fix For: 6.2, 7.0 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH, which supports queries like this: > {quote} > Find similar documents that have a 0.8 or higher similarity score with a given > document. Similarity measurement can be cosine, Jaccard, Euclidean, etc. > {quote} > For example, given the following corpus: > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is a popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We want to find documents that have a 0.6 score in the Jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706571#comment-16706571 ] Mayya Sharipova commented on LUCENE-6968: - [~andyhind] Hello Andy! I have several questions about the implementation of the *MinHashFilter*, and was wondering if you would be able to answer them. Thanks a lot in advance. The implementation from the 1st original patch, where the minimum set is kept, is very clear to me, and follows the classic idea of constructing a MinHash signature and doing LSH search after it. But I am having a hard time understanding the final implementation of MinHashFilter. 1) What constitutes the signature of a document? Is it all the values stored in the hash table? Doesn't that make a signature too large? Could you please point me to the paper that describes this way of constructing MinHash signatures? 2) What is the use of the {{withRotation}} parameter? What is the advantage of using {{withRotation=true}}? In the paper you cited, [http://www.auai.org/uai2014/proceedings/individuals/225.pdf], they fill empty bins with the "value of the closest non-empty bin in the clockwise direction (circular right hand side) added *with offset C*". In the {{MinHashFilter}} implementation, values for empty buckets are just blindly copied from non-empty ones, so a lot of buckets will have the same value. Hopefully the questions make sense. Thanks again in advance. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis > Reporter: Cao Manh Dat > Assignee: Tommaso Teofili > Priority: Major > Fix For: 6.2, 7.0 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH, which supports queries like this: > {quote} > Find similar documents that have a 0.8 or higher similarity score with a given > document. Similarity measurement can be cosine, Jaccard, Euclidean, etc. > {quote} > For example, given the following corpus: > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is a popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We want to find documents that have a 0.6 score in the Jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)
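For background on the questions above, the classic k-hash MinHash construction can be sketched in a few lines. This is a from-scratch illustration, not the MinHashFilter code; the per-function seeding scheme below is an assumption standing in for k independent hash functions. Each of k hash functions keeps the minimum hash value over a document's token set, and the fraction of matching slots between two signatures estimates their Jaccard similarity:

```java
import java.util.Arrays;
import java.util.Set;

// Illustrative k-permutation MinHash (not Lucene's MinHashFilter).
class MinHashSketch {
    // Signature: for each of k seeded hash functions, the minimum hash
    // over the document's token set.
    static long[] signature(Set<String> tokens, int k) {
        long[] sig = new long[k];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String t : tokens) {
            for (int i = 0; i < k; i++) {
                // Mix the token hash with a per-function seed; this is a
                // stand-in for k truly independent hash functions.
                long h = mix(t.hashCode() * 0x9E3779B97F4A7C15L + i);
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Fraction of agreeing slots estimates Jaccard similarity.
    static double estimateJaccard(long[] a, long[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
        return (double) same / a.length;
    }

    private static long mix(long z) { // splitmix64 finalizer
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
```

With this framing, "signature size" is k values per document, which is where the 512-tokens-of-16-bytes question above comes from.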
[jira] [Commented] (LUCENE-8529) Use the completion key to tiebreak completion suggestion
[ https://issues.apache.org/jira/browse/LUCENE-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654125#comment-16654125 ] Mayya Sharipova commented on LUCENE-8529: - Thanks [~jim.ferenczi]. I was wondering if there is any reason why we don't put all this comparison logic (scores, keys, docIds) inside `SuggestScoreDoc::compareTo` and then just use `a.compareTo(b)`, where `a` and `b` are `SuggestScoreDoc` instances? > Use the completion key to tiebreak completion suggestion > > > Key: LUCENE-8529 > URL: https://issues.apache.org/jira/browse/LUCENE-8529 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Jim Ferenczi > Priority: Minor > Attachments: LUCENE-8529.patch > > > Today the completion suggester uses the document id to tiebreak completion > suggestions with the same score. It would improve the stability of the sort to > use the surface form of suggestions as the first tiebreaker.
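The tiebreak order under discussion (score descending, then the completion key, then doc id) can be sketched as a single comparator. `Suggestion` here is a hypothetical stand-in for `SuggestScoreDoc`, not the real class:

```java
import java.util.Comparator;

// Stand-in for SuggestScoreDoc: higher score wins; ties are broken by the
// surface-form key, then by doc id, so the sort is fully deterministic.
class Suggestion {
    final float score;
    final String key;
    final int doc;

    Suggestion(float score, String key, int doc) {
        this.score = score;
        this.key = key;
        this.doc = doc;
    }

    static final Comparator<Suggestion> ORDER =
        Comparator.comparingDouble((Suggestion s) -> s.score).reversed()
                  .thenComparing((Suggestion s) -> s.key)
                  .thenComparingInt(s -> s.doc);
}
```

Folding all three criteria into one `compareTo` (or comparator) is exactly the consolidation the comment asks about.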
[jira] [Commented] (LUCENE-3475) ShingleFilter should handle positionIncrement of zero, e.g. synonyms
[ https://issues.apache.org/jira/browse/LUCENE-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347069#comment-16347069 ] Mayya Sharipova commented on LUCENE-3475: - [~jpountz] thanks so much for the suggestions. [~romseygeek] is going to work on this issue. I will study his solution. > ShingleFilter should handle positionIncrement of zero, e.g. synonyms > -- > > Key: LUCENE-3475 > URL: https://issues.apache.org/jira/browse/LUCENE-3475 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Affects Versions: 3.4 > Reporter: Cameron > Priority: Minor > Labels: newdev > > ShingleFilter is creating shingles for a single term that has been expanded > by synonyms when it shouldn't. The position increment is 0. > As an example, I have an Analyzer with a SynonymFilter followed by a > ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces > two tokens at position 1: car, auto. The ShingleFilter then produces three > tokens (car, car auto, auto) when there should only be two (car, auto). This behavior > seems incorrect.
[jira] [Comment Edited] (LUCENE-3475) ShingleFilter should handle positionIncrement of zero, e.g. synonyms
[ https://issues.apache.org/jira/browse/LUCENE-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344192#comment-16344192 ] Mayya Sharipova edited comment on LUCENE-3475 at 1/29/18 11:33 PM: --- [~jpountz] Hi Adrien! I wonder what would be the best approach to handle positionIncrement=0? I was thinking that in *ShingleFilter:getNextToken* we could do something like this:
{code:java}
if (input.incrementToken()) {
  while (posIncrAtt.getPositionIncrement() == 0) { // we may have multiple synonyms
    if (input.incrementToken()) { // go to the next token
      // store the synonym tokens and the following tokens somewhere,
      // and create a new input TokenStream from them?
    }
  }
}
{code}
I guess I am wondering if we have any other reference code that recreates a TokenStream from synonym tokens? > ShingleFilter should handle positionIncrement of zero, e.g. synonyms > -- > > Key: LUCENE-3475 > URL: https://issues.apache.org/jira/browse/LUCENE-3475 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis > Affects Versions: 3.4 > Reporter: Cameron > Priority: Minor > Labels: newdev > > ShingleFilter is creating shingles for a single term that has been expanded > by synonyms when it shouldn't. The position increment is 0. > As an example, I have an Analyzer with a SynonymFilter followed by a > ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces > two tokens at position 1: car, auto. The ShingleFilter then produces three > tokens (car, car auto, auto) when there should only be two (car, auto). This behavior > seems incorrect.
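One way to reason about the desired behavior is to group tokens by position (a position increment of 0 joins the previous position) and build shingles only across positions, never within one. The sketch below is an illustration of that idea over plain (term, positionIncrement) pairs, not ShingleFilter's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative position-aware bigram shingling: synonyms sharing a
// position (posInc == 0) never shingle with each other.
class PositionShingles {
    static List<String> bigrams(List<String> terms, List<Integer> posIncs) {
        // Group tokens into positions: posInc > 0 opens a new position,
        // posInc == 0 stacks the token onto the previous one.
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            if (posIncs.get(i) > 0 || positions.isEmpty()) positions.add(new ArrayList<>());
            positions.get(positions.size() - 1).add(terms.get(i));
        }
        // Cross every term at position p with every term at position p + 1.
        List<String> out = new ArrayList<>();
        for (int p = 0; p + 1 < positions.size(); p++)
            for (String a : positions.get(p))
                for (String b : positions.get(p + 1))
                    out.add(a + " " + b);
        return out;
    }
}
```

On the issue's example, "car"/"auto" at one position followed by nothing produces no bigrams at all, and followed by another token each synonym shingles with that token separately.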
[jira] [Resolved] (LUCENE-8100) Error on reindex using WordNet synonyms file
[ https://issues.apache.org/jira/browse/LUCENE-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova resolved LUCENE-8100. - Resolution: Won't Fix Looks like it is an issue in Elasticsearch. > Error on reindex using WordNet synonyms file > > > Key: LUCENE-8100 > URL: https://issues.apache.org/jira/browse/LUCENE-8100 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis > Affects Versions: 7.0.1 > Reporter: Mayya Sharipova > Priority: Minor > > Originally reported in the ES issues: > https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983 > but it looks like the issue was introduced in Lucene 7.0.x. > Copying the user's issue here: > -- > I'm encountering the following error on indexing when trying to use the > wn_s.pl synonyms file (which I've moved to /usr/local/etc/elasticsearch): > {code:javascript} > { > "error": { > "root_cause": [{ > "type": "illegal_argument_exception", > "reason": "failed to build synonyms" > }], > "type": "illegal_argument_exception", > "reason": "failed to build synonyms", > "caused_by": { > "type": "parse_exception", > "reason": "Invalid synonym rule at line 2", > "caused_by": { > "type": "illegal_argument_exception", > "reason": "term: physical entity analyzed to a > token with posinc != 1" > } > } > } > } > {code} > Here's the line it's objecting to: > s(11930,1,'physical entity',n,1,0). > I'm using the WordNet Prolog synonyms file from > http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2 > -- > Looks like the error comes from Lucene's classes *WordnetSynonymParser* > and *SynonymMap*, and changes introduced in Lucene 7.0.
[jira] [Updated] (LUCENE-8100) Error on reindex using WordNet synonyms file
[ https://issues.apache.org/jira/browse/LUCENE-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayya Sharipova updated LUCENE-8100: Description: Originally reported in the ES issues: https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983 but it looks like the issue was introduced in Lucene 7.0.x. Copying the user's issue here: -- I'm encountering the following error on indexing when trying to use the wn_s.pl synonyms file (which I've moved to /usr/local/etc/elasticsearch): {code:javascript} { "error": { "root_cause": [{ "type": "illegal_argument_exception", "reason": "failed to build synonyms" }], "type": "illegal_argument_exception", "reason": "failed to build synonyms", "caused_by": { "type": "parse_exception", "reason": "Invalid synonym rule at line 2", "caused_by": { "type": "illegal_argument_exception", "reason": "term: physical entity analyzed to a token with posinc != 1" } } } } {code} Here's the line it's objecting to: s(11930,1,'physical entity',n,1,0). I'm using the WordNet Prolog synonyms file from http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2 -- Looks like the error comes from Lucene's classes *WordnetSynonymParser* and *SynonymMap*, and changes introduced in Lucene 7.0. > Error on reindex using WordNet synonyms file > -- > > Key: LUCENE-8100 > URL: https://issues.apache.org/jira/browse/LUCENE-8100 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis > Affects Versions: 7.0.1 > Reporter: Mayya Sharipova > Priority: Minor > > Originally reported in the ES issues: > https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983 > but it looks like the issue was introduced in Lucene 7.0.x.
> Copying the user's issue here: > -- > I'm encountering the following error on indexing when trying to use the > wn_s.pl synonyms file (which I've moved to /usr/local/etc/elasticsearch): > {code:javascript} > { > "error": { > "root_cause": [{ > "type": "illegal_argument_exception", > "reason": "failed to build synonyms" > }], > "type": "illegal_argument_exception", > "reason": "failed to build synonyms", > "caused_by": { > "type": "parse_exception", > "reason": "Invalid synonym rule at line 2", > "caused_by": { > "type": "illegal_argument_exception", > "reason": "term: physical entity analyzed to a > token with posinc != 1" > } > } > } > } > {code} > Here's the line it's objecting to: > s(11930,1,'physical
[jira] [Created] (LUCENE-8100) Error on reindex using WordNet synonyms file
Mayya Sharipova created LUCENE-8100: --- Summary: Error on reindex using WordNet synonyms file Key: LUCENE-8100 URL: https://issues.apache.org/jira/browse/LUCENE-8100 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 7.0.1 Reporter: Mayya Sharipova Priority: Minor Originally reported in the ES issues: https://github.com/elastic/elasticsearch/issues/27798#issuecomment-351838983 but it looks like the issue was introduced in Lucene 7.0.x. Copying the user's issue here: -- I'm encountering the following error on indexing when trying to use the wn_s.pl synonyms file (which I've moved to /usr/local/etc/elasticsearch): {code:javascript} { "error": { "root_cause": [{ "type": "illegal_argument_exception", "reason": "failed to build synonyms" }], "type": "illegal_argument_exception", "reason": "failed to build synonyms", "caused_by": { "type": "parse_exception", "reason": "Invalid synonym rule at line 2", "caused_by": { "type": "illegal_argument_exception", "reason": "term: physical entity analyzed to a token with posinc != 1" } } } } {code} Here's the line it's objecting to: s(11930,1,'physical entity',n,1,0). I'm using the WordNet Prolog synonyms file from http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2 -- Looks like the error comes from Lucene's classes *WordnetSynonymParser* and *SynonymMap*, and changes introduced in Lucene 7.0.
[jira] [Commented] (LUCENE-8011) Improve similarity explanations
[ https://issues.apache.org/jira/browse/LUCENE-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270661#comment-16270661 ] Mayya Sharipova commented on LUCENE-8011: - thanks [~jpountz], will work on the classes you suggested > Improve similarity explanations > --- > > Key: LUCENE-8011 > URL: https://issues.apache.org/jira/browse/LUCENE-8011 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Robert Muir > Labels: newdev > > LUCENE-7997 improves BM25 and Classic explains to better explain: > {noformat} > product of: > 2.2 = scaling factor, k1 + 1 > 9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from: > 1.0 = n, number of documents containing term > 17927.0 = N, total number of documents with field > 0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) > from: > 979.0 = freq, occurrences of term within document > 1.2 = k1, term saturation parameter > 0.75 = b, length normalization parameter > 1.0 = dl, length of field > 1.0 = avgdl, average length of field > {noformat} > Previously it was pretty cryptic and used confusing terminology like > docCount/docFreq without explanation: > {noformat} > product of: > 0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / > (docFreq + 0.5)) from: > 449.0 = docFreq > 456.0 = docCount > 2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b > * fieldLength / avgFieldLength)) from: > 113659.0 = freq=113658 > 1.2 = parameter k1 > 0.75 = parameter b > 2300.5593 = avgFieldLength > 1048600.0 = fieldLength > {noformat} > We should fix other similarities too in the same way, they should be more > practical.
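The two quantities in the improved explanation can be checked directly against the printed formulas. This is a sketch of the formulas exactly as spelled out in the explain output, not Lucene's BM25Similarity code:

```java
// The BM25 factors as printed in the explanation:
//   idf = log(1 + (N - n + 0.5) / (n + 0.5))
//   tf  = freq / (freq + k1 * (1 - b + b * dl / avgdl))
class Bm25Terms {
    // n = number of documents containing the term, bigN = total docs with field.
    static double idf(long n, long bigN) {
        return Math.log(1 + (bigN - n + 0.5) / (n + 0.5));
    }

    // freq = term occurrences, k1/b = BM25 parameters,
    // dl = field length, avgdl = average field length.
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }
}
```

Plugging in the values from the explanation above (n=1, N=17927, freq=979, k1=1.2, b=0.75, dl=avgdl=1) reproduces the 9.388654 and 0.9987758 factors shown.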
[jira] [Commented] (LUCENE-8011) Improve similarity explanations
[ https://issues.apache.org/jira/browse/LUCENE-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266230#comment-16266230 ] Mayya Sharipova commented on LUCENE-8011: - Hello! Which other specific similarity classes would we like to tackle here? Would, for example, {{AfterEffect}}, {{AfterEffectB}}, and {{Normalization}} be good candidates? > Improve similarity explanations > --- > > Key: LUCENE-8011 > URL: https://issues.apache.org/jira/browse/LUCENE-8011 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Robert Muir > Labels: newdev > > LUCENE-7997 improves BM25 and Classic explains to better explain: > {noformat} > product of: > 2.2 = scaling factor, k1 + 1 > 9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from: > 1.0 = n, number of documents containing term > 17927.0 = N, total number of documents with field > 0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) > from: > 979.0 = freq, occurrences of term within document > 1.2 = k1, term saturation parameter > 0.75 = b, length normalization parameter > 1.0 = dl, length of field > 1.0 = avgdl, average length of field > {noformat} > Previously it was pretty cryptic and used confusing terminology like > docCount/docFreq without explanation: > {noformat} > product of: > 0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / > (docFreq + 0.5)) from: > 449.0 = docFreq > 456.0 = docCount > 2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b > * fieldLength / avgFieldLength)) from: > 113659.0 = freq=113658 > 1.2 = parameter k1 > 0.75 = parameter b > 2300.5593 = avgFieldLength > 1048600.0 = fieldLength > {noformat} > We should fix other similarities too in the same way, they should be more > practical.