[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore
[ https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719272#comment-16719272 ] Christophe Bismuth commented on LUCENE-8464: Thanks a lot [~romseygeek], you made my day :D [~jim.ferenczi] provided some really great mentoring on this one (y) I hope to find some other great issues to work on! > Implement ConstantScoreScorer#setMinCompetitiveScore > > > Key: LUCENE-8464 > URL: https://issues.apache.org/jira/browse/LUCENE-8464 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Labels: newdev > Fix For: master (8.0) > > Time Spent: 8h > Remaining Estimate: 0h > > We should make it so the iterator returns NO_MORE_DOCS after > setMinCompetitiveScore is called with a value that is greater than the > constant score. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
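The idea can be sketched in isolation: every hit of a constant-score query carries the same score, so once the collector raises the minimum competitive score above that constant, no remaining document can be competitive and the iterator may simply report exhaustion. The sketch below uses hypothetical class and field names and is not Lucene's actual {{ConstantScoreScorer}} code.

{code:java}
// Standalone sketch: a constant-score iterator that terminates once the
// minimum competitive score exceeds the constant score. All names are
// hypothetical; this is not the actual Lucene implementation.
public class ConstantScoreIteratorSketch {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final float constantScore;
  private final int[] docs; // matching doc ids, in increasing order
  private int index = -1;
  private boolean exhausted = false;

  ConstantScoreIteratorSketch(float constantScore, int[] docs) {
    this.constantScore = constantScore;
    this.docs = docs;
  }

  /** Called by the collector when only scores above the given value are of interest. */
  void setMinCompetitiveScore(float minScore) {
    if (minScore > constantScore) {
      exhausted = true; // no remaining doc can beat the constant score
    }
  }

  int nextDoc() {
    if (exhausted || ++index >= docs.length) {
      return NO_MORE_DOCS;
    }
    return docs[index];
  }

  float score() {
    return constantScore;
  }

  public static void main(String[] args) {
    ConstantScoreIteratorSketch it = new ConstantScoreIteratorSketch(1.0f, new int[] {3, 7, 42});
    System.out.println(it.nextDoc());  // 3
    it.setMinCompetitiveScore(2.0f);   // greater than the constant score
    System.out.println(it.nextDoc());  // 2147483647, i.e. NO_MORE_DOCS
  }
}
{code}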
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707497#comment-16707497 ] Christophe Bismuth commented on LUCENE-8548: That's great, thanks [~jim.ferenczi] for all the details (y) > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Fix For: master (8.0), 7.7 > > Attachments: LUCENE-8548.patch, screenshot-1.png, > testCyrillicWord.dot.png > > Time Spent: 10m > Remaining Estimate: 0h > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701958#comment-16701958 ] Christophe Bismuth commented on LUCENE-8548: Thanks a lot for sharing this [~jim.ferenczi], and no worries at all as the first iteration was an interesting journey! I think taking time to read about Viterbi would help me some more, let's add it to my own little todo list :D I diffed your patch with {{master}} and debugged the new tests step-by-step, and I think I understand the big picture. Among other things, I totally missed the {{if (isCommonOrInherited(scriptCode) && isCommonOrInherited(sc) == false)}} condition, which is essential. I still have one more question: could you please explain what information is contained in the {{wordIdRef}} variable and what the {{unkDictionary.lookupWordIds(characterId, wordIdRef)}} statement does? The debugger tells me {{wordIdRef.length}} is always equal to 36 or 42, and even though 42 is a really great number, I'm a tiny bit lost in there ... > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Attachments: LUCENE-8548.patch, screenshot-1.png, > testCyrillicWord.dot.png > > Time Spent: 10m > Remaining Estimate: 0h > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
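For context on the {{isCommonOrInherited}} condition mentioned above, the standalone sketch below (plain JDK {{Character.UnicodeScript}}, not the Nori patch itself) shows why COMMON and INHERITED code points need special handling: combining diacritics and punctuation should stay inside the current chunk instead of opening a chunk of their own.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Standalone illustration (not the Nori patch itself) of script chunking that
// keeps COMMON/INHERITED code points, such as combining diacritics, inside the
// current chunk instead of splitting on them.
public class ScriptChunker {

  static boolean isCommonOrInherited(Character.UnicodeScript script) {
    return script == Character.UnicodeScript.COMMON
        || script == Character.UnicodeScript.INHERITED;
  }

  static List<String> chunksByScript(String text) {
    List<String> chunks = new ArrayList<>();
    Character.UnicodeScript chunkScript = null;
    int start = 0;
    for (int i = 0; i < text.length(); i += Character.charCount(text.codePointAt(i))) {
      Character.UnicodeScript sc = Character.UnicodeScript.of(text.codePointAt(i));
      if (chunkScript == null) {
        chunkScript = sc;                     // first code point opens the chunk
      } else if (isCommonOrInherited(sc) || sc == chunkScript) {
        // diacritics, punctuation and same-script characters stay in the chunk
      } else if (isCommonOrInherited(chunkScript)) {
        chunkScript = sc;                     // chunk had no concrete script yet: adopt this one
      } else {
        chunks.add(text.substring(start, i)); // concrete script change: emit and restart
        start = i;
        chunkScript = sc;
      }
    }
    if (start < text.length()) {
      chunks.add(text.substring(start));
    }
    return chunks;
  }

  public static void main(String[] args) {
    System.out.println(chunksByScript("εἰμί"));       // [εἰμί]: one Greek chunk
    System.out.println(chunksByScript("Ба̀лтичко̄")); // [Ба̀лтичко̄]: combining marks do not split
    System.out.println(chunksByScript("мoscow"));      // [м, oscow]: a real script change still splits here
  }
}
{code}

The last example shows that mixed Cyrillic/Latin words, as in the attached test case, still split under plain script comparison and need additional rules on top of this.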
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697336#comment-16697336 ] Christophe Bismuth commented on LUCENE-8548: I've made some progress and opened PR [#505|https://github.com/apache/lucene-solr/pull/505] to share it with you. Feel free to stop me, as I don't want to waste your time. Here is what has been done so far: * Break on script boundaries with the built-in JDK API, * Track character classes in a growing byte array, * I feel a tiny bit lost when it comes to extracting costs: should I call {{unkDictionary.lookupWordIds(characterId, wordIdRef)}} for each tracked character class? * The {{мoscow}} word is correctly parsed in the Graphviz output below ... * ... but the test failed on this [line|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java#L199] and I still have to understand why. !screenshot-1.png! > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Attachments: screenshot-1.png, testCyrillicWord.dot.png > > Time Spent: 10m > Remaining Estimate: 0h > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe Bismuth updated LUCENE-8548: --- Attachment: screenshot-1.png > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Attachments: screenshot-1.png, testCyrillicWord.dot.png > > Time Spent: 10m > Remaining Estimate: 0h > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696093#comment-16696093 ] Christophe Bismuth commented on LUCENE-8548: That is really nice, thank you [~jim.ferenczi] and [~rcmuir], I should be able to start a patch, thanks again! > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Attachments: testCyrillicWord.dot.png > > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696052#comment-16696052 ] Christophe Bismuth commented on LUCENE-8548: Great! Thank you [~rcmuir], I'll dig into this (y) > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Attachments: testCyrillicWord.dot.png > > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827 ] Christophe Bismuth edited comment on LUCENE-8548 at 11/21/18 3:12 PM: -- I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going on. Here is what I've done so far: * Implement a Cyrillic test failure (see previous comment) * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (following UAX #29: Unicode Text Segmentation) * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts) * Enable verbose output (see output below) * Enable Graphiz ouput (see attached picture) * Debug step by step the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with {{outputUnknownUnigrams}} parameter) I would need some code or documentation pointers when you have time. !testCyrillicWord.dot.png! Tokenizer verbose output: {noformat} PARSE extend @ pos=0 char=м hex=43c 1 arcs in UNKNOWN word len=1 1 wordIDs fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0 backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933 add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=1 TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) PARSE extend @ pos=1 char=o hex=6f 1 arcs in UNKNOWN word len=6 1 wordIDs fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0 no arcs in; skip pos=2 no arcs in; skip pos=3 no arcs in; skip pos=4 no arcs in; skip pos=5 no arcs in; skip pos=6 end: 1 nodes backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235 add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=7 {noformat} was (Author: cbismuth): I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going on. 
Here is what I've done so far: * Implement a Cyrillic test failure (see previous comment) * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (following UAX #29: Unicode Text Segmentation) * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts) * Enable verbose output (see output below) * Enable Graphiz ouput (see attached picture) * Debug step by step the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with {{outputUnknownUnigrams}} parameter) I would need some code or documentation pointers when you have time. !testCyrillicWord.dot.png! Tokenizer verbose output below: {noformat} PARSE extend @ pos=0 char=м hex=43c 1 arcs in UNKNOWN word len=1 1 wordIDs fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0 backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933 add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=1 TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) PARSE extend @ pos=1 char=o hex=6f 1 arcs in UNKNOWN word len=6 1 wordIDs fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0 no arcs in; skip pos=2 no
[jira] [Comment Edited] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827 ] Christophe Bismuth edited comment on LUCENE-8548 at 11/21/18 3:12 PM: -- I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going on. Here is what I've done so far: * Implement a Cyrillic test failure (see previous comment) * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (following UAX #29: Unicode Text Segmentation) * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts) * Enable verbose output (see output below) * Enable Graphiz ouput (see attached picture) * Debug step by step the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with {{outputUnknownUnigrams}} parameter) I would need some code or documentation pointers when you have time. !testCyrillicWord.dot.png! Tokenizer verbose output below: {noformat} PARSE extend @ pos=0 char=м hex=43c 1 arcs in UNKNOWN word len=1 1 wordIDs fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0 backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933 add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=1 TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) PARSE extend @ pos=1 char=o hex=6f 1 arcs in UNKNOWN word len=6 1 wordIDs fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0 no arcs in; skip pos=2 no arcs in; skip pos=3 no arcs in; skip pos=4 no arcs in; skip pos=5 no arcs in; skip pos=6 end: 1 nodes backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235 add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=7 {noformat} was (Author: cbismuth): I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going on. 
Here is what I've done so far: * Implement a Cyrillic test failure (see previous comment) * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (following UAX #29: Unicode Text Segmentation) * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts) * Enable verbose output (see output below) * Enable Graphiz ouput (see attached picture) * Debug step by step the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with {{outputUnknownUnigrams}} parameter) I would need some code or documentation pointers when you have time. !testCyrillicWord.dot.png! Tokenizer verbose output below. {noformat} PARSE extend @ pos=0 char=м hex=43c 1 arcs in UNKNOWN word len=1 1 wordIDs fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0 backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933 add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=1 TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) PARSE extend @ pos=1 char=o hex=6f 1 arcs in UNKNOWN word len=6 1 wordIDs fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0 no arcs in; skip pos=2
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827 ] Christophe Bismuth commented on LUCENE-8548: I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going on. Here is what I've done so far: * Implement a Cyrillic test failure (see previous comment) * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (following UAX #29: Unicode Text Segmentation) * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts) * Enable verbose output (see output below) * Enable Graphiz ouput (see attached picture) * Debug step by step the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with {{outputUnknownUnigrams}} parameters) I would need some code or documentation pointers when you have time. !testCyrillicWord.dot.png! Tokenizer verbose output below. {noformat} PARSE extend @ pos=0 char=м hex=43c 1 arcs in UNKNOWN word len=1 1 wordIDs fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0 backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933 add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=1 TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) PARSE extend @ pos=1 char=o hex=6f 1 arcs in UNKNOWN word len=6 1 wordIDs fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0 no arcs in; skip pos=2 no arcs in; skip pos=3 no arcs in; skip pos=4 no arcs in; skip pos=5 no arcs in; skip pos=6 end: 1 nodes backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235 add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=7 {noformat} > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Attachments: testCyrillicWord.dot.png > > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign 
language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and
[jira] [Comment Edited] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827 ] Christophe Bismuth edited comment on LUCENE-8548 at 11/21/18 3:11 PM: -- I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going on. Here is what I've done so far: * Implement a Cyrillic test failure (see previous comment) * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (following UAX #29: Unicode Text Segmentation) * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts) * Enable verbose output (see output below) * Enable Graphiz ouput (see attached picture) * Debug step by step the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with {{outputUnknownUnigrams}} parameter) I would need some code or documentation pointers when you have time. !testCyrillicWord.dot.png! Tokenizer verbose output below. {noformat} PARSE extend @ pos=0 char=м hex=43c 1 arcs in UNKNOWN word len=1 1 wordIDs fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0 backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933 add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=1 TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) PARSE extend @ pos=1 char=o hex=6f 1 arcs in UNKNOWN word len=6 1 wordIDs fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0 no arcs in; skip pos=2 no arcs in; skip pos=3 no arcs in; skip pos=4 no arcs in; skip pos=5 no arcs in; skip pos=6 end: 1 nodes backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235 add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=7 {noformat} was (Author: cbismuth): I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going on. 
Here is what I've done so far: * Implement a Cyrillic test failure (see previous comment) * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (following UAX #29: Unicode Text Segmentation) * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts) * Enable verbose output (see output below) * Enable Graphiz ouput (see attached picture) * Debug step by step the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand how and when tokens are built (I also played with {{outputUnknownUnigrams}} parameters) I would need some code or documentation pointers when you have time. !testCyrillicWord.dot.png! Tokenizer verbose output below. {noformat} PARSE extend @ pos=0 char=м hex=43c 1 arcs in UNKNOWN word len=1 1 wordIDs fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0 backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933 add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) freeBefore pos=1 TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798) PARSE extend @ pos=1 char=o hex=6f 1 arcs in UNKNOWN word len=6 1 wordIDs fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL) ** + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0 no arcs in; skip pos=2
[jira] [Updated] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe Bismuth updated LUCENE-8548: --- Attachment: testCyrillicWord.dot.png > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > Attachments: testCyrillicWord.dot.png > > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693478#comment-16693478 ] Christophe Bismuth commented on LUCENE-8548: I'll use the test failure below as a starting point. {code:java} // LUCENE-8548 - file TestKoreanAnalyzer.java public void testCyrillicWord() throws IOException { final Analyzer analyzer = new KoreanAnalyzer(TestKoreanTokenizer.readDict(), KoreanTokenizer.DEFAULT_DECOMPOUND, KoreanPartOfSpeechStopFilter.DEFAULT_STOP_TAGS, false); assertAnalyzesTo(analyzer, "мoscow", new String[]{"мoscow"}, new int[]{0}, new int[]{6}, new int[]{1} ); } {code} > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690466#comment-16690466 ] Christophe Bismuth commented on LUCENE-8552: Thank you for your help [~dsmiley] (y) > optimize getMergedFieldInfos for one-segment FieldInfos > --- > > Key: LUCENE-8552 > URL: https://issues.apache.org/jira/browse/LUCENE-8552 > Project: Lucene - Core > Issue Type: New Feature >Reporter: David Smiley >Assignee: David Smiley >Priority: Minor > Fix For: 7.7 > > Time Spent: 10m > Remaining Estimate: 0h > > FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the > first and only LeafReader if there is only one LeafReader. > Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo > implemented equals() & hashCode() (including a cached hashCode), maybe we > could also call equals() iterating through the FieldInfos to see if we should > bother adding it to the FieldInfos.Builder? Admittedly this is speculative; > may not be worth the bother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690004#comment-16690004 ] Christophe Bismuth commented on LUCENE-8548: Yes, I'm interested in this issue (y) I'll start to work on it and let you know. > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8463) Early-terminate queries sorted by SortField.DOC
[ https://issues.apache.org/jira/browse/LUCENE-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689987#comment-16689987 ] Christophe Bismuth commented on LUCENE-8463: Thanks a lot [~jim.ferenczi] :D > Early-terminate queries sorted by SortField.DOC > --- > > Key: LUCENE-8463 > URL: https://issues.apache.org/jira/browse/LUCENE-8463 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Labels: newdev > Fix For: master (8.0), 7.7 > > Time Spent: 2h > Remaining Estimate: 0h > > Currently TopFieldCollector only early-terminates when the search sort is a > prefix of the index sort, but it could also early-terminate when sorting by > doc id. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
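The idea behind this change can be sketched in isolation: within a segment, documents are collected in increasing doc id order, so when the sort itself is the doc id order, collection can stop as soon as the requested number of hits has been gathered, because every later document is guaranteed to sort after them. The names below are hypothetical and do not reflect Lucene's {{TopFieldCollector}} API.

{code:java}
// Standalone sketch: a collector for "sort by doc id" that stops visiting a
// segment once it has gathered the requested number of hits.
public class DocSortEarlyTermination {

  static class CollectionTerminated extends RuntimeException {}

  final int numHits;
  private final int[] topDocs;
  private int collected = 0;

  DocSortEarlyTermination(int numHits) {
    this.numHits = numHits;
    this.topDocs = new int[numHits];
  }

  void collect(int doc) {
    topDocs[collected++] = doc;
    if (collected == numHits) {
      // Docs arrive in increasing doc id, which is also the sort order, so no
      // later doc can be competitive: stop collecting this segment.
      throw new CollectionTerminated();
    }
  }

  public static void main(String[] args) {
    DocSortEarlyTermination collector = new DocSortEarlyTermination(3);
    int visited = 0;
    try {
      for (int doc = 0; doc < 1_000_000; doc++) {
        visited++;
        collector.collect(doc);
      }
    } catch (CollectionTerminated expected) {
      // early termination after the top 3 hits
    }
    System.out.println("visited " + visited + " docs to gather the top " + collector.numHits);
  }
}
{code}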
[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore
[ https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689497#comment-16689497 ] Christophe Bismuth commented on LUCENE-8464: Thank you [~jim.ferenczi], it was a great experience (y) > Implement ConstantScoreScorer#setMinCompetitiveScore > > > Key: LUCENE-8464 > URL: https://issues.apache.org/jira/browse/LUCENE-8464 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Labels: newdev > Fix For: master (8.0) > > Time Spent: 7h 50m > Remaining Estimate: 0h > > We should make it so the iterator returns NO_MORE_DOCS after > setMinCompetitiveScore is called with a value that is greater than the > constant score. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688242#comment-16688242 ] Christophe Bismuth commented on LUCENE-8548: Hi [~jim.ferenczi], have you started to work on a patch, or maybe I could help? I'm not a Unicode guru, but I can read the docs and learn. Feel free to let me know. > Reevaluate scripts boundary break in Nori's tokenizer > - > > Key: LUCENE-8548 > URL: https://issues.apache.org/jira/browse/LUCENE-8548 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526: > {noformat} > Tokens are split on different character POS types (which seem to not quite > line up with Unicode character blocks), which leads to weird results for > non-CJK tokens: > εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other > symbol) + μί/SL(Foreign language) > ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + > ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + > k/SL(Foreign language) + ̚/SY(Other symbol) > Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + > лтичко/SL(Foreign language) + ̄/SY(Other symbol) > don't is tokenized as don + t; same for don't (with a curly apostrophe). > אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol) > Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow > While it is still possible to find these words using Nori, there are many > more chances for false positives when the tokens are split up like this. In > particular, individual numbers and combining diacritics are indexed > separately (e.g., in the Cyrillic example above), which can lead to a > performance hit on large corpora like Wiktionary or Wikipedia. > Work around: use a character filter to get rid of combining diacritics before > Nori processes the text. This doesn't solve the Greek, Hebrew, or English > cases, though. > Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek > Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. > Combining diacritics should not trigger token splits. Non-CJK text should be > tokenized on spaces and punctuation, not by character type shifts. > Apostrophe-like characters should not trigger token splits (though I could > see someone disagreeing on this one).{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge
[ https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688170#comment-16688170 ] Christophe Bismuth commented on LUCENE-8551: The overhead makes me think a dedicated optimize/purge API would be wiser. But I don't know NRT internals well enough to have a valuable opinion on the second point. > Purge unused FieldInfo on segment merge > --- > > Key: LUCENE-8551 > URL: https://issues.apache.org/jira/browse/LUCENE-8551 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: David Smiley >Priority: Major > > If a field is effectively unused (no norms, terms index, term vectors, > docValues, stored value, points index), it will nonetheless hang around in > FieldInfos indefinitely. It would be nice to be able to recognize an unused > FieldInfo and allow it to disappear after a merge (or two). > SegmentMerger merges FieldInfo (from each segment) as nearly the first thing > it does. After that, the different index parts, before it's known what's > "used" or not. After writing, we theoretically know which fields are used or > not, though we're not doing any bookkeeping to track it. Maybe we should > track the fields used during writing so we write a filtered merged fieldInfo > at the end instead of unfiltered up front? Or perhaps upon reading a > segment, we make it cheap/easy for each index type (e.g. terms index, stored > fields, ...) to know which fields have data for the corresponding type. > Then, on a subsequent merge, we know up front to filter the FieldInfos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
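The bookkeeping idea from the issue description can be sketched as follows, with field infos modelled as a plain name-to-number map and hypothetical names rather than Lucene's {{SegmentMerger}} API: each index-part writer reports the fields it actually writes, and the merged field infos are filtered down to that set at the end of the merge.

{code:java}
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Standalone sketch of tracking which fields receive data during a merge and
// filtering the merged field infos accordingly. Hypothetical names only.
public class FieldUsageTracker {

  private final Set<String> usedFields = new LinkedHashSet<>();

  /** Called by each index-part writer (postings, doc values, points, ...) when it writes data for a field. */
  void markUsed(String fieldName) {
    usedFields.add(fieldName);
  }

  /** Keeps only the fields that were actually written. */
  Map<String, Integer> filter(Map<String, Integer> mergedFieldInfos) {
    Map<String, Integer> filtered = new LinkedHashMap<>();
    mergedFieldInfos.forEach((name, number) -> {
      if (usedFields.contains(name)) {
        filtered.put(name, number);
      }
    });
    return filtered;
  }

  public static void main(String[] args) {
    FieldUsageTracker tracker = new FieldUsageTracker();

    Map<String, Integer> merged = new LinkedHashMap<>();
    merged.put("title", 0);
    merged.put("body", 1);
    merged.put("legacy_field", 2); // no longer carries any data in the merged segment

    tracker.markUsed("title");
    tracker.markUsed("body");

    System.out.println(tracker.filter(merged)); // {title=0, body=1}
  }
}
{code}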
[jira] [Comment Edited] (LUCENE-8551) Purge unused FieldInfo on segment merge
[ https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687703#comment-16687703 ] Christophe Bismuth edited comment on LUCENE-8551 at 11/15/18 2:40 PM: -- Sounds challenging, I'd like to work on it! was (Author: cbismuth): Sounds challenging, I'd like to work in it! > Purge unused FieldInfo on segment merge > --- > > Key: LUCENE-8551 > URL: https://issues.apache.org/jira/browse/LUCENE-8551 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: David Smiley >Priority: Major > > If a field is effectively unused (no norms, terms index, term vectors, > docValues, stored value, points index), it will nonetheless hang around in > FieldInfos indefinitely. It would be nice to be able to recognize an unused > FieldInfo and allow it to disappear after a merge (or two). > SegmentMerger merges FieldInfo (from each segment) as nearly the first thing > it does. After that, the different index parts, before it's known what's > "used" or not. After writing, we theoretically know which fields are used or > not, though we're not doing any bookkeeping to track it. Maybe we should > track the fields used during writing so we write a filtered merged fieldInfo > at the end instead of unfiltered up front? Or perhaps upon reading a > segment, we make it cheap/easy for each index type (e.g. terms index, stored > fields, ...) to know which fields have data for the corresponding type. > Then, on a subsequent merge, we know up front to filter the FieldInfos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge
[ https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687705#comment-16687705 ] Christophe Bismuth commented on LUCENE-8551: I'll first implement unused {{FieldInfo}} tracking and let you know. > Purge unused FieldInfo on segment merge > --- > > Key: LUCENE-8551 > URL: https://issues.apache.org/jira/browse/LUCENE-8551 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: David Smiley >Priority: Major > > If a field is effectively unused (no norms, terms index, term vectors, > docValues, stored value, points index), it will nonetheless hang around in > FieldInfos indefinitely. It would be nice to be able to recognize an unused > FieldInfo and allow it to disappear after a merge (or two). > SegmentMerger merges FieldInfo (from each segment) as nearly the first thing > it does. After that, the different index parts, before it's known what's > "used" or not. After writing, we theoretically know which fields are used or > not, though we're not doing any bookkeeping to track it. Maybe we should > track the fields used during writing so we write a filtered merged fieldInfo > at the end instead of unfiltered up front? Or perhaps upon reading a > segment, we make it cheap/easy for each index type (e.g. terms index, stored > fields, ...) to know which fields have data for the corresponding type. > Then, on a subsequent merge, we know up front to filter the FieldInfos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge
[ https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687703#comment-16687703 ] Christophe Bismuth commented on LUCENE-8551: Sounds challenging, I'd like to work in it! > Purge unused FieldInfo on segment merge > --- > > Key: LUCENE-8551 > URL: https://issues.apache.org/jira/browse/LUCENE-8551 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: David Smiley >Priority: Major > > If a field is effectively unused (no norms, terms index, term vectors, > docValues, stored value, points index), it will nonetheless hang around in > FieldInfos indefinitely. It would be nice to be able to recognize an unused > FieldInfo and allow it to disappear after a merge (or two). > SegmentMerger merges FieldInfo (from each segment) as nearly the first thing > it does. After that, the different index parts, before it's known what's > "used" or not. After writing, we theoretically know which fields are used or > not, though we're not doing any bookkeeping to track it. Maybe we should > track the fields used during writing so we write a filtered merged fieldInfo > at the end instead of unfiltered up front? Or perhaps upon reading a > segment, we make it cheap/easy for each index type (e.g. terms index, stored > fields, ...) to know which fields have data for the corresponding type. > Then, on a subsequent merge, we know up front to filter the FieldInfos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687694#comment-16687694 ] Christophe Bismuth commented on LUCENE-8552: Hi [~dsmiley], I've opened PR [#8552|https://github.com/apache/lucene-solr/pull/499] on GitHub to implement this improvement. > optimize getMergedFieldInfos for one-segment FieldInfos > --- > > Key: LUCENE-8552 > URL: https://issues.apache.org/jira/browse/LUCENE-8552 > Project: Lucene - Core > Issue Type: New Feature >Reporter: David Smiley >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the > first and only LeafReader if there is only one LeafReader. > Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo > implemented equals() & hashCode() (including a cached hashCode), maybe we > could also call equals() iterating through the FieldInfos to see if we should > bother adding it to the FieldInfos.Builder? Admittedly this is speculative; > may not be worth the bother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
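For context, the one-segment short-circuit the issue asks for boils down to the following. This is a minimal illustrative sketch written as an external helper, not the contents of the PR, which would apply the same check inside FieldInfos.getMergedFieldInfos itself.
{code:java}
import java.util.List;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

final class MergedFieldInfosShortCircuit {
  // When the reader has a single leaf, its FieldInfos are already the "merged"
  // view, so the FieldInfos.Builder round trip can be skipped entirely.
  static FieldInfos mergedFieldInfos(IndexReader reader) {
    List<LeafReaderContext> leaves = reader.leaves();
    if (leaves.size() == 1) {
      return leaves.get(0).reader().getFieldInfos();
    }
    // Multi-segment case: fall back to the existing merge logic.
    return FieldInfos.getMergedFieldInfos(reader);
  }
}
{code}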
[jira] [Comment Edited] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687694#comment-16687694 ] Christophe Bismuth edited comment on LUCENE-8552 at 11/15/18 9:21 AM: -- Hi [~dsmiley], I've opened PR [#8552|https://github.com/apache/lucene-solr/pull/499] on GitHub to implement this feature. was (Author: cbismuth): Hi [~dsmiley], I've opened PR [#8552|https://github.com/apache/lucene-solr/pull/499] on GitHub to implement this improvement. > optimize getMergedFieldInfos for one-segment FieldInfos > --- > > Key: LUCENE-8552 > URL: https://issues.apache.org/jira/browse/LUCENE-8552 > Project: Lucene - Core > Issue Type: New Feature >Reporter: David Smiley >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the > first and only LeafReader if there is only one LeafReader. > Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo > implemented equals() & hashCode() (including a cached hashCode), maybe we > could also call equals() iterating through the FieldInfos to see if we should > bother adding it to the FieldInfos.Builder? Admittedly this is speculative; > may not be worth the bother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683801#comment-16683801 ] Christophe Bismuth edited comment on LUCENE-8552 at 11/15/18 8:42 AM: -- Is the underlying idea to limit the number of {{FieldInfo}} instances added to the {{FieldInfos.Builder}} for performance purposes? was (Author: cbismuth): Is the underlying idea to limit the number of {{FieldInfos}} instances added to the {{FieldInfos.Builder}} for performance purposes? > optimize getMergedFieldInfos for one-segment FieldInfos > --- > > Key: LUCENE-8552 > URL: https://issues.apache.org/jira/browse/LUCENE-8552 > Project: Lucene - Core > Issue Type: New Feature >Reporter: David Smiley >Priority: Minor > > FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the > first and only LeafReader if there is only one LeafReader. > Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo > implemented equals() & hashCode() (including a cached hashCode), maybe we > could also call equals() iterating through the FieldInfos to see if we should > bother adding it to the FieldInfos.Builder? Admittedly this is speculative; > may not be worth the bother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686470#comment-16686470 ] Christophe Bismuth commented on LUCENE-8552: Thanks a lot [~dsmiley]! I'll come back to you as soon as I have a patch. > optimize getMergedFieldInfos for one-segment FieldInfos > --- > > Key: LUCENE-8552 > URL: https://issues.apache.org/jira/browse/LUCENE-8552 > Project: Lucene - Core > Issue Type: New Feature >Reporter: David Smiley >Priority: Minor > > FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the > first and only LeafReader if there is only one LeafReader. > Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo > implemented equals() & hashCode() (including a cached hashCode), maybe we > could also call equals() iterating through the FieldInfos to see if we should > bother adding it to the FieldInfos.Builder? Admittedly this is speculative; > may not be worth the bother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8294) KeywordTokenizer hangs with user misconfigured inputs
[ https://issues.apache.org/jira/browse/LUCENE-8294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683907#comment-16683907 ] Christophe Bismuth commented on LUCENE-8294: Issue can be closed as fixed in [906679adc80f0fad1e5c311b03023c7bd95633d7|https://github.com/apache/lucene-solr/commit/906679adc80f0fad1e5c311b03023c7bd95633d7]. > KeywordTokenizer hangs with user misconfigured inputs > - > > Key: LUCENE-8294 > URL: https://issues.apache.org/jira/browse/LUCENE-8294 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 2.1 >Reporter: John Doe >Priority: Minor > > When a user configures the bufferSize to be 0, the while loop in > KeywordTokenizer.next() function hangs endlessly. Here is the code snippet. > {code:java} > public KeywordTokenizer(Reader input, int bufferSize) { > super(input); > this.buffer = new char[bufferSize];//bufferSize is misconfigured with 0 > this.done = false; > } > public Token next() throws IOException { > if (!done) { > done = true; > StringBuffer buffer = new StringBuffer(); > int length; > while (true) { > length = input.read(this.buffer); //length is always 0 when the > buffer.size == 0 > if (length == -1) break; > buffer.append(this.buffer, 0, length); > } > String text = buffer.toString(); > return new Token(text, 0, text.length()); > } > return null; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
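For readers landing here later, a defensive sketch of how the hang can be avoided is shown below. It mirrors the snippet quoted above, is illustrative only, and is not necessarily how the referenced commit resolves the issue: the idea is simply to reject a non-positive bufferSize up front and to treat a zero-length read as a stop condition rather than looping forever.
{code:java}
// Illustrative fragment in the style of the snippet above, not the committed fix.
public KeywordTokenizer(Reader input, int bufferSize) {
  super(input);
  if (bufferSize <= 0) {
    throw new IllegalArgumentException("bufferSize must be > 0, got " + bufferSize);
  }
  this.buffer = new char[bufferSize];
  this.done = false;
}

public Token next() throws IOException {
  if (!done) {
    done = true;
    StringBuffer text = new StringBuffer();
    int length;
    // Stop on end of stream (-1) and, defensively, on an empty read (0).
    while ((length = input.read(this.buffer)) > 0) {
      text.append(this.buffer, 0, length);
    }
    return new Token(text.toString(), 0, text.length());
  }
  return null;
}
{code}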
[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683801#comment-16683801 ] Christophe Bismuth commented on LUCENE-8552: Is the underlying idea to limit the number of {{FieldInfos}} instances added to the {{FieldInfos.Builder}} for performance purposes? > optimize getMergedFieldInfos for one-segment FieldInfos > --- > > Key: LUCENE-8552 > URL: https://issues.apache.org/jira/browse/LUCENE-8552 > Project: Lucene - Core > Issue Type: New Feature >Reporter: David Smiley >Priority: Minor > > FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the > first and only LeafReader if there is only one LeafReader. > Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo > implemented equals() & hashCode() (including a cached hashCode), maybe we > could also call equals() iterating through the FieldInfos to see if we should > bother adding it to the FieldInfos.Builder? Admittedly this is speculative; > may not be worth the bother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos
[ https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683779#comment-16683779 ] Christophe Bismuth commented on LUCENE-8552: Hi, I'd like to work on this one. > optimize getMergedFieldInfos for one-segment FieldInfos > --- > > Key: LUCENE-8552 > URL: https://issues.apache.org/jira/browse/LUCENE-8552 > Project: Lucene - Core > Issue Type: New Feature >Reporter: David Smiley >Priority: Minor > > FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the > first and only LeafReader if there is only one LeafReader. > Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo > implemented equals() & hashCode() (including a cached hashCode), maybe we > could also call equals() iterating through the FieldInfos to see if we should > bother adding it to the FieldInfos.Builder? Admittedly this is speculative; > may not be worth the bother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8026) ExitableDirectoryReader does not instrument points
[ https://issues.apache.org/jira/browse/LUCENE-8026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683763#comment-16683763 ] Christophe Bismuth commented on LUCENE-8026: Hi, I've opened PR [#497|https://github.com/apache/lucene-solr/pull/497] to fix this bug. > ExitableDirectoryReader does not instrument points > -- > > Key: LUCENE-8026 > URL: https://issues.apache.org/jira/browse/LUCENE-8026 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Trivial > Labels: newdev > Time Spent: 10m > Remaining Estimate: 0h > > This means it cannot interrupt range or geo queries. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
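To make the gap concrete: ExitableDirectoryReader checks a QueryTimeout while terms are iterated, but points are traversed through PointValues.IntersectVisitor callbacks, which is why range and geo queries currently cannot be interrupted; those callbacks are a natural place to test the clock. Below is a rough sketch of such a visitor wrapper, written against the QueryTimeout and PointValues.IntersectVisitor APIs as I understand them; it is not the patch in PR #497, and a plain RuntimeException stands in for the ExitingReaderException the real reader throws.
{code:java}
import java.io.IOException;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.QueryTimeout;

final class ExitableIntersectVisitor implements PointValues.IntersectVisitor {
  private static final int CHECK_EVERY = 1024; // only test the timeout periodically

  private final PointValues.IntersectVisitor in;
  private final QueryTimeout timeout;
  private int calls;

  ExitableIntersectVisitor(PointValues.IntersectVisitor in, QueryTimeout timeout) {
    this.in = in;
    this.timeout = timeout;
  }

  private void checkTimeout() {
    if (calls++ % CHECK_EVERY == 0 && timeout.shouldExit()) {
      throw new RuntimeException("Query took too long to iterate over points");
    }
  }

  @Override
  public void visit(int docID) throws IOException {
    checkTimeout();
    in.visit(docID);
  }

  @Override
  public void visit(int docID, byte[] packedValue) throws IOException {
    checkTimeout();
    in.visit(docID, packedValue);
  }

  @Override
  public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
    checkTimeout();
    return in.compare(minPackedValue, maxPackedValue);
  }
}
{code}
The remaining work, which this sketch leaves out, is wrapping PointValues itself so that intersect() installs this visitor around the one supplied by the query.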
[jira] [Commented] (LUCENE-8463) Early-terminate queries sorted by SortField.DOC
[ https://issues.apache.org/jira/browse/LUCENE-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683584#comment-16683584 ] Christophe Bismuth commented on LUCENE-8463: Hi, I've opened PR [#496|https://github.com/apache/lucene-solr/pull/496] to implement this improvement. > Early-terminate queries sorted by SortField.DOC > --- > > Key: LUCENE-8463 > URL: https://issues.apache.org/jira/browse/LUCENE-8463 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Labels: newdev > Time Spent: 10m > Remaining Estimate: 0h > > Currently TopFieldCollector only early-terminates when the search sort is a > prefix of the index sort, but it could also early-terminate when sorting by > doc id. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
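As a concrete illustration of why doc-id sort allows termination: leaves are collected in index order and doc ids increase within a leaf, so the first numHits collected documents are already the top numHits by SortField.DOC. The sketch below shows the idea with a standalone SimpleCollector rather than a TopFieldCollector change, assuming the Lucene 8 collector API (scoreMode(), CollectionTerminatedException); it is not the patch in PR #496.
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

final class FirstNDocsCollector extends SimpleCollector {
  private final int numHits;
  private final List<Integer> globalDocIds = new ArrayList<>();
  private int docBase;

  FirstNDocsCollector(int numHits) {
    this.numHits = numHits;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    docBase = context.docBase;
  }

  @Override
  public void collect(int doc) {
    if (globalDocIds.size() >= numHits) {
      // Every document still to come has a larger doc id and cannot compete;
      // later leaves terminate the same way on their first collected hit.
      throw new CollectionTerminatedException();
    }
    globalDocIds.add(docBase + doc);
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES; // sorting by doc id needs no scores
  }

  List<Integer> topDocIds() {
    return globalDocIds;
  }
}
{code}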
[jira] [Commented] (LUCENE-8463) Early-terminate queries sorted by SortField.DOC
[ https://issues.apache.org/jira/browse/LUCENE-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681232#comment-16681232 ] Christophe Bismuth commented on LUCENE-8463: Hi, I'd like to work on this one. > Early-terminate queries sorted by SortField.DOC > --- > > Key: LUCENE-8463 > URL: https://issues.apache.org/jira/browse/LUCENE-8463 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Labels: newdev > > Currently TopFieldCollector only early-terminates when the search sort is a > prefix of the index sort, but it could also early-terminate when sorting by > doc id. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore
[ https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680002#comment-16680002 ] Christophe Bismuth commented on LUCENE-8464: Hi, I've opened PR [#495|https://github.com/apache/lucene-solr/pull/495] on GitHub to implement this. Could you please tell me whether this implementation fits? Thank you. > Implement ConstantScoreScorer#setMinCompetitiveScore > > > Key: LUCENE-8464 > URL: https://issues.apache.org/jira/browse/LUCENE-8464 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Labels: newdev > Time Spent: 10m > Remaining Estimate: 0h > > We should make it so the iterator returns NO_MORE_DOCS after > setMinCompetitiveScore is called with a value that is greater than the > constant score. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
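For what it is worth, the behaviour the issue asks for can be pictured as a small iterator wrapper: once setMinCompetitiveScore is called with a value above the constant score, no document can be competitive, so the iterator can simply report exhaustion. This is a sketch of that idea, not the code in PR #495; markExhausted() is a hypothetical hook that the scorer's setMinCompetitiveScore would call.
{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

final class ConstantScoreIterator extends DocIdSetIterator {
  private final DocIdSetIterator in;
  private boolean exhausted;
  private int doc = -1;

  ConstantScoreIterator(DocIdSetIterator in) {
    this.in = in;
  }

  // Hypothetical hook, called when minCompetitiveScore exceeds the constant score.
  void markExhausted() {
    exhausted = true;
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() throws IOException {
    return doc = exhausted ? NO_MORE_DOCS : in.nextDoc();
  }

  @Override
  public int advance(int target) throws IOException {
    return doc = exhausted ? NO_MORE_DOCS : in.advance(target);
  }

  @Override
  public long cost() {
    return in.cost();
  }
}
{code}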
[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore
[ https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678314#comment-16678314 ] Christophe Bismuth commented on LUCENE-8464: Hi, I'd like to work on this one. > Implement ConstantScoreScorer#setMinCompetitiveScore > > > Key: LUCENE-8464 > URL: https://issues.apache.org/jira/browse/LUCENE-8464 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Labels: newdev > > We should make it so the iterator returns NO_MORE_DOCS after > setMinCompetitiveScore is called with a value that is greater than the > constant score. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org