[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore

2018-12-12 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719272#comment-16719272
 ] 

Christophe Bismuth commented on LUCENE-8464:


Thanks a lot [~romseygeek], you made my day :D
[~jim.ferenczi] provided some really great mentoring on this one (y) I hope to
find some other great issues to work on!

> Implement ConstantScoreScorer#setMinCompetitiveScore
> 
>
> Key: LUCENE-8464
> URL: https://issues.apache.org/jira/browse/LUCENE-8464
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: newdev
> Fix For: master (8.0)
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> We should make it so the iterator returns NO_MORE_DOCS after 
> setMinCompetitiveScore is called with a value that is greater than the 
> constant score.
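
As an illustrative aside, the behaviour described above can be sketched as a thin 
wrapper around the scorer's iterator. This is only a reading of the issue description, 
not the committed patch, and the class and field names below are made up:

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Hypothetical sketch of the idea behind LUCENE-8464, not the committed implementation.
final class MinCompetitiveAwareIterator extends DocIdSetIterator {
  private final DocIdSetIterator delegate; // iterator of the constant-score scorer
  private final float constantScore;       // the score every matching document would get
  private boolean exhausted = false;       // set once the scorer can no longer compete

  MinCompetitiveAwareIterator(DocIdSetIterator delegate, float constantScore) {
    this.delegate = delegate;
    this.constantScore = constantScore;
  }

  // Called when a collector raises the minimum competitive score above the constant score.
  void onMinCompetitiveScore(float minScore) {
    if (minScore > constantScore) {
      exhausted = true; // no remaining document can make it into the top hits
    }
  }

  @Override
  public int docID() {
    return delegate.docID();
  }

  @Override
  public int nextDoc() throws IOException {
    return exhausted ? NO_MORE_DOCS : delegate.nextDoc();
  }

  @Override
  public int advance(int target) throws IOException {
    return exhausted ? NO_MORE_DOCS : delegate.advance(target);
  }

  @Override
  public long cost() {
    return delegate.cost();
  }
}
{code}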






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-12-03 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707497#comment-16707497
 ] 

Christophe Bismuth commented on LUCENE-8548:


That's great, thanks [~jim.ferenczi] for all the details (y)

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (8.0), 7.7
>
> Attachments: LUCENE-8548.patch, screenshot-1.png, 
> testCyrillicWord.dot.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  
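
As an illustrative aside, the block-versus-script distinction above can be checked 
directly with the JDK: related blocks such as Greek and Greek Extended share the single 
GREEK script, and combining diacritics carry the INHERITED script, which is what makes 
script-based grouping attractive here (character choices below are examples, not from 
the report):

{code:java}
// Illustration only: Unicode scripts (unlike blocks) group the problematic cases together.
public class ScriptVsBlockDemo {
  public static void main(String[] args) {
    int alpha = 'α';        // GREEK block
    int extended = 'ἰ';     // GREEK_EXTENDED block, but still the GREEK script
    int diacritic = 0x0304; // COMBINING MACRON, script INHERITED

    System.out.println(Character.UnicodeBlock.of(alpha));      // GREEK
    System.out.println(Character.UnicodeBlock.of(extended));   // GREEK_EXTENDED
    System.out.println(Character.UnicodeScript.of(alpha));     // GREEK
    System.out.println(Character.UnicodeScript.of(extended));  // GREEK
    System.out.println(Character.UnicodeScript.of(diacritic)); // INHERITED
  }
}
{code}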






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-28 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701958#comment-16701958
 ] 

Christophe Bismuth commented on LUCENE-8548:


Thanks a lot for sharing this [~jim.ferenczi], and no worries at all, the 
first iteration was an interesting journey! I think taking time to read about 
the Viterbi algorithm would help me some more, let's add it to my own todo 
list :D

I diffed your patch against {{master}} and stepped through the new tests, and I 
think I understand the big picture. Among other things, I had completely missed 
the {{if (isCommonOrInherited(scriptCode) && isCommonOrInherited(sc) == false)}} 
condition, which is essential.

I still have one more question: could you please explain what information is 
contained in the {{wordIdRef}} variable and what the 
{{unkDictionary.lookupWordIds(characterId, wordIdRef)}} statement does? The 
debugger tells me {{wordIdRef.length}} is always equal to 36 or 42, and even 
though 42 is a really great number, I'm a tiny bit lost in there ...
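
For what it's worth, here is how I currently read that call, loosely based on the 
analogous unknown-word handling in the Japanese tokenizer; the names and exact 
semantics below are guesses on my part, so please correct me:

{code:java}
import org.apache.lucene.analysis.ko.dict.UnknownDictionary;
import org.apache.lucene.util.IntsRef;

// Unconfirmed reading of the lookupWordIds call: wordIdRef is an IntsRef that gets
// filled with every unknown-word entry registered for the given character class,
// and each entry carries its own costs for the Viterbi lattice.
final class UnknownWordCostsSketch {
  static void dumpCosts(UnknownDictionary unkDictionary, int characterId) {
    IntsRef wordIdRef = new IntsRef();
    // One character class can map to several unknown-word entries,
    // which would explain lengths such as 36 or 42.
    unkDictionary.lookupWordIds(characterId, wordIdRef);
    for (int ofs = 0; ofs < wordIdRef.length; ofs++) {
      int wordId = wordIdRef.ints[wordIdRef.offset + ofs];
      System.out.println("wordId=" + wordId
          + " wordCost=" + unkDictionary.getWordCost(wordId)
          + " leftId=" + unkDictionary.getLeftId(wordId));
    }
  }
}
{code}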

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8548.patch, screenshot-1.png, 
> testCyrillicWord.dot.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-23 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697336#comment-16697336
 ] 

Christophe Bismuth commented on LUCENE-8548:


I've made some progress and opened PR 
[#505|https://github.com/apache/lucene-solr/pull/505] to share it with you. 
Feel free to stop me, as I don't want to waste your time.

Here is what has been done so far:

 * Break on script boundaries using the built-in JDK API (see the sketch after 
the screenshot below),
 * Track character classes in a growing byte array,
 * I feel a tiny bit lost when it comes to extracting costs: should I call 
{{unkDictionary.lookupWordIds(characterId, wordIdRef)}} for each tracked 
character class?
 * The {{мoscow}} word is correctly parsed in the Graphviz output below ...
 * ... but the test fails on this 
[line|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java#L199]
 and I still have to understand why.

!screenshot-1.png!
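
For reference, the script-boundary grouping I prototyped looks roughly like the 
simplified sketch below; it is not the exact PR code, only the idea of attaching 
COMMON/INHERITED characters to the current run:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the prototype: break text into runs of the same Unicode
// script, but keep COMMON/INHERITED characters (apostrophes, combining diacritics,
// digits, ...) attached to the current run instead of starting a new one.
final class ScriptRuns {

  static boolean isCommonOrInherited(Character.UnicodeScript script) {
    return script == Character.UnicodeScript.COMMON
        || script == Character.UnicodeScript.INHERITED;
  }

  static List<String> split(String text) {
    List<String> runs = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    Character.UnicodeScript currentScript = null;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      Character.UnicodeScript script = Character.UnicodeScript.of(cp);
      boolean sameRun = currentScript == null
          || isCommonOrInherited(script)
          || script == currentScript;
      if (!sameRun) {
        runs.add(current.toString());
        current.setLength(0);
      }
      if (!isCommonOrInherited(script)) {
        currentScript = script; // remember the last "real" script of the run
      }
      current.appendCodePoint(cp);
      i += Character.charCount(cp);
    }
    if (current.length() > 0) {
      runs.add(current.toString());
    }
    return runs;
  }

  public static void main(String[] args) {
    System.out.println(split("Ба̀лтичко̄")); // one Cyrillic run, diacritics stay attached
    System.out.println(split("don't"));    // one run, the apostrophe is COMMON
  }
}
{code}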

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: screenshot-1.png, testCyrillicWord.dot.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Updated] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-23 Thread Christophe Bismuth (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe Bismuth updated LUCENE-8548:
---
Attachment: screenshot-1.png

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: screenshot-1.png, testCyrillicWord.dot.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-22 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696093#comment-16696093
 ] 

Christophe Bismuth commented on LUCENE-8548:


That is really nice, thank you [~jim.ferenczi] and [~rcmuir], I should be able 
to start working on a patch now. Thanks again!

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-22 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696052#comment-16696052
 ] 

Christophe Bismuth commented on LUCENE-8548:


Great! Thank you [~rcmuir], I'll dig into this (y)

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Comment Edited] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-21 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827
 ] 

Christophe Bismuth edited comment on LUCENE-8548 at 11/21/18 3:12 PM:
--

I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some 
mentoring to keep going.

Here is what I've done so far:
 * Implement a failing Cyrillic test (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make the Ant {{nori}} module depend on the {{icu}} module in order to 
reuse some {{ICUTokenizer}} logic (but I failed to tweak the Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method 
step by step
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with the {{outputUnknownUnigrams}} 
parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output:

{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2
no arcs in; skip pos=3
no arcs in; skip pos=4
no arcs in; skip pos=5
no arcs in; skip pos=6
  end: 1 nodes

  backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235
add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=7
{noformat}


was (Author: cbismuth):
I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some 
mentoring to keep going on.

Here is what I've done so far:
 * Implement a Cyrillic test failure (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse 
some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug step by step the 
{{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with {{outputUnknownUnigrams}} 
parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output below:

{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2
no 

[jira] [Comment Edited] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-21 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827
 ] 

Christophe Bismuth edited comment on LUCENE-8548 at 11/21/18 3:12 PM:
--

I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some 
mentoring to keep going.

Here is what I've done so far:
 * Implement a failing Cyrillic test (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make the Ant {{nori}} module depend on the {{icu}} module in order to 
reuse some {{ICUTokenizer}} logic (but I failed to tweak the Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method 
step by step
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with the {{outputUnknownUnigrams}} 
parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output below:

{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2
no arcs in; skip pos=3
no arcs in; skip pos=4
no arcs in; skip pos=5
no arcs in; skip pos=6
  end: 1 nodes

  backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235
add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=7
{noformat}


was (Author: cbismuth):
I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some 
mentoring to keep going on.

Here is what I've done so far:
 * Implement a Cyrillic test failure (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse 
some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug step by step the 
{{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with {{outputUnknownUnigrams}} 
parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output below.
{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2

[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-21 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827
 ] 

Christophe Bismuth commented on LUCENE-8548:


I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some 
mentoring to keep going.

Here is what I've done so far:
 * Implement a failing Cyrillic test (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make the Ant {{nori}} module depend on the {{icu}} module in order to 
reuse some {{ICUTokenizer}} logic (but I failed to tweak the Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method 
step by step
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with the {{outputUnknownUnigrams}} 
parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output below.
{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2
no arcs in; skip pos=3
no arcs in; skip pos=4
no arcs in; skip pos=5
no arcs in; skip pos=6
  end: 1 nodes

  backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235
add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=7
{noformat}

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and 

[jira] [Comment Edited] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-21 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694827#comment-16694827
 ] 

Christophe Bismuth edited comment on LUCENE-8548 at 11/21/18 3:11 PM:
--

I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some 
mentoring to keep going.

Here is what I've done so far:
 * Implement a failing Cyrillic test (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make the Ant {{nori}} module depend on the {{icu}} module in order to 
reuse some {{ICUTokenizer}} logic (but I failed to tweak the Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method 
step by step
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with the {{outputUnknownUnigrams}} 
parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output below.
{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2
no arcs in; skip pos=3
no arcs in; skip pos=4
no arcs in; skip pos=5
no arcs in; skip pos=6
  end: 1 nodes

  backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235
add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=7
{noformat}


was (Author: cbismuth):
I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some 
mentoring to keep going on.

Here is what I've done so far:
 * Implement a Cyrillic test failure (see previous comment)
 * Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
 * Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class 
attribute (following UAX #29: Unicode Text Segmentation)
 * Try to make Ant {{nori}} module depend on {{icu}} module to try to reuse 
some {{ICUTokenizer}} logic parts (but I failed to tweak Ant scripts)
 * Enable verbose output (see output below)
 * Enable Graphviz output (see attached picture)
 * Debug step by step the 
{{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method
 * Add a breakpoint in the {{DictionaryToken}} constructor to try to understand 
how and when tokens are built (I also played with {{outputUnknownUnigrams}} 
parameters)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output below.
{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
1 arcs in
UNKNOWN word len=1 1 wordIDs
  fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) 
leftID=1793 leftPOS=SL)
**
  + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)
  freeBefore pos=1
TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]:incToken: 
return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 
leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
1 arcs in
UNKNOWN word len=6 1 wordIDs
  fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 
spacePenalty=0) leftID=1793 leftPOS=SL)
**
  + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
no arcs in; skip pos=2

[jira] [Updated] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-21 Thread Christophe Bismuth (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe Bismuth updated LUCENE-8548:
---
Attachment: testCyrillicWord.dot.png

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-20 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693478#comment-16693478
 ] 

Christophe Bismuth commented on LUCENE-8548:


I'll use the test failure below as a starting point.

{code:java}
  // LUCENE-8548 - file TestKoreanAnalyzer.java
  public void testCyrillicWord() throws IOException {
    final Analyzer analyzer = new KoreanAnalyzer(TestKoreanTokenizer.readDict(),
        KoreanTokenizer.DEFAULT_DECOMPOUND,
        KoreanPartOfSpeechStopFilter.DEFAULT_STOP_TAGS, false);
    assertAnalyzesTo(analyzer, "мoscow",
        new String[]{"мoscow"},
        new int[]{0},
        new int[]{6},
        new int[]{1}
    );
  }
{code}

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos

2018-11-17 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690466#comment-16690466
 ] 

Christophe Bismuth commented on LUCENE-8552:


Thank you for your help [~dsmiley] (y)

> optimize getMergedFieldInfos for one-segment FieldInfos
> ---
>
> Key: LUCENE-8552
> URL: https://issues.apache.org/jira/browse/LUCENE-8552
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Fix For: 7.7
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the 
> first and only LeafReader if there is only one LeafReader.
> Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo 
> implemented equals() & hashCode() (including a cached hashCode), maybe we 
> could also call equals() iterating through the FieldInfos to see if we should 
> bother adding it to the FieldInfos.Builder?  Admittedly this is speculative; 
> may not be worth the bother.
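
As an illustrative aside, the single-segment shortcut described above could look roughly 
like the sketch below (illustration only, not the committed patch; the empty-index and 
multi-leaf branches are just placeholders):

{code:java}
import java.util.List;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

// Sketch of the proposed shortcut: with a single leaf there is nothing to merge,
// so its FieldInfos can be returned as-is.
final class MergedFieldInfosShortcut {
  static FieldInfos getMergedFieldInfos(IndexReader reader) {
    List<LeafReaderContext> leaves = reader.leaves();
    if (leaves.isEmpty()) {
      return new FieldInfos(new FieldInfo[0]); // empty index: no fields at all
    }
    if (leaves.size() == 1) {
      return leaves.get(0).reader().getFieldInfos(); // single segment: skip the merge
    }
    // More than one leaf: fall back to the existing merge over all segments.
    return FieldInfos.getMergedFieldInfos(reader);
  }
}
{code}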






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-16 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690004#comment-16690004
 ] 

Christophe Bismuth commented on LUCENE-8548:


Yes, I'm interested in this issue (y) I'll start working on it and let you know.

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8463) Early-terminate queries sorted by SortField.DOC

2018-11-16 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689987#comment-16689987
 ] 

Christophe Bismuth commented on LUCENE-8463:


Thanks a lot [~jim.ferenczi] :D

> Early-terminate queries sorted by SortField.DOC
> ---
>
> Key: LUCENE-8463
> URL: https://issues.apache.org/jira/browse/LUCENE-8463
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: newdev
> Fix For: master (8.0), 7.7
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Currently TopFieldCollector only early-terminates when the search sort is a 
> prefix of the index sort, but it could also early-terminate when sorting by 
> doc id.
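
As an illustrative aside, the doc-id case can be sketched with a collector that stops 
each leaf early. This is only a reading of the issue (using the 7.x-style 
{{needsScores}} API), not the TopFieldCollector change itself:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.SimpleCollector;

// Rough illustration: when sorting by doc id, each leaf delivers documents in sorted
// order, so a leaf can stop as soon as it has gathered numHits candidates.
final class DocOrderEarlyTerminatingCollector extends SimpleCollector {
  private final int numHits;
  private int collectedInLeaf;

  DocOrderEarlyTerminatingCollector(int numHits) {
    this.numHits = numHits;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    collectedInLeaf = 0; // each leaf can contribute at most numHits competitive docs
  }

  @Override
  public void collect(int doc) throws IOException {
    // ... record the hit (docBase + doc) in the priority queue here ...
    if (++collectedInLeaf >= numHits) {
      // The remaining docs in this leaf have larger doc ids and cannot be competitive.
      throw new CollectionTerminatedException();
    }
  }

  @Override
  public boolean needsScores() {
    return false; // sorting by doc id does not require scores
  }
}
{code}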






[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore

2018-11-16 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689497#comment-16689497
 ] 

Christophe Bismuth commented on LUCENE-8464:


Thank you [~jim.ferenczi], it was a great experience (y)

> Implement ConstantScoreScorer#setMinCompetitiveScore
> 
>
> Key: LUCENE-8464
> URL: https://issues.apache.org/jira/browse/LUCENE-8464
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: newdev
> Fix For: master (8.0)
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> We should make it so the iterator returns NO_MORE_DOCS after 
> setMinCompetitiveScore is called with a value that is greater than the 
> constant score.






[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688242#comment-16688242
 ] 

Christophe Bismuth commented on LUCENE-8548:


Hi [~jim.ferenczi], have you started to work on a patch, or maybe I could help? 
I'm not a Unicode guru, but I can read the docs and learn. Feel free to let me 
know.

> Reevaluate scripts boundary break in Nori's tokenizer
> -
>
> Key: LUCENE-8548
> URL: https://issues.apache.org/jira/browse/LUCENE-8548
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  






[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688170#comment-16688170
 ] 

Christophe Bismuth commented on LUCENE-8551:


The overhead makes me think a dedicated optimize/purge API would be wiser, but 
I don't know the NRT internals well enough to have a valuable opinion on the 
second point.

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.
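
As an illustrative aside, one possible reading of "effectively unused" in terms of the 
7.x {{FieldInfo}} API is sketched below; the exact checks (and how stored-field usage is 
accounted for) are precisely what the bookkeeping discussed above would have to settle:

{code:java}
import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;

// One possible reading of "effectively unused" (illustration only; stored-field usage
// is not visible from FieldInfo alone, which is part of why extra bookkeeping during
// the merge is being discussed).
final class FieldInfoUsage {
  static boolean looksUnused(FieldInfo fi) {
    return fi.getIndexOptions() == IndexOptions.NONE   // no postings (and hence no norms)
        && fi.getDocValuesType() == DocValuesType.NONE // no doc values
        && fi.getPointDimensionCount() == 0            // no points index
        && fi.hasVectors() == false;                   // no term vectors
  }
}
{code}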






[jira] [Comment Edited] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687703#comment-16687703
 ] 

Christophe Bismuth edited comment on LUCENE-8551 at 11/15/18 2:40 PM:
--

Sounds challenging, I'd like to work on it!


was (Author: cbismuth):
Sounds challenging, I'd like to work in it!

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, it merges the different index parts, before it's known 
> what's "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687705#comment-16687705
 ] 

Christophe Bismuth commented on LUCENE-8551:


I'll first implement unused {{FieldInfo}} tracking and let you know.

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, it merges the different index parts, before it's known 
> what's "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687703#comment-16687703
 ] 

Christophe Bismuth commented on LUCENE-8551:


Sounds challenging, I'd like to work in it!

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, it merges the different index parts, before it's known 
> what's "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687694#comment-16687694
 ] 

Christophe Bismuth commented on LUCENE-8552:


Hi [~dsmiley], I've opened PR 
[#8552|https://github.com/apache/lucene-solr/pull/499] on GitHub to implement 
this improvement.

> optimize getMergedFieldInfos for one-segment FieldInfos
> ---
>
> Key: LUCENE-8552
> URL: https://issues.apache.org/jira/browse/LUCENE-8552
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the 
> first and only LeafReader if there is only one LeafReader.
> Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo 
> implemented equals() & hashCode() (including a cached hashCode), maybe we 
> could also call equals() iterating through the FieldInfos to see if we should 
> bother adding it to the FieldInfos.Builder?  Admittedly this is speculative; 
> may not be worth the bother.
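
A rough sketch of the single-leaf short-circuit (not necessarily what the linked PR does), assuming the existing builder-based merge inside FieldInfos, with its package-private Builder, as the fallback path.

{code:java}
// Sketch only: short-circuit getMergedFieldInfos when there is nothing to merge.
public static FieldInfos getMergedFieldInfos(IndexReader reader) {
  final List<LeafReaderContext> leaves = reader.leaves();
  if (leaves.isEmpty()) {
    return new FieldInfos(new FieldInfo[0]); // no segments: nothing to merge
  }
  if (leaves.size() == 1) {
    // Only one segment: its FieldInfos already is the merged view, return it as-is.
    return leaves.get(0).reader().getFieldInfos();
  }
  final Builder builder = new Builder(); // multi-segment case: merge as before
  for (final LeafReaderContext ctx : leaves) {
    builder.add(ctx.reader().getFieldInfos());
  }
  return builder.finish();
}
{code}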



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687694#comment-16687694
 ] 

Christophe Bismuth edited comment on LUCENE-8552 at 11/15/18 9:21 AM:
--

Hi [~dsmiley], I've opened PR 
[#8552|https://github.com/apache/lucene-solr/pull/499] on GitHub to implement 
this feature.


was (Author: cbismuth):
Hi [~dsmiley], I've opened PR 
[#8552|https://github.com/apache/lucene-solr/pull/499] on GitHub to implement 
this improvement.

> optimize getMergedFieldInfos for one-segment FieldInfos
> ---
>
> Key: LUCENE-8552
> URL: https://issues.apache.org/jira/browse/LUCENE-8552
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the 
> first and only LeafReader if there is only one LeafReader.
> Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo 
> implemented equals() & hashCode() (including a cached hashCode), maybe we 
> could also call equals() iterating through the FieldInfos to see if we should 
> bother adding it to the FieldInfos.Builder?  Admittedly this is speculative; 
> may not be worth the bother.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683801#comment-16683801
 ] 

Christophe Bismuth edited comment on LUCENE-8552 at 11/15/18 8:42 AM:
--

Is the underlying idea to limit the number of {{FieldInfo}} instances added to 
the {{FieldInfos.Builder}} for performance purposes?


was (Author: cbismuth):
Is the underlying idea to limit the number of {{FieldInfos}} instances added to 
the {{FieldInfos.Builder}} for performance purposes?

> optimize getMergedFieldInfos for one-segment FieldInfos
> ---
>
> Key: LUCENE-8552
> URL: https://issues.apache.org/jira/browse/LUCENE-8552
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Priority: Minor
>
> FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the 
> first and only LeafReader if there is only one LeafReader.
> Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo 
> implemented equals() & hashCode() (including a cached hashCode), maybe we 
> could also call equals() iterating through the FieldInfos to see if we should 
> bother adding it to the FieldInfos.Builder?  Admittedly this is speculative; 
> may not be worth the bother.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos

2018-11-14 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686470#comment-16686470
 ] 

Christophe Bismuth commented on LUCENE-8552:


Thanks a lot [~dsmiley]! I'll come back to you as soon as I have a patch.

> optimize getMergedFieldInfos for one-segment FieldInfos
> ---
>
> Key: LUCENE-8552
> URL: https://issues.apache.org/jira/browse/LUCENE-8552
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Priority: Minor
>
> FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the 
> first and only LeafReader if there is only one LeafReader.
> Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo 
> implemented equals() & hashCode() (including a cached hashCode), maybe we 
> could also call equals() iterating through the FieldInfos to see if we should 
> bother adding it to the FieldInfos.Builder?  Admittedly this is speculative; 
> may not be worth the bother.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8294) KeywordTokenizer hangs with user misconfigured inputs

2018-11-12 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683907#comment-16683907
 ] 

Christophe Bismuth commented on LUCENE-8294:


Issue can be closed as fixed in 
[906679adc80f0fad1e5c311b03023c7bd95633d7|https://github.com/apache/lucene-solr/commit/906679adc80f0fad1e5c311b03023c7bd95633d7].

> KeywordTokenizer hangs with user misconfigured inputs
> -
>
> Key: LUCENE-8294
> URL: https://issues.apache.org/jira/browse/LUCENE-8294
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: John Doe
>Priority: Minor
>
> When a user configures the bufferSize to be 0, the while loop in 
> KeywordTokenizer.next() function hangs endlessly. Here is the code snippet.
> {code:java}
>   public KeywordTokenizer(Reader input, int bufferSize) {
> super(input);
> this.buffer = new char[bufferSize];//bufferSize is misconfigured with 0
> this.done = false;
>   }
>   public Token next() throws IOException {
> if (!done) {
>   done = true;
>   StringBuffer buffer = new StringBuffer();
>   int length;
>   while (true) {
> length = input.read(this.buffer); //length is always 0 when the buffer.size == 0
> if (length == -1) break;
> buffer.append(this.buffer, 0, length);
>   }
>   String text = buffer.toString();
>   return new Token(text, 0, text.length());
> }
> return null;
>   }
> {code}
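
One way to guard against this misconfiguration is to validate the buffer size in the constructor rather than letting next() loop forever. This is a sketch of that idea applied to the snippet above, not necessarily the committed fix referenced earlier.

{code:java}
public KeywordTokenizer(Reader input, int bufferSize) {
  super(input);
  if (bufferSize <= 0) {
    // Fail fast on a misconfigured buffer size instead of hanging in next().
    throw new IllegalArgumentException("bufferSize must be > 0, got: " + bufferSize);
  }
  this.buffer = new char[bufferSize];
  this.done = false;
}
{code}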



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos

2018-11-12 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683801#comment-16683801
 ] 

Christophe Bismuth commented on LUCENE-8552:


Is the underlying idea to limit the number of {{FieldInfos}} instances added to 
the {{FieldInfos.Builder}} for performance purposes?

> optimize getMergedFieldInfos for one-segment FieldInfos
> ---
>
> Key: LUCENE-8552
> URL: https://issues.apache.org/jira/browse/LUCENE-8552
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Priority: Minor
>
> FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the 
> first and only LeafReader if there is only one LeafReader.
> Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo 
> implemented equals() & hashCode() (including a cached hashCode), maybe we 
> could also call equals() iterating through the FieldInfos to see if we should 
> bother adding it to the FieldInfos.Builder?  Admittedly this is speculative; 
> may not be worth the bother.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8552) optimize getMergedFieldInfos for one-segment FieldInfos

2018-11-12 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683779#comment-16683779
 ] 

Christophe Bismuth commented on LUCENE-8552:


Hi, I'd like to work on this one.

> optimize getMergedFieldInfos for one-segment FieldInfos
> ---
>
> Key: LUCENE-8552
> URL: https://issues.apache.org/jira/browse/LUCENE-8552
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Priority: Minor
>
> FieldInfos.getMergedFieldInfos could trivially return the FieldInfos of the 
> first and only LeafReader if there is only one LeafReader.
> Also... if there is more than one LeafReader, and if FieldInfos & FieldInfo 
> implemented equals() & hashCode() (including a cached hashCode), maybe we 
> could also call equals() iterating through the FieldInfos to see if we should 
> bother adding it to the FieldInfos.Builder?  Admittedly this is speculative; 
> may not be worth the bother.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8026) ExitableDirectoryReader does not instrument points

2018-11-12 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683763#comment-16683763
 ] 

Christophe Bismuth commented on LUCENE-8026:


Hi, I've opened PR [#497|https://github.com/apache/lucene-solr/pull/497] to fix 
this bug.

> ExitableDirectoryReader does not instrument points
> --
>
> Key: LUCENE-8026
> URL: https://issues.apache.org/jira/browse/LUCENE-8026
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Trivial
>  Labels: newdev
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This means it cannot interrupt range or geo queries.
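
A hedged sketch of what instrumenting points could look like: wrap the IntersectVisitor that a query hands to PointValues.intersect() so the tree traversal periodically checks the QueryTimeout, similar to how ExitableDirectoryReader already instruments terms enumeration. The class name and sampling interval below are assumptions, not existing Lucene API.

{code:java}
import java.io.IOException;

import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.QueryTimeout;

// Sketch only: a timeout-checking wrapper around a PointValues.IntersectVisitor.
final class ExitableIntersectVisitor implements PointValues.IntersectVisitor {

  private final PointValues.IntersectVisitor in;
  private final QueryTimeout queryTimeout;
  private int calls;

  ExitableIntersectVisitor(PointValues.IntersectVisitor in, QueryTimeout queryTimeout) {
    this.in = in;
    this.queryTimeout = queryTimeout;
  }

  private void checkTimeout() {
    // Sample the clock every 256 calls to keep the per-hit overhead small.
    if ((++calls & 0xFF) == 0 && queryTimeout.shouldExit()) {
      // The real reader would throw ExitableDirectoryReader's ExitingReaderException here.
      throw new RuntimeException("The request took too long to intersect points");
    }
  }

  @Override
  public void visit(int docID) throws IOException {
    checkTimeout();
    in.visit(docID);
  }

  @Override
  public void visit(int docID, byte[] packedValue) throws IOException {
    checkTimeout();
    in.visit(docID, packedValue);
  }

  @Override
  public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
    checkTimeout();
    return in.compare(minPackedValue, maxPackedValue);
  }
}
{code}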



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8463) Early-terminate queries sorted by SortField.DOC

2018-11-12 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683584#comment-16683584
 ] 

Christophe Bismuth commented on LUCENE-8463:


Hi, I've opened PR [#496|https://github.com/apache/lucene-solr/pull/496] to 
implement this improvement.

> Early-terminate queries sorted by SortField.DOC
> ---
>
> Key: LUCENE-8463
> URL: https://issues.apache.org/jira/browse/LUCENE-8463
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: newdev
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently TopFieldCollector only early-terminates when the search sort is a 
> prefix of the index sort, but it could also early-terminate when sorting by 
> doc id.
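
As a rough illustration of why SortField.DOC can terminate early (the class name is illustrative and the snippet targets the 8.x Collector API): docs are collected in increasing doc id order within a segment, so once enough hits have been gathered the rest of the segment cannot compete and the usual CollectionTerminatedException pattern applies.

{code:java}
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Sketch only: stop collecting a segment once numHits docs have been seen,
// because with doc-id sort every later doc would sort after the ones we already have.
final class DocOrderEarlyTerminatingCollector extends SimpleCollector {

  private final int numHits;
  private int collected;

  DocOrderEarlyTerminatingCollector(int numHits) {
    this.numHits = numHits;
  }

  @Override
  public void collect(int doc) {
    if (++collected > numHits) {
      // IndexSearcher catches this exception and moves on to the next leaf.
      throw new CollectionTerminatedException();
    }
    // a real collector would record the hit here
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES; // doc-id sort never needs scores
  }
}
{code}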



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8463) Early-terminate queries sorted by SortField.DOC

2018-11-09 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681232#comment-16681232
 ] 

Christophe Bismuth commented on LUCENE-8463:


Hi, I'd like to work on this one.

> Early-terminate queries sorted by SortField.DOC
> ---
>
> Key: LUCENE-8463
> URL: https://issues.apache.org/jira/browse/LUCENE-8463
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: newdev
>
> Currently TopFieldCollector only early-terminates when the search sort is a 
> prefix of the index sort, but it could also early-terminate when sorting by 
> doc id.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore

2018-11-08 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680002#comment-16680002
 ] 

Christophe Bismuth commented on LUCENE-8464:


Hi, I've opened PR [#495|https://github.com/apache/lucene-solr/pull/495] on 
GitHub to implement this. Could you please tell me whether this implementation 
fits? Thank you.

> Implement ConstantScoreScorer#setMinCompetitiveScore
> 
>
> Key: LUCENE-8464
> URL: https://issues.apache.org/jira/browse/LUCENE-8464
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: newdev
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should make it so the iterator returns NO_MORE_DOCS after 
> setMinCompetitiveScore is called with a value that is greater than the 
> constant score.
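
A hedged sketch of the behaviour being asked for (not the committed implementation): the scorer remembers its constant score and, once the minimum competitive score exceeds it, its iterator reports NO_MORE_DOCS. Names are illustrative, and the real ConstantScoreScorer also has to handle the two-phase iterator case.

{code:java}
import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

// Sketch only, against the 8.x Scorer API.
final class SketchConstantScoreScorer extends Scorer {

  private final float constantScore;
  private final DocIdSetIterator in;
  private boolean exhausted; // flipped when the constant score stops being competitive

  SketchConstantScoreScorer(Weight weight, float constantScore, DocIdSetIterator in) {
    super(weight);
    this.constantScore = constantScore;
    this.in = in;
  }

  @Override
  public void setMinCompetitiveScore(float minScore) {
    if (minScore > constantScore) {
      exhausted = true; // every remaining doc would score below the threshold
    }
  }

  @Override
  public DocIdSetIterator iterator() {
    return new DocIdSetIterator() {
      @Override public int docID() { return in.docID(); }
      @Override public long cost() { return in.cost(); }
      @Override public int nextDoc() throws IOException {
        // Jump straight to the end once the scorer is no longer competitive.
        return exhausted ? in.advance(NO_MORE_DOCS) : in.nextDoc();
      }
      @Override public int advance(int target) throws IOException {
        return exhausted ? in.advance(NO_MORE_DOCS) : in.advance(target);
      }
    };
  }

  @Override public int docID() { return in.docID(); }
  @Override public float score() { return constantScore; }
  @Override public float getMaxScore(int upTo) { return constantScore; }
}
{code}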



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8464) Implement ConstantScoreScorer#setMinCompetitiveScore

2018-11-07 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678314#comment-16678314
 ] 

Christophe Bismuth commented on LUCENE-8464:


Hi, I'd like to work on this one.

> Implement ConstantScoreScorer#setMinCompetitiveScore
> 
>
> Key: LUCENE-8464
> URL: https://issues.apache.org/jira/browse/LUCENE-8464
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: newdev
>
> We should make it so the iterator returns NO_MORE_DOCS after 
> setMinCompetitiveScore is called with a value that is greater than the 
> constant score.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org