[jira] [Commented] (LUCENE-8204) ReqOptSumScorer should leverage sub scorers' per-block max scores

2018-08-08 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573030#comment-16573030
 ] 

Jim Ferenczi commented on LUCENE-8204:
--

Thanks Adrien, I pushed a new patch that addresses your comments.

> ReqOptSumScorer should leverage sub scorers' per-block max scores
> -
>
> Key: LUCENE-8204
> URL: https://issues.apache.org/jira/browse/LUCENE-8204
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8204.patch, LUCENE-8204.patch, LUCENE-8204.patch, 
> LUCENE-8204.patch, LUCENE-8204.patch
>
>
> Currently it only looks at max scores on the entire segment. Given that 
> per-block max scores usually give lower upper bounds of the score, this 
> should help.
> This is especially important for LUCENE-8197 to work well since the main 
> query would typically be added as a MUST clauses of a boolean query while the 
> query that scores on features would be a SHOULD clause.






[jira] [Resolved] (LUCENE-8439) DisjunctionMaxScorer should leverage sub scorers' per-block max scores

2018-08-08 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8439.
--
   Resolution: Fixed
Fix Version/s: master (8.0)

Thanks [~jpountz] !

> DisjunctionMaxScorer should leverage sub scorers' per-block max scores
> --
>
> Key: LUCENE-8439
> URL: https://issues.apache.org/jira/browse/LUCENE-8439
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (8.0)
>
> Attachments: LUCENE-8439.patch, LUCENE-8439.patch, LUCENE-8439.patch, 
> LUCENE-8439.patch, LUCENE-8439.patch
>
>
> This issue is similar to https://issues.apache.org/jira/browse/LUCENE-8204 
> but for the DisjunctionMaxScorer.






[jira] [Commented] (LUCENE-8448) Slowdown of nested boolean queries after LUCENE-8060

2018-08-14 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579721#comment-16579721
 ] 

Jim Ferenczi commented on LUCENE-8448:
--

Adrien and I tried several things to optimize the nested boolean case. 
Currently boolean queries don't propagate the minimum competitive score to their sub 
scorers. However, in the first version of max scores, the MaxScoreSumPropagator 
used to compute a minimum score per sub clause based on the sum of the maximum 
scores of the other clauses. This optimization was removed at some point 
because it hurt simple boolean queries that contain term 
clauses only. A lot has changed in the meantime (max scores are now computed 
per block, ...), so we revived this optimization and applied it to 
all boolean scorers in order to run some benchmarks. We used wikimediumall and added the 
problematic queries from the nightly benchmark; the results are below:
{noformat}
Task              QPS lucene_baseline  StdDev  QPS lucene_candidate  StdDev  Pct diff
OrHighMed 28.08 (7.7%) 27.37 (8.8%) -2.5% ( -17% - 15%)
AndHighHigh 21.15 (9.5%) 20.99 (10.2%) -0.8% ( -18% - 20%)
AndHighMed 58.19 (8.8%) 57.80 (9.2%) -0.7% ( -17% - 18%)
OrHighHigh 11.92 (7.7%) 11.90 (9.2%) -0.1% ( -15% - 18%)
OrHighLow 259.35 (7.2%) 261.80 (8.7%) 0.9% ( -13% - 18%)
OrNotHighLow 582.99 (7.8%) 588.83 (9.8%) 1.0% ( -15% - 20%)
Fuzzy2 56.86 (6.8%) 57.67 (8.2%) 1.4% ( -12% - 17%)
AndHighLow 340.56 (7.4%) 345.60 (9.7%) 1.5% ( -14% - 20%)
Fuzzy1 53.38 (6.9%) 54.22 (8.6%) 1.6% ( -13% - 18%)
Wildcard 17.41 (8.3%) 17.73 (9.4%) 1.8% ( -14% - 21%)
Prefix3 22.16 (8.4%) 22.57 (9.7%) 1.9% ( -14% - 21%)
OrNotHighMed 803.13 (8.2%) 818.85 (9.8%) 2.0% ( -14% - 21%)
HighTerm 1333.98 (8.1%) 1361.12 (10.1%) 2.0% ( -14% - 22%)
OrNotHighHigh 790.52 (7.7%) 806.66 (9.8%) 2.0% ( -14% - 21%)
OrHighNotLow 960.80 (8.8%) 981.56 (10.1%) 2.2% ( -15% - 22%)
Respell 42.76 (7.7%) 43.71 (9.6%) 2.2% ( -13% - 21%)
MedTerm 1568.86 (8.1%) 1603.71 (10.1%) 2.2% ( -14% - 22%)
OrHighNotMed 999.26 (8.5%) 1022.44 (9.8%) 2.3% ( -14% - 22%)
OrHighNotHigh 791.65 (8.5%) 811.37 (10.4%) 2.5% ( -15% - 23%)
LowTerm 1611.84 (8.5%) 1660.90 (10.1%) 3.0% ( -14% - 23%)
AndMedOrHighHigh 5.53 (6.6%) 8.94 (12.8%) 61.6% ( 39% - 86%)
AndHighOrMedMed 8.45 (7.3%) 29.90 (33.6%) 253.8% ( 198% - 318%)
AndHighOrMedLow 13.68 (7.4%) 58.86 (37.4%) 330.2% ( 265% - 405%)
AndMedOrHighLow 2.01 (6.1%) 24.43 (92.6%) 1118.1% ( 960% - 1295%){noformat}

AndMedOrHighHigh and AndHighOrMedMed get a nice speedup; I also created 
AndHighOrMedLow and AndMedOrHighLow to show other kinds of speedups for nested 
boolean queries. 
We also tested other improvements, but they didn't work as well as this one and 
deserve their own issues (which I'll open as a follow-up).
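For reference, the propagation idea can be sketched roughly as follows (my own illustration, not the attached patch; it also ignores the floating-point rounding care that a real MaxScoreSumPropagator needs):

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;

class MinScorePropagationSketch {
  // Once the collector has a minimum competitive score, each clause only needs to
  // produce hits whose own score covers whatever the other clauses cannot contribute,
  // so it can be asked to skip everything below that bar.
  static void propagate(float minCompetitiveScore, Scorer[] clauses) throws IOException {
    float[] maxScores = new float[clauses.length];
    float sum = 0;
    for (int i = 0; i < clauses.length; i++) {
      // segment-wide upper bound of this clause's score
      maxScores[i] = clauses[i].getMaxScore(DocIdSetIterator.NO_MORE_DOCS);
      sum += maxScores[i];
    }
    for (int i = 0; i < clauses.length; i++) {
      float otherMax = sum - maxScores[i];              // best the other clauses can do
      float required = minCompetitiveScore - otherMax;  // what this clause must bring
      if (required > 0) {
        clauses[i].setMinCompetitiveScore(required);
      }
    }
  }
}
{code}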

> Slowdown of nested boolean queries after LUCENE-8060
> 
>
> Key: LUCENE-8448
> URL: https://issues.apache.org/jira/browse/LUCENE-8448
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8448.patch
>
>
> Mike's nightly benchmarks revealed that disabling hit counts slowed down 
> nested boolean queries 
> http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html 
> http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html.
> We are probably not propagating max scores and/or blocks efficiently.






[jira] [Updated] (LUCENE-8448) Slowdown of nested boolean queries after LUCENE-8060

2018-08-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8448:
-
Attachment: LUCENE-8448.patch

> Slowdown of nested boolean queries after LUCENE-8060
> 
>
> Key: LUCENE-8448
> URL: https://issues.apache.org/jira/browse/LUCENE-8448
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8448.patch
>
>
> Mike's nightly benchmarks revealed that disabling hit counts slowed down 
> nested boolean queries 
> http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html 
> http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html.
> We are probably not propagating max scores and/or blocks efficiently.






[jira] [Commented] (LUCENE-8466) FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled

2018-08-27 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594050#comment-16594050
 ] 

Jim Ferenczi commented on LUCENE-8466:
--

Thanks Tomás and sorry Vish for not adding you in the first place.

> FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled
> --
>
> Key: LUCENE-8466
> URL: https://issues.apache.org/jira/browse/LUCENE-8466
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Critical
> Fix For: 7.5, master (8.0)
>
> Attachments: LUCENE-8466.patch
>
>
> This was reported by Vish Ramachandran at 
> https://markmail.org/message/w27h7n2isb5eogos. When deleting by term or 
> query, we record the term/query that is deleted and the current max doc id. 
> Deletes are later applied on flush by FrozenBufferedUpdates#apply*Deletes. 
> Unfortunately, this doesn't work when index sorting is enabled since 
> documents are renumbered between the time that the current max doc id is 
> computed and the time that deletes are applied.






[jira] [Resolved] (LUCENE-8466) FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled

2018-08-27 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8466.
--
   Resolution: Fixed
Fix Version/s: master (8.0)
   7.5

Thanks Adrien and Vish !

> FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled
> --
>
> Key: LUCENE-8466
> URL: https://issues.apache.org/jira/browse/LUCENE-8466
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Critical
> Fix For: 7.5, master (8.0)
>
> Attachments: LUCENE-8466.patch
>
>
> This was reported by Vish Ramachandran at 
> https://markmail.org/message/w27h7n2isb5eogos. When deleting by term or 
> query, we record the term/query that is deleted and the current max doc id. 
> Deletes are later applied on flush by FrozenBufferedUpdates#apply*Deletes. 
> Unfortunately, this doesn't work when index sorting is enabled since 
> documents are renumbered between the time that the current max doc id is 
> computed and the time that deletes are applied.






[jira] [Updated] (LUCENE-8466) FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled

2018-08-27 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8466:
-
Attachment: LUCENE-8466.patch

> FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled
> --
>
> Key: LUCENE-8466
> URL: https://issues.apache.org/jira/browse/LUCENE-8466
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Critical
> Attachments: LUCENE-8466.patch
>
>
> This was reported by Vish Ramachandran at 
> https://markmail.org/message/w27h7n2isb5eogos. When deleting by term or 
> query, we record the term/query that is deleted and the current max doc id. 
> Deletes are later applied on flush by FrozenBufferedUpdates#apply*Deletes. 
> Unfortunately, this doesn't work when index sorting is enabled since 
> documents are renumbered between the time that the current max doc id is 
> computed and the time that deletes are applied.






[jira] [Commented] (LUCENE-8466) FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled

2018-08-27 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593610#comment-16593610
 ] 

Jim Ferenczi commented on LUCENE-8466:
--

Here is a patch that fixes delete by query. It seems that the issue only 
affects FrozenBufferedUpdates#applyQueryDeletes and by extension any usage of 
IndexWriter#deleteDocuments(Query... queries).  Other types of 
deletions/updates (by term or doc values update) are not affected and work as 
expected with index sorting.
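For anyone who wants to reproduce the delete-by-query case, here is a hedged sketch of the kind of scenario involved (illustrative only, not the test from the patch): an index sort that renumbers documents at flush, followed by a delete-by-query recorded against the pre-flush doc ids.

{code:java}
import java.nio.file.Files;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SortedDeleteByQuerySketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
        // descending sort: documents are renumbered at flush, so their final doc ids
        // no longer reflect the order in which they were buffered
        .setIndexSort(new Sort(new SortField("num", SortField.Type.LONG, true)));
    try (Directory dir = FSDirectory.open(Files.createTempDirectory("sorted-dbq"));
         IndexWriter writer = new IndexWriter(dir, config)) {
      for (long i = 0; i < 10; i++) {
        Document doc = new Document();
        doc.add(new LongPoint("num", i));
        doc.add(new NumericDocValuesField("num", i)); // index sorting needs doc values
        writer.addDocument(doc);
      }
      // The buffered delete records the current max doc id, but documents are
      // renumbered by the index sort at flush, which is where applyQueryDeletes
      // could target the wrong documents before this fix.
      writer.deleteDocuments(LongPoint.newExactQuery("num", 3L));
      writer.commit();
    }
  }
}
{code}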

> FrozenBufferedUpdates#apply*Deletes is incorrect when index sorting is enabled
> --
>
> Key: LUCENE-8466
> URL: https://issues.apache.org/jira/browse/LUCENE-8466
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Critical
> Attachments: LUCENE-8466.patch
>
>
> This was reported by Vish Ramachandran at 
> https://markmail.org/message/w27h7n2isb5eogos. When deleting by term or 
> query, we record the term/query that is deleted and the current max doc id. 
> Deletes are later applied on flush by FrozenBufferedUpdates#apply*Deletes. 
> Unfortunately, this doesn't work when index sorting is enabled since 
> documents are renumbered between the time that the current max doc id is 
> computed and the time that deletes are applied.






[jira] [Commented] (LUCENE-8306) Allow iteration over the term positions of a Match

2018-07-20 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550491#comment-16550491
 ] 

Jim Ferenczi commented on LUCENE-8306:
--

+1, thanks [~romseygeek], the patch looks good. 

> Allow iteration over the term positions of a Match
> --
>
> Key: LUCENE-8306
> URL: https://issues.apache.org/jira/browse/LUCENE-8306
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8306.patch, LUCENE-8306.patch, LUCENE-8306.patch, 
> LUCENE-8306.patch
>
>
> For multi-term queries such as phrase queries, the matches API currently just 
> returns information about the span of the whole match.  It would be useful to 
> also expose information about the matching terms within the phrase.  The same 
> would apply to Spans and Interval queries.






[jira] [Resolved] (LUCENE-8402) TestPriorityQueue failures

2018-07-20 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8402.
--
Resolution: Fixed

I removed the invalid assertions, thanks [~thetaphi].

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>  Components: core/other
>Reporter: Jim Ferenczi
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Commented] (LUCENE-8401) Add PassageBuilder to help construct highlights using MatchesIterator

2018-07-18 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547486#comment-16547486
 ] 

Jim Ferenczi commented on LUCENE-8401:
--

I like the approach here. A few comments:
* The text extraction for each passage should be delayed to the end. It's 
costly and should be done on the best passages only (assuming that only the 
best passages are kept).
* This may be something for the PassageFormatter to handle, but I wonder how 
interleaved hits should be dealt with. Since we don't split intervals, we need to 
be careful to avoid highlighting the same term twice (from a term query and a 
phrase query, for instance). 
* findPassageStart should be bounded by the end offset of the previous passage.
* findPassageEnd should be bounded by the start offset of the next passage (a small 
sketch of both bounds follows this list).
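A tiny illustration of those two bounds (my own sketch; findPassageStart and findPassageEnd are the methods discussed above, while prevPassageEnd and nextPassageStart are assumed names for the neighbouring passage offsets):

{code:java}
// Clamp a candidate passage so that it never overlaps its neighbours.
int[] boundedPassage(int matchStartOffset, int matchEndOffset,
                     int prevPassageEnd, int nextPassageStart) {
  int start = Math.max(prevPassageEnd, findPassageStart(matchStartOffset));
  int end = Math.min(nextPassageStart, findPassageEnd(matchEndOffset));
  return new int[] {start, end};
}
{code}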


> Add PassageBuilder to help construct highlights using MatchesIterator
> -
>
> Key: LUCENE-8401
> URL: https://issues.apache.org/jira/browse/LUCENE-8401
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/highlighter
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8401.patch
>
>
> Jim and I discussed a while back the idea of adding highlighter components, 
> rather than a fully-fledged highlighter, which would allow users to build 
> their own specialised highlighters.  To that end, I'd like to add a 
> PassageBuilder class that uses the Matches API to break text up into passages 
> containing hits.






[jira] [Commented] (LUCENE-8306) Allow iteration over the term positions of a Match

2018-07-18 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548104#comment-16548104
 ] 

Jim Ferenczi commented on LUCENE-8306:
--

Would it be easier if getSubMatches returned null when the matches are already 
leaves? I see these sub-matches as a way to split a big interval, so it 
should be possible to split at multiple levels. I don't think we should assume 
that the first level is the top level and the next level is always terms; 
e.g. if the top-level match is already a term, getSubMatches should return 
null or EMPTY. 
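A small sketch of the contract I have in mind (using the getSubMatches() method from the attached patch; the "body" field and the surrounding iteration are just for illustration):

{code:java}
// 'matches' is a Matches instance, e.g. obtained from Weight#matches(context, doc)
MatchesIterator it = matches.getMatches("body");
while (it.next()) {
  MatchesIterator subs = it.getSubMatches();
  if (subs == null) {
    // the match is already a leaf (e.g. a plain term): nothing to split further
  } else {
    while (subs.next()) {
      // term-level positions within the enclosing interval
      int start = subs.startPosition();
      int end = subs.endPosition();
    }
  }
}
{code}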

> Allow iteration over the term positions of a Match
> --
>
> Key: LUCENE-8306
> URL: https://issues.apache.org/jira/browse/LUCENE-8306
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8306.patch, LUCENE-8306.patch, LUCENE-8306.patch
>
>
> For multi-term queries such as phrase queries, the matches API currently just 
> returns information about the span of the whole match.  It would be useful to 
> also expose information about the matching terms within the phrase.  The same 
> would apply to Spans and Interval queries.






[jira] [Comment Edited] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545634#comment-16545634
 ] 

Jim Ferenczi edited comment on LUCENE-8402 at 7/16/18 7:09 PM:
---

Since it is a deprecated function I don't think we should put it back, [~markh] 
can you explain the intent of forbidding the reuse of Integers in this test ? 
It doesn't seem to be required.


was (Author: jim.ferenczi):
Since it is a deprecated function I don't think we should put it back, [~markh] 
can you explain the intent of forbidding the reuse of Integers in this test. It 
doesn't seem to be required.

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Commented] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545634#comment-16545634
 ] 

Jim Ferenczi commented on LUCENE-8402:
--

Since it is a deprecated function I don't think we should put it back, [~markh] 
can you explain the intent of forbidding the reuse of Integers in this test. It 
doesn't seem to be required.

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Issue Comment Deleted] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8402:
-
Comment: was deleted

(was: Found by the build in 
https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2330
I muted the test on master and branch_7x for now.)

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Updated] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8402:
-
Attachment: LUCENE-8402.patch

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Commented] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545121#comment-16545121
 ] 

Jim Ferenczi commented on LUCENE-8402:
--

Here is a patch that removes the assertions around reused Integers.

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Commented] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545158#comment-16545158
 ] 

Jim Ferenczi commented on LUCENE-8402:
--

Found by the build in https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2330
I muted the test on master and branch_7x for now.

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Commented] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545159#comment-16545159
 ] 

Jim Ferenczi commented on LUCENE-8402:
--

Found by the build in https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2330
I muted the test on master and branch_7x for now.

> TestPriorityQueue failures
> --
>
> Key: LUCENE-8402
> URL: https://issues.apache.org/jira/browse/LUCENE-8402
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8402.patch
>
>
> Elastic CI found a couple of failures in TestPriorityQueue:
> {code}
> java.lang.AssertionError
>   at 
> __randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
>   at 
> org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
>   at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
>   at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
>   at 
> org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
> {code}
> It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
> It is due to https://issues.apache.org/jira/browse/LUCENE-8345 which removed 
> the deprecated call to "new Integer" despite the fact that the queue in the 
> tests (IntegerQueue#lessThan) does not allow to reuse Integers.






[jira] [Created] (LUCENE-8402) TestPriorityQueue failures

2018-07-16 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8402:


 Summary: TestPriorityQueue failures
 Key: LUCENE-8402
 URL: https://issues.apache.org/jira/browse/LUCENE-8402
 Project: Lucene - Core
  Issue Type: Test
Reporter: Jim Ferenczi


Elastic CI found a couple of failures in TestPriorityQueue:
{code}
java.lang.AssertionError
at 
__randomizedtesting.SeedInfo.seed([7116E1C3DFA51E99:7507110B3E9E9A3]:0)
at 
org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:36)
at 
org.apache.lucene.util.TestPriorityQueue$IntegerQueue.lessThan(TestPriorityQueue.java:28)
at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:264)
at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:141)
at 
org.apache.lucene.util.TestPriorityQueue.testIteratorRandom(TestPriorityQueue.java:241)
{code}

It can be reproduced with the following seed: -Dtests.seed=7116E1C3DFA51E99
It is due to https://issues.apache.org/jira/browse/LUCENE-8345, which removed 
the deprecated calls to "new Integer" even though the queue in the 
tests (IntegerQueue#lessThan) does not allow Integers to be reused.
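To make the cause concrete, here is a small illustration (my own example, not the test code): with autoboxing, small Integer values come from the shared Integer cache, so an identity-based "no reuse" check in lessThan can see the exact same object twice, whereas the removed "new Integer" calls always produced distinct instances.

{code:java}
public class IntegerCacheDemo {
  public static void main(String[] args) {
    Integer a = Integer.valueOf(7);  // what autoboxing does
    Integer b = Integer.valueOf(7);  // same cached instance for values in [-128, 127]
    System.out.println(a == b);      // true -> an identity-based reuse assertion trips

    Integer c = new Integer(7);      // deprecated, but always a fresh instance
    Integer d = new Integer(7);
    System.out.println(c == d);      // false -> the old test never hit the assertion
  }
}
{code}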






[jira] [Updated] (LUCENE-8204) ReqOptSumScorer should leverage sub scorers' per-block max scores

2018-07-25 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8204:
-
Attachment: LUCENE-8204.patch

> ReqOptSumScorer should leverage sub scorers' per-block max scores
> -
>
> Key: LUCENE-8204
> URL: https://issues.apache.org/jira/browse/LUCENE-8204
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8204.patch
>
>
> Currently it only looks at max scores on the entire segment. Given that 
> per-block max scores usually give lower upper bounds of the score, this 
> should help.
> This is especially important for LUCENE-8197 to work well since the main 
> query would typically be added as a MUST clauses of a boolean query while the 
> query that scores on features would be a SHOULD clause.






[jira] [Commented] (LUCENE-8204) ReqOptSumScorer should leverage sub scorers' per-block max scores

2018-07-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555729#comment-16555729
 ] 

Jim Ferenczi commented on LUCENE-8204:
--

Here is a patch that implements the block skipping logic. I had to modify the 
RandomApproximationQuery in the tests to make it compatible with 
advanceShallow. I also ran some benchmarks on wikimediumall; I used the 
HighLow, HighMed and HighHigh queries from the original benchmark and made the 
second clause optional to test this scorer:
{noformat}
Task        QPS lucene_baseline  StdDev  QPS lucene_candidate  StdDev  Pct diff
HighMed            37.51  (0.0%)          38.05  (0.0%)    1.4% (   1% -    1%)
HighHigh           11.02  (0.0%)          16.47  (0.0%)   49.5% (  49% -   49%)
HighLow           103.91  (0.0%)         219.08  (0.0%)  110.8% ( 110% -  110%)
{noformat}
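For readers of the benchmark, the core of the block skipping check can be sketched roughly as follows (an illustration only, not the attached patch; reqScorer, optScorer and minCompetitiveScore are assumed field names of the ReqOptSumScorer):

{code:java}
// Can the current block still produce a competitive hit? Compare the minimum
// competitive score against the sum of the per-block upper bounds of the
// required and optional clauses; if it cannot, skip ahead to the next block.
private boolean blockMayCompete(int doc) throws IOException {
  int upTo = Math.min(reqScorer.advanceShallow(doc), optScorer.advanceShallow(doc));
  float blockMaxScore = reqScorer.getMaxScore(upTo) + optScorer.getMaxScore(upTo);
  return blockMaxScore >= minCompetitiveScore;
}
{code}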

> ReqOptSumScorer should leverage sub scorers' per-block max scores
> -
>
> Key: LUCENE-8204
> URL: https://issues.apache.org/jira/browse/LUCENE-8204
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8204.patch
>
>
> Currently it only looks at max scores on the entire segment. Given that 
> per-block max scores usually give lower upper bounds of the score, this 
> should help.
> This is especially important for LUCENE-8197 to work well since the main 
> query would typically be added as a MUST clauses of a boolean query while the 
> query that scores on features would be a SHOULD clause.






[jira] [Commented] (LUCENE-8476) Optimizations in UserDictionary (KoreanAnalyzer)

2018-09-04 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603349#comment-16603349
 ] 

Jim Ferenczi commented on LUCENE-8476:
--

Thanks [~danmuzi] ! The new patch looks good, I'll commit shortly.

> Optimizations in UserDictionary (KoreanAnalyzer)
> 
>
> Key: LUCENE-8476
> URL: https://issues.apache.org/jira/browse/LUCENE-8476
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Namgyu Kim
>Priority: Major
>  Labels: optimization, patch-available
> Attachments: LUCENE-8476.patch, LUCENE-8476.patch
>
>
> -■ Bug fix-
> -1) BufferedReader's close method is not called.-  *(Wrong check)*
> {code:java}
> // Line 57 method
> public static UserDictionary open(Reader reader) throws IOException {
>   BufferedReader br = new BufferedReader(reader);
>   String line = null;
>   List<String> entries = new ArrayList<>();
>   // text + optional segmentations
>   while ((line = br.readLine()) != null) {
> ...
>   }
>   if (entries.isEmpty()) {
> return null;
>   } else {
> return new UserDictionary(entries);
>   }
> }{code}
> If you look at the code above, there is no close() method for the "br" 
> variable.
>  As far as I know, BufferedReader can cause a +memory leak+ if the close method is 
> not called.
>  So I changed the code as below.
> {code:java}
> // Line 57 method
> public static UserDictionary open(Reader reader) throws IOException {
>   String line = null;
>   List<String> entries = new ArrayList<>();
>   // text + optional segmentations
>   try (BufferedReader br = new BufferedReader(reader)) {
> while ((line = br.readLine()) != null) {
>   ...
> }
>   }
>   if (entries.isEmpty()) {
> return null;
>   } else {
> return new UserDictionary(entries);
>   }
> }
> {code}
> I solved this problem with the 
> "[try-with-resources|https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html]" 
> statement, available since Java 7.
>  
> ■ Optimizations
> 1) Change from Collections.sort to List.sort (UserDictionary constructor)
> {code:java}
> // Line 82 method
> private UserDictionary(List<String> entries) throws IOException {
>   final CharacterDefinition charDef = CharacterDefinition.getInstance();
>   Collections.sort(entries,
>   Comparator.comparing(e -> e.split("\\s+")[0]));
>   PositiveIntOutputs fstOutput = PositiveIntOutputs.getSingleton();
>   ...
> }{code}
> List.sort in Java 8 is known to be faster than existing Collections.sort. 
> ([http://ankitsambyal.blogspot.com/2014/03/difference-between-listsort-and.html])
>  So I changed the code below.
> {code:java}
> // Line 82 method
> private UserDictionary(List<String> entries) throws IOException {
>   final CharacterDefinition charDef = CharacterDefinition.getInstance();
>   entries.sort(Comparator.comparing(e -> e.split("\\s+")[0]));
>   PositiveIntOutputs fstOutput = PositiveIntOutputs.getSingleton();
>   ...
> }{code}
>  
> 2) Remove unnecessary null check (UserDictionary constructor)
> {code:java}
> // Line 82 method
> private UserDictionary(List<String> entries) throws IOException {
>   ...
>   String lastToken = null;
>   ...
>   for (String entry : entries) {
> String[] splits = entry.split("\\s+");
> String token = splits[0];
> if (lastToken != null && token.equals(lastToken)) {
>   continue;
> }
> char lastChar = entry.charAt(entry.length()-1);
>   ...
> }{code}
> Looking at this part of the code,
> {code:java}
> if (lastToken != null && token.equals(lastToken)) {
>   continue;
> }{code}
> A null check for lastToken is unnecessary, because the equals method of the String 
> class internally performs a null check.
>  So I changed the code as below.
> {code:java}
> // Line 82 method
> private UserDictionary(List<String> entries) throws IOException {
>   ...
>   String lastToken = null;
>   ...
>   for (String entry : entries) {
> String[] splits = entry.split("\\s+");
> String token = splits[0];
> if (token.equals(lastToken)) {
>   continue;
> }
> char lastChar = entry.charAt(entry.length()-1);
>   ...
> }{code}






[jira] [Commented] (SOLR-12655) Add Korean analyzer JAR file (NORI) and schema.xml example to Solr

2018-09-05 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604160#comment-16604160
 ] 

Jim Ferenczi commented on SOLR-12655:
-

[~y100421] we use the mecab-ko-dic-2.0.3-20170922 version for the build. 
mecab-ko-dic-2.0.1-20150920 has a different list of POS tags (the UNT tag is not 
present in 2.0.3) and some POS tags have a different id, so you'll need to 
modify the source to fix the build. If you add UNT to the list of POS tags and 
change line 35 of the UnknownDictionaryBuilder to:

{code:java}
private static final String NGRAM_DICTIONARY_ENTRY = 
"NGRAM,1801,3561,3668,SY,*,*,*,*,*,*,*";
{code}

... the build should work. We need this entry to annotate the ngrams that we 
add when a word is not recognized, but the leftId and rightId for the SY POS tag 
changed between 2.0.1 and 2.0.3. We could apply this switch automatically, 
but can you explain why you need to use the old version of the dictionary 
instead of the new one?


> Add Korean analyzer JAR file (NORI) and schema.xml example to Solr
> --
>
> Key: SOLR-12655
> URL: https://issues.apache.org/jira/browse/SOLR-12655
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build, Schema and Analysis
>Affects Versions: 7.4
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: SOLR-12655.patch, image-2018-09-05-17-42-09-983.png, 
> screenshot-1.png
>
>
> In Lucene 7.4 we added the NORI analyzer for Korean. In contrast to Kuromoji, 
> the JAR file is missing in the distribution (the analyzers-kuromoji is part 
> of main solr distribution). We should also add an updated/new "text_ko" field 
> in the default schema.
> See also SOLR-12255 about the documentation.






[jira] [Commented] (LUCENE-8382) Don't propagate calls to setMinCompetitiveScore in MultiCollector

2018-07-04 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532681#comment-16532681
 ] 

Jim Ferenczi commented on LUCENE-8382:
--

+1

> Don't propagate calls to setMinCompetitiveScore in MultiCollector
> -
>
> Key: LUCENE-8382
> URL: https://issues.apache.org/jira/browse/LUCENE-8382
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8382.patch
>
>
> Currently it is propagated, which means that one collector can hide hits to 
> another collector. We could try to reconcile min scores across collectors to 
> take the maximum min score, but I don't think it's worth the effort as 
> combinations are most often going to include one collector that needs top 
> hits and another one that doesn't need scores at all (eg. facets or total 
> hits) rather than another collector that also needs top scoring documents.






[jira] [Closed] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2018-01-24 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi closed LUCENE-7638.


> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Fix For: 6.5
>
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains tokens with a PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattices to be more precise) where 
> synonyms can span multiple terms. 
> Currently the graph query is built by visiting all the paths of the graph 
> TokenStream. For instance, if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder produces two paths:
> "new york city", "ny city"
> This can quickly explode when the number of multi-term synonyms increases. 
> The query "ny ny" for instance would produce 4 paths, and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all-paths solution.
> The patch transforms multi-term synonyms into a graph query for each 
> intersection in the graph. This is not done in this patch, but we could also 
> create a specialized query that gives equivalent scores to multi-term 
> synonyms, like the SynonymQuery does for single-term synonyms.
> For phrase queries this patch does not change the current behavior, but we could 
> also use the new method to create an optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> multi-term synonyms are present in a single request. Could you take a look?
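To make the proposed query shape concrete, the "+((+new +york) ny) +city" form above could be built directly with the BooleanQuery API roughly like this (a hedged sketch; the "body" field name is assumed):

{code:java}
Query newYork = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "new")), BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term("body", "york")), BooleanClause.Occur.MUST)
    .build();
Query synonyms = new BooleanQuery.Builder()               // ((+new +york) ny)
    .add(newYork, BooleanClause.Occur.SHOULD)
    .add(new TermQuery(new Term("body", "ny")), BooleanClause.Occur.SHOULD)
    .build();
Query query = new BooleanQuery.Builder()                  // +((+new +york) ny) +city
    .add(synonyms, BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term("body", "city")), BooleanClause.Occur.MUST)
    .build();
{code}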






[jira] [Closed] (LUCENE-7699) Apply graph articulation points optimization to phrase graph queries

2018-01-24 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi closed LUCENE-7699.


> Apply graph articulation points optimization to phrase graph queries
> 
>
> Key: LUCENE-7699
> URL: https://issues.apache.org/jira/browse/LUCENE-7699
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Matt Weber
>Assignee: Jim Ferenczi
>Priority: Major
> Fix For: 6.5
>
> Attachments: LUCENE-7699.patch, LUCENE-7699.patch
>
>
> Follow-up to LUCENE-7638 that applies the same articulation point logic to 
> graph phrases using span queries.






[jira] [Assigned] (LUCENE-8137) GraphTokenStreamFiniteStrings does not handle position inc > 1 in multi-word synonyms

2018-01-24 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi reassigned LUCENE-8137:


Assignee: Jim Ferenczi

> GraphTokenStreamFiniteStrings does not handle position inc > 1 in multi-word 
> synonyms
> 
>
> Key: LUCENE-8137
> URL: https://issues.apache.org/jira/browse/LUCENE-8137
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: master (8.0), 7.2.1
>Reporter: Jim Ferenczi
>Assignee: Jim Ferenczi
>Priority: Major
>
> The automaton built for graph queries that contain multiple multi-word 
> synonyms does not handle gaps if they appear in the middle of a multi-word 
> synonym. In such case the token next to the gap is considered as part of the 
> multi-word synonym. 
> Stop words that appear before or after multi-word synonyms are handled 
> correctly in the current version but the synonym rule "part of speech, pos" 
> for instance does not create the expected query if "of" is removed by a 
> filter that is set after the synonym_graph.  One solution would be to reuse 
> TokenStreamToAutomaton (with minor changes to add the ability to create token 
> transitions rather than chars) which preserves gaps (as a transition) in the 
> produced automaton.






[jira] [Created] (LUCENE-8137) GraphTokenStreamFiniteStrings does not handle position inc > 1 in multi-word synonyms

2018-01-24 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8137:


 Summary: GraphTokenStreamFiniteStrings does not handle position 
inc > 1 in multi-word synonyms
 Key: LUCENE-8137
 URL: https://issues.apache.org/jira/browse/LUCENE-8137
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 7.2.1, master (8.0)
Reporter: Jim Ferenczi


The automaton built for graph queries that contain multiple multi-word synonyms 
does not handle gaps if they appear in the middle of a multi-word synonym. In 
such a case the token next to the gap is considered part of the multi-word 
synonym. 

Stop words that appear before or after multi-word synonyms are handled 
correctly in the current version, but the synonym rule "part of speech, pos" for 
instance does not create the expected query if "of" is removed by a filter that 
is applied after the synonym_graph. One solution would be to reuse 
TokenStreamToAutomaton (with minor changes to add the ability to create token 
transitions rather than char transitions), which preserves gaps (as a transition) in the 
produced automaton.






[jira] [Resolved] (LUCENE-8199) TestBackwardsCompatibility#testAllVersionsTested should fail if the version of a bwc index is missing

2018-03-12 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8199.
--
Resolution: Won't Fix

> TestBackwardsCompatibility#testAllVersionsTested should fail if the version 
> of a bwc index is missing
> -
>
> Key: LUCENE-8199
> URL: https://issues.apache.org/jira/browse/LUCENE-8199
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Major
>
> There is a leniency in the test that makes the test pass if a bwc index 
> doesn't have a version associated to it:
> {code:java}
> // we could be missing up to 1 file, which may be due to a release that is in 
> progress
> if (missingFiles.size() <= 1 && extraFiles.isEmpty()) {
>   // success
>   return;
> }
> {code}
> I think this test can be removed since we add the new released version in the 
> non-release branches only after the release.  Then we'd need to add the 
> released version *and*
>  the BWC test in the non-release branches in the same commit so that the test 
> never fails. 






[jira] [Commented] (LUCENE-8199) TestBackwardsCompatibility#testAllVersionsTested should fail if the version of a bwc index is missing

2018-03-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395179#comment-16395179
 ] 

Jim Ferenczi commented on LUCENE-8199:
--

Argh, scratch that, this is only true for bugfix releases. For minor and major 
releases we have two versions that are not yet released in the non-release branches, 
so the leniency is needed: there is no way to know whether a minor release that is 
not the latest in its major line has been released or not. The release howto has 
been updated with a specific section about adding versions for bugfix releases in 
the non-release branches. I don't have a better solution, so I'll close this issue 
for now.

> TestBackwardsCompatibility#testAllVersionsTested should fail if the version 
> of a bwc index is missing
> -
>
> Key: LUCENE-8199
> URL: https://issues.apache.org/jira/browse/LUCENE-8199
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Major
>
> There is a leniency in the test that makes the test pass if a bwc index 
> doesn't have a version associated to it:
> {code:java}
> // we could be missing up to 1 file, which may be due to a release that is in 
> progress
> if (missingFiles.size() <= 1 && extraFiles.isEmpty()) {
>   // success
>   return;
> }
> {code}
> I think this test can be removed since we add the new released version in the 
> non-release branches only after the release.  Then we'd need to add the 
> released version *and*
>  the BWC test in the non-release branches in the same commit so that the test 
> never fails. 






[jira] [Closed] (LUCENE-8199) TestBackwardsCompatibility#testAllVersionsTested should fail if the version of a bwc index is missing

2018-03-12 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi closed LUCENE-8199.


> TestBackwardsCompatibility#testAllVersionsTested should fail if the version 
> of a bwc index is missing
> -
>
> Key: LUCENE-8199
> URL: https://issues.apache.org/jira/browse/LUCENE-8199
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jim Ferenczi
>Priority: Major
>
> There is a leniency in the test that makes the test pass if a bwc index 
> doesn't have a version associated to it:
> {code:java}
> // we could be missing up to 1 file, which may be due to a release that is in 
> progress
> if (missingFiles.size() <= 1 && extraFiles.isEmpty()) {
>   // success
>   return;
> }
> {code}
> I think this test can be removed since we add the new released version in the 
> non-release branches only after the release.  Then we'd need to add the 
> released version *and*
>  the BWC test in the non-release branches in the same commit so that the test 
> never fails. 






[jira] [Created] (LUCENE-8199) TestBackwardsCompatibility#testAllVersionsTested should fail if the version of a bwc index is missing

2018-03-12 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8199:


 Summary: TestBackwardsCompatibility#testAllVersionsTested should 
fail if the version of a bwc index is missing
 Key: LUCENE-8199
 URL: https://issues.apache.org/jira/browse/LUCENE-8199
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Jim Ferenczi


There is a leniency in the test that makes it pass if a bwc index doesn't 
have a version associated with it:

{code:java}
// we could be missing up to 1 file, which may be due to a release that is in 
progress
if (missingFiles.size() <= 1 && extraFiles.isEmpty()) {
  // success
  return;
}
{code}

I think this leniency can be removed since we add the newly released version to the 
non-release branches only after the release. We'd then need to add the 
released version *and* the bwc test index to the non-release branches in the same 
commit so that the test never fails. 






[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

2018-03-08 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391723#comment-16391723
 ] 

Jim Ferenczi commented on LUCENE-8196:
--

{quote}
I was a bit annoyed to see the field masking hack but actually those intervals 
source do not need term statistics which makes the hack less horrible. Could 
you still document it to make sure users are aware it is a hack and explain it 
which circumstances it might be ok?
{quote}
 
 I think that the proposed API should be more restrictive regarding the 
targeted field. Could we restrict the IntervalsSource to work on a single field? 
Something like:

{code:java}
public abstract class IntervalsSource {
 protected final String field;

 public IntervalsSource(String field) {
   this.field = field;
 }

 public abstract IntervalIterator intervals(LeafReaderContext ctx) throws 
IOException;
...
{code}
... and then we can check in each implementation that the sources are all 
targeting the same field (a sketch of such a check follows below).
I understand that it might be powerful to mix multiple fields in an interval 
query but with the current API that seems to be the norm rather than the 
exception. We can add the field masking hack afterward, but for the first 
iteration I think it's better to focus on the main use case for this new query, 
which is to provide a way to find the minimum intervals in a single field. 
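
A purely illustrative sketch of such a same-field check, assuming the 
field-restricted IntervalsSource sketched above (the helper name and shape are 
made up, not the patch's API):

{code:java}
// Returns the common field of all sub-sources, or throws if they disagree;
// composite sources could call this from their constructor.
static String checkSameField(List<IntervalsSource> sources) {
  String field = sources.get(0).field;
  for (IntervalsSource source : sources) {
    if (source.field.equals(field) == false) {
      throw new IllegalArgumentException("all interval sources must target the field " + field);
    }
  }
  return field;
}
{code}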

Regarding the score of the intervals, it seems that the patch uses the inverse 
length of the interval rather than the slop within the interval, as the sloppy 
phrase scorer does. Could we compute the total slop of the current interval (as 
the sum of the slops of the interval sources that compose it) and use its 
inverse to score each interval? This would make different interval queries more 
comparable in terms of score, since an interval with few terms and a slop > 0 
would score less than one with more terms but no slop.
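
A tiny sketch of this suggestion, with hypothetical names that are not taken 
from the patch: each sub-source reports the slop it contributed, and the score 
factor is the inverse of the total (using 1 / (1 + totalSlop) so that a 
zero-slop interval gets the maximal weight):

{code:java}
// Sum the slop contributed by each sub-source and return the sloppy weight,
// so an interval with more terms but no gaps is not penalised compared to a
// shorter interval that contains gaps.
static float slopScore(int[] subSourceSlops) {
  int totalSlop = 0;
  for (int slop : subSourceSlops) {
    totalSlop += slop;
  }
  return 1f / (1f + totalSlop);
}
{code}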

I'll look deeper into the implementation of the different queries, but I like 
the simplicity of the patch and the fact that there is a paper with a proof for 
each of them.



> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> -
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8196.patch
>
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

2018-03-09 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393188#comment-16393188
 ] 

Jim Ferenczi commented on LUCENE-8196:
--

{quote}
I'd rather keep the API as it is, with the field being passed to IntervalQuery 
and then recursing down the IntervalSource tree.  Otherwise you end up having 
to declare the field on all the created sources, which seems redundant.  I've 
removed the cross-field hack entirely for the moment.
{quote}

+1 to remove the cross-field hack, thanks. Regarding the API, it's ok since 
IntervalQuery limits all sources to one field, so I am fine with that (I 
misunderstood how the IntervalQuery can be used).

> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> -
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8196.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-10 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432078#comment-16432078
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I attached a new patch that fixes an issue with the offsets of compound nouns. 
Currently the patch outputs a single path and can also keep the original 
compound as well as its decompounded tokens. I think we can add the N-best 
paths in a follow-up.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-10 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435420#comment-16435420
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Thanks Robert.
I attached a new patch that changes the enum to attach a description to each 
tag and reflects the description in the javadoc comments.
The toString reflection output now includes the description of the POS tag:
{noformat}
KoreanTokenizer@22f9baea term=평창,bytes=[ed 8f 89 ec b0 
bd],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=MORPHEME,leftPOS=NNP(Proper
 Noun),rightPOS=NNP(Proper Noun),morphemes=null,reading=null
{noformat}
... and the compounds are correctly rendered:
{noformat}
KoreanTokenizer@292528fd term=가락지나물,bytes=[ea b0 80 eb 9d bd ec a7 80 eb 82 98 
eb ac 
bc],startOffset=0,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1,posType=COMPOUND,leftPOS=NNG(General
 Noun),rightPOS=NNG(General Noun),morphemes=가락지/NNG(General 
Noun)+나물/NNG(General Noun),reading=null
{noformat}

I also changed the format for the pre-analysis tokens; they are now compressed 
using the same technique as the compounds, which gives another 2MB improvement 
over the last patch.
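
As a hedged illustration of what attaching a description to each tag could look 
like (only the two tags visible in the output above are shown; the enum and 
accessor names are assumptions, not necessarily the patch's):

{code:java}
// Each part-of-speech tag carries a human-readable description that can be
// surfaced in the javadocs and in the attribute reflection output.
public enum Tag {
  NNG("General Noun"),
  NNP("Proper Noun");

  private final String description;

  Tag(String description) {
    this.description = description;
  }

  public String description() {
    return description;
  }
}
{code}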



> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436001#comment-16436001
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I agree, this will also simplify the understanding of these ctors. I'll remove 
them and keep only the default ctors and the one that takes everything.
Regarding why the KoreanTokenizer takes stop tags: it is done to simplify the 
removal of tokens when we keep compounds, since we need to set the position 
length of the compound token without counting the tokens that should be 
removed.
Otherwise the stop tags filter would have to handle position length when it 
removes a token, and I find it simpler to do it directly in the Tokenizer, 
especially if we add support for keeping the N-best paths in a follow-up.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436019#comment-16436019
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

No, because FilteringTokenFilter doesn't handle positionLength: if it removes a 
token from a compound, it needs to change the posLength of the original 
compound. I tried to write something to handle this case in the filtering token 
filter but it's not trivial and requires a lot of code, so I chose the simple 
path of removing the tokens directly in the tokenizer. 

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 

[jira] [Created] (LUCENE-8250) Should FilteringTokenFilter handle positionLength

2018-04-12 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8250:


 Summary: Should FilteringTokenFilter handle positionLength
 Key: LUCENE-8250
 URL: https://issues.apache.org/jira/browse/LUCENE-8250
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


FilteringTokenFilter does not handle the position length graph attribute when 
removing a token from the stream. This doesn't work well with graph token 
streams that set position length, since removing a token from the stream can 
invalidate the position length set on the previous tokens. 
This issue was first discussed in 
https://issues.apache.org/jira/browse/LUCENE-4065 but that issue has a 
different purpose, which is why I am opening a new one here.
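
To illustrate with a hypothetical compound token "AB" decomposed into "A" and 
"B" (all three tokens emitted, the compound spanning two positions):

{noformat}
position 0: AB (posLength=2)   <- spans both positions
position 0: A  (posLength=1)
position 1: B  (posLength=1)   <- if a filter removes B, the posLength of AB
                                  must shrink to 1, otherwise it points past
                                  the end of the token graph
{noformat}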



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436042#comment-16436042
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I think that the Japanese analyzer has the same issue, and it's even worse 
since it can output multiple paths. If we have a compound "AB" that is 
decomposed into "A" and "B", and we remove "B", we need to change the posLength 
of "AB" to 1 (instead of 2).

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435922#comment-16435922
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Sure, I added two more ctors in the last patch, one with no args and one with 
only the AttributeFactory.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436066#comment-16436066
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

OK, I'll restore the KoreanPartOfSpeechStopFilter then, and we can discuss 
LUCENE-4065 separately.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435802#comment-16435802
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Right, I changed the Analyzer but not the Tokenizer. I attached a new patch 
that adds two more ctors that use the default parameters.
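
For illustration, a minimal sketch of what using such a default constructor could look like; the no-arg constructor and package shown here are assumptions based on this discussion (and the shape of the final module), not necessarily this exact patch:
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KoreanTokenizerDefaultsDemo {
  public static void main(String[] args) throws Exception {
    // Assumed no-arg constructor: built-in dictionary, no user dictionary, default decompound mode.
    try (KoreanTokenizer tokenizer = new KoreanTokenizer()) {
      tokenizer.setReader(new StringReader("삼성전자"));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term.toString());
      }
      tokenizer.end();
    }
  }
}
{code}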

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal 

[jira] [Comment Edited] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436123#comment-16436123
 ] 

Jim Ferenczi edited comment on LUCENE-8231 at 4/12/18 6:42 PM:
---

I attached a new patch that restores the KoreanPartOfSpeechFilter and changes 
the ctors for the KoreanTokenizer.

I also opened https://issues.apache.org/jira/browse/LUCENE-8250 for the stop 
filter issue with position length.


was (Author: jim.ferenczi):
I attached a new patch that restores the KoreanPartOfSpeechFilter and change 
the ctors for the KoreanTokenizer.

I also opened https://issues.apache.org/jira/browse/LUCENE-8250 for the stop 
filter issue with position length.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436123#comment-16436123
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I attached a new patch that restores the KoreanPartOfSpeechFilter and changes 
the ctors for the KoreanTokenizer.

I also opened https://issues.apache.org/jira/browse/LUCENE-8250 for the stop 
filter issue with position length.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> 

[jira] [Updated] (LUCENE-8250) Should FilteringTokenFilter handle positionLength

2018-04-13 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8250:
-
Attachment: LUCENE-8250.patch

> Should FilteringTokenFilter handle positionLength
> -
>
> Key: LUCENE-8250
> URL: https://issues.apache.org/jira/browse/LUCENE-8250
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8250.patch
>
>
> FilteringTokenFilter does not handle the position length graph attribute when 
> removing a token from the stream. This doesn't work well with graph token 
> stream that sets position length since removing a token from the stream can 
> invalidate the position length set on the previous tokens. 
> This issue was first discussed in 
> https://issues.apache.org/jira/browse/LUCENE-4065 but it has a different 
> purpose which is why I am opening a new issue here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8250) Should FilteringTokenFilter handle positionLength

2018-04-13 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436960#comment-16436960
 ] 

Jim Ferenczi commented on LUCENE-8250:
--

I attached a small test that I hope illustrates the issue. The synonym rule is 
"twd, the walking dead, the zombie show", and removing "the" from the stream 
after the synonym graph makes "zombie show" a path that follows "walking", so 
the output of the graph is "twd, walking dead, walking zombie show". It's 
unclear to me whether the FilteringTokenFilter is doing the right thing here. I 
added the dot output of TokenStreamToAutomaton in the test; this class is able 
to fill the hole when a stop filter removes a token, but in this case I don't 
see how we can infer that "zombie show" does not come after "walking".
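
For reference, a minimal sketch of the analysis chain being discussed (this is not the attached test; the classes are the standard analyzers-common ones and the wiring is an assumption for illustration):
{code:java}
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class StopFilterAfterSynonymGraphDemo {
  public static void main(String[] args) throws Exception {
    // The synonym rule from the comment above.
    SolrSynonymParser parser = new SolrSynonymParser(true, true, new WhitespaceAnalyzer());
    parser.parse(new StringReader("twd, the walking dead, the zombie show"));
    SynonymMap synonyms = parser.build();

    // Synonym graph followed by a StopFilter (a FilteringTokenFilter) that removes "the".
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        TokenStream stream = new SynonymGraphFilter(tokenizer, synonyms, true);
        stream = new StopFilter(stream, new CharArraySet(Arrays.asList("the"), true));
        return new TokenStreamComponents(tokenizer, stream);
      }
    };

    // Print position increments and lengths to see how removing "the" distorts the graph.
    try (TokenStream ts = analyzer.tokenStream("field", "twd")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
      PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term + " posInc=" + posInc.getPositionIncrement()
            + " posLen=" + posLen.getPositionLength());
      }
      ts.end();
    }
  }
}
{code}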

> Should FilteringTokenFilter handle positionLength
> -
>
> Key: LUCENE-8250
> URL: https://issues.apache.org/jira/browse/LUCENE-8250
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8250.patch
>
>
> FilteringTokenFilter does not handle the position length graph attribute when 
> removing a token from the stream. This doesn't work well with graph token 
> stream that sets position length since removing a token from the stream can 
> invalidate the position length set on the previous tokens. 
> This issue was first discussed in 
> https://issues.apache.org/jira/browse/LUCENE-4065 but it has a different 
> purpose which is why I am opening a new issue here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-13 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437065#comment-16437065
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Thanks a lot Robert! Any objections to backporting to 7.x?

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-12 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435755#comment-16435755
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I attached a new patch that passes the precommit checks. The javadocs look fine; 
all part-of-speech tags have a description attached. 
DEFAULT_STOP_TAGS is now in the Tokenizer and is used by default when no 
tags are specified.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since 

[jira] [Resolved] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-13 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8231.
--
   Resolution: Fixed
Fix Version/s: master (8.0)
   7.4

Thanks Robert and Uwe.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also 

[jira] [Commented] (LUCENE-8255) Can we make index sorting work for soft deletes

2018-04-16 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439948#comment-16439948
 ] 

Jim Ferenczi commented on LUCENE-8255:
--

{quote}
This also means that sorting such a segment on merge has a significant 
overhead. (I hope Jim Ferenczi can shed some light on it how much we would have 
to expect). 
{quote}

Now that we apply index sorting at flush time we guarantee that merges can 
perform a "merge sort" of already sorted segments. Before 
https://issues.apache.org/jira/browse/LUCENE-7579 we had to re-sort segments 
but only those that were produced by a flush. If we accept updatable values in 
the index sort fields this means that the worst case for a merge would be to 
re-sort all segments. The only way to sort a segment currently is to use a 
SortingLeafReader, which basically holds everything in memory (doc values, 
postings, ...). This worked fine before LUCENE-7579 because flushed segments 
are supposed to be small, but if we need to sort big segments as well we'd need 
a more specialized class that doesn't load everything in memory. 
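
(For readers less familiar with the feature, a minimal sketch of the index-sorting setup this refers to; the field name and sort are made up for illustration.)
{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexSortingSetup {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Since LUCENE-7579 segments are sorted by this sort at flush time, so merges only
    // need a merge sort of already-sorted segments. Making the sort field an updatable
    // doc-values field (as soft deletes would) breaks that invariant for updated segments.
    config.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      Document doc = new Document();
      doc.add(new NumericDocValuesField("timestamp", 42L));
      writer.addDocument(doc);
    }
  }
}
{code}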

{quote}
We also need to add some special casing since we use "merge sorting" and can't 
go backwards in doc ID which would be violated if a segment received updates. 
(cc Adrien Grand)
{quote}

I think it's fine if we are able to produce a "sorted" view of the segments and 
then use a "merge sort" on top of these views, like we used to do before 
LUCENE-7579. The merges of stored fields and of the other fields should be 
equivalent to a merge with deleted documents, but the merge of points would be 
problematic. 
Currently for single-dimension points we produce the new bkd tree from a sorted 
stream so the producer doesn't need to re-sort all the values. This is one of 
the main reasons for the speed-up that you can see here:
https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput
 (annotation V).
It is a bit counterintuitive that we would need to re-sort the points values since 
they are already sorted, but the reason is that we need to ensure that documents 
with the same value are sorted by doc id in the tree, and this assumption is 
invalidated if we need to remap doc ids during the merge. I guess it would be 
possible to optimize this by only re-sorting the documents that share the same 
value, so that's not a blocker but something we need to keep in mind.

Regarding the overhead, if we use index sorting as it is (or as it was), it 
should be at least the same as what is shown at 
https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput
 before annotation V, and I expect it to be higher since all segments in a merge 
would need re-sorting. 


> Can we make index sorting work for soft deletes
> ---
>
> Key: LUCENE-8255
> URL: https://issues.apache.org/jira/browse/LUCENE-8255
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Simon Willnauer
>Priority: Major
>
> I phrased this as a question since it's mainly a discussion. I spoke to 
> [~rcmuir] on a couple of occasions about making index sorting work for soft 
> deletes. The issue that prevents this is that soft deletes use updateable DV 
> to mark docs as deleted. This basically means that a sorted segment is not 
> guaranteed to be sorted if it has received any updates. This also means that 
> sorting such a segment on merge has a significant overhead. (I hope [~jimczi] 
> can shed some light on it how much we would have to expect). We also need to 
> add some special casing since we use "merge sorting" and can't go backwards 
> in doc ID which would be violated if a segment received updates. (cc 
> [~jpountz])
> The main purpose of doing this is that "soft deleted" documents would either 
> be at the end or in the beginning of the segment such that compression is 
> better if these docs have larger retention policies. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

2018-04-24 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449409#comment-16449409
 ] 

Jim Ferenczi commented on LUCENE-8196:
--

I don't think we should prevent anything ;). *unordered* is a conjunction 
operator: it should match if all terms match (which is the case in your 
example), so these results are expected IMO. Maybe we should rename *unordered* 
to *and* in order to avoid confusion?
If you want to match the same term within a max width, the ordered query works 
fine:
{code:java}
Query q = new IntervalQuery(field, Intervals.maxwidth(2,
    Intervals.ordered(Intervals.term("w3"), Intervals.term("w3"))));
{code}

[~romseygeek] while I was playing with *unordered* I realized that we don't 
protect against sources that match but don't have intervals.
For instance:
{code:java}
Query q = new IntervalQuery(query, Intervals.unordered(Intervals.term("w2"),
    Intervals.ordered(Intervals.term("w3"), Intervals.term("w3"))));
{code}
does not work because the *unordered* query doesn't check whether the sub source 
has intervals when it adds it to the queue. 
I attached a patch that fixes this issue and added some tests that fail without 
the fix. Can you take a look?
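
For context, a rough end-to-end sketch of exercising such a query (package names and the index setup are assumptions based on the sandbox module at the time, not part of the attached patch):
{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.intervals.IntervalQuery;
import org.apache.lucene.search.intervals.Intervals;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IntervalQueryDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("field", "w1 w2 w3 w4 w3", Store.NO));
      writer.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // An unordered source whose second clause is an ordered source over the same repeated term.
      Query q = new IntervalQuery("field",
          Intervals.unordered(Intervals.term("w2"),
              Intervals.ordered(Intervals.term("w3"), Intervals.term("w3"))));
      System.out.println("hits: " + searcher.count(q));
    }
  }
}
{code}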


> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> -
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch, 
> LUCENE-8196.patch, LUCENE-8196.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

2018-04-24 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8196:
-
Attachment: LUCENE-8196-debug.patch

> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> -
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8196-debug.patch, LUCENE-8196.patch, 
> LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

2018-04-24 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450039#comment-16450039
 ] 

Jim Ferenczi commented on LUCENE-8196:
--

I don't think an operator can prevent anything here: a query for 
*Intervals.ordered(Intervals.term("w3"), Intervals.term("w3"))* should always 
return all intervals of the term "w3" (it will not interleave successive 
intervals of "w3"). [~mattweber] why do you think that this "scenario" should 
be prevented? When I do "foo AND foo" I don't expect it to match only documents 
that have "foo" twice.

> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> -
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8196-debug.patch, LUCENE-8196.patch, 
> LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-30 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420467#comment-16420467
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I attached a new patch that adds a better compression for the compounds (saves 
1MB on the total size) and fix the handling of the user dictionary. It is now 
possible to add common nouns or compounds in the user dictionary. I also added 
more tests to make sure that the dictionary contains valid data. In terms of 
feature I think it's ready, now I'll focus on cleanup and adding more tests.

[~thetaphi], thanks for volunteering ! I think it would be nice to share some 
code with the Kuromoji but I agree with Robert, this is a lot of work and I 
didn't want to change the format and the processing of the Kuromoji too early. 
We can still reevaluate the feasibility of merging some code when we have a 
better idea of the final format for this analyzer.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary 

[jira] [Comment Edited] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-30 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420467#comment-16420467
 ] 

Jim Ferenczi edited comment on LUCENE-8231 at 3/30/18 12:59 PM:


I attached a new patch that adds a better compression for the compounds (saves 
1MB on the total size) and fixes the handling of the user dictionary. It is now 
possible to add common nouns or compounds in the user dictionary. I also added 
more tests to make sure that the dictionary contains valid data. In terms of 
feature I think it's ready, now I'll focus on cleanup and adding more tests.

[~thetaphi], thanks for volunteering ! I think it would be nice to share some 
code with the Kuromoji but I agree with Robert, this is a lot of work and I 
didn't want to change the format and the processing of the Kuromoji too early. 
We can still reevaluate the feasibility of merging some code when we have a 
better idea of the final format for this analyzer.


was (Author: jim.ferenczi):
I attached a new patch that adds a better compression for the compounds (saves 
1MB on the total size) and fix the handling of the user dictionary. It is now 
possible to add common nouns or compounds in the user dictionary. I also added 
more tests to make sure that the dictionary contains valid data. In terms of 
feature I think it's ready, now I'll focus on cleanup and adding more tests.

[~thetaphi], thanks for volunteering ! I think it would be nice to share some 
code with the Kuromoji but I agree with Robert, this is a lot of work and I 
didn't want to change the format and the processing of the Kuromoji too early. 
We can still reevaluate the feasibility of merging some code when we have a 
better idea of the final format for this analyzer.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-30 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional 

[jira] [Updated] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-28 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: (was: LUCENE-8231.patch)

> Godori, a Korean analyzer based on mecab-ko-dic
> ---
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] [Updated] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-28 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Godori, a Korean analyzer based on mecab-ko-dic
> ---
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-28 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8231:


 Summary: Godori, a Korean analyzer based on mecab-ko-dic
 Key: LUCENE-8231
 URL: https://issues.apache.org/jira/browse/LUCENE-8231
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Jim Ferenczi


There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
It is available under an Apache license here:
https://bitbucket.org/eunjeon/mecab-ko-dic

This dictionary was built with MeCab; it defines a format for the features 
adapted to the Korean language.
Since the Kuromoji tokenizer uses the same format for the morphological 
analysis (left cost + right cost + word cost), I tried to adapt the module to 
handle Korean with the mecab-ko-dic. I started with a POC that copies the 
Kuromoji module and adapts it for the mecab-ko-dic.
I used the same classes to build and read the dictionary, but I had to make some 
modifications to handle the differences from the IPADIC and Japanese. 
The resulting binary dictionary takes 28MB on disk; it's bigger than the IPADIC, 
mainly because the source is bigger and there are a lot of compound and inflect 
entries that define a group of terms and the segmentation that can be applied. 
I attached the patch that contains this new Korean module called godori (a 
popular card game in Korea). It is an adaptation of the Kuromoji module, so 
currently the two modules don't share any code. I wanted to validate the 
approach first and check the relevancy of the results. I don't speak Korean, 
so I used the relevancy tests that were added for another Korean tokenizer 
(https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
against mecab-ko, which is the official fork of MeCab that uses the mecab-ko-dic.
I had to simplify the JapaneseTokenizer: my version removes the nBest output 
and the decomposition of overly long tokens. I also 
modified the handling of whitespaces since they are important in Korean. 
Whitespaces that appear before a term are attached to that term and this
information is used to compute a penalty based on the Part of Speech of the 
token. The penalty cost is a feature added to mecab-ko to handle 
morphemes that should not appear after a morpheme and is described in the 
mecab-ko page:
https://bitbucket.org/eunjeon/mecab-ko
Ignoring whitespaces is also more in line with the official MeCab library, which 
attaches the whitespaces to the term that follows.
I also added a decompound filter that expands the compounds and inflects 
defined in the dictionary, and a part-of-speech filter, similar to the Japanese 
one, that removes the morphemes that are not useful for relevance (suffix, prefix, 
interjection, ...). These filters don't play well with the tokenizer if it can 
output multiple paths (nBest output, for instance), so for simplicity I removed 
this ability and the Korean tokenizer only outputs the best path.
I compared the results with mecab-ko to confirm that the analyzer is working, and 
ran the relevancy test defined in HantecRel.java, included 
in the patch (written by Robert for another Korean analyzer). Here are the 
results:

||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
|Standard|35s|131MB|.007|.1044|.1053|
|CJK|36s|164MB|.1418|.1924|.1916|
|Korean|212s|90MB|.1628|.2094|.2078|

I find the results very promising, so I plan to continue to work on this 
project. I started to extract the part of the code that could be shared with the 
Kuromoji module, but I wanted to share the status and this POC first to confirm 
that this approach is viable. The advantages of using the same model as 
the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
moment ;), the resulting dictionary is small compared to other libraries that 
use the mecab-ko-dic (the FST takes only 5.4MB), and the Tokenizer prunes the 
lattice on the fly to select the best path efficiently.
The dictionary can be built directly from the godori module with the following 
command:
ant regenerate (you need to create the resource directory (mkdir 
lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) first 
since the dictionary is not included in the patch).
I've also added some minimal tests in the module to play with the analysis.
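
To play with the analysis, here is a minimal sketch of dumping the token stream; the KoreanAnalyzer name is assumed from the patch and may change, but the consumption loop itself is the standard Lucene pattern:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanAnalyzer; // name assumed from the patch
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

Analyzer analyzer = new KoreanAnalyzer();
try (TokenStream ts = analyzer.tokenStream("body", "...some Korean text...")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    // Print each token with its character offsets in the input.
    System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
  }
  ts.end();
}
{code}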




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-28 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: (was: LUCENE-8231)

> Godori, a Korean analyzer based on mecab-ko-dic
> ---
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] [Updated] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-28 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Godori, a Korean analyzer based on mecab-ko-dic
> ---
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-28 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231

> Godori, a Korean analyzer based on mecab-ko-dic
> ---
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

2018-03-29 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418695#comment-16418695
 ] 

Jim Ferenczi commented on LUCENE-8196:
--

+1

> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> -
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch, 
> LUCENE-8196.patch, LUCENE-8196.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418690#comment-16418690
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Thanks for looking Robert !

{quote}

Should there be a ReadingFormFilter similar to the kuromoji case?

{quote}

I attached a new patch that adds this filter; the readings were already in the 
binary dictionary, so this does not change the size of the jar. 
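
For completeness, here is a minimal sketch of wiring the new filter into an analysis chain. The KoreanTokenizer and KoreanReadingFormFilter names are taken from the attached patch and may still change; the chain itself is just the standard Lucene Analyzer pattern:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanReadingFormFilter; // name assumed from the patch
import org.apache.lucene.analysis.ko.KoreanTokenizer;         // name assumed from the patch

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KoreanTokenizer();
    // Replaces the term text of Hanja tokens with their Hangul reading,
    // mirroring the Kuromoji ReadingFormFilter.
    TokenStream stream = new KoreanReadingFormFilter(tokenizer);
    return new TokenStreamComponents(tokenizer, stream);
  }
};
{code}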

{quote}

In the kuromoji case there is a lot of japanese-specific compression that 
Uwe and I did; if you are worried about size/RAM we can try to shrink this 
korean data in ways that make sense for it. That can really be a 
followup/polish/nice-to-have: how big is the built JAR now? something 
semi-reasonable?

{quote}

 

I think the size is reasonable, especially if you compare it with other libraries 
that use the mecab-ko-dic ;).

Though there is still some room for improvement. I did not add the semantic 
class of the token, but we could do the same as for the Japanese dictionary, where 
the POS tags are added in a separate file. The semantic class + POS is unique per 
leftId, so this could also save 1 byte in the binary dictionary (we use 1 byte per 
POS per term in the main dictionary).

The expression that contains the decompounds can also be compressed. For 
compound nouns I serialize the segmentations with the term, but we could just 
use offsets into the surface form (see the sketch below). This doesn't work for 
Inflects, which can add tokens or use a different form. To be honest I don't know 
how we can derive the offsets for the decompound of Inflects; I don't think there 
is an easy way to do that, but I could be completely wrong.
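
To make the "offsets into the surface form" idea concrete, here is a hypothetical sketch (the helper name and layout are invented for illustration); as noted above, it can only work for plain compounds, not for Inflects whose analysis may alter or add tokens:

{code:java}
// Hypothetical helper: store a compound's segmentation as morpheme lengths over
// the surface form instead of serializing the morpheme strings again.
static int[] encodeCompoundSegmentation(String surface, String[] morphemes) {
  int[] lengths = new int[morphemes.length];
  int pos = 0;
  for (int i = 0; i < morphemes.length; i++) {
    // Only valid when the morphemes are plain substrings of the surface form,
    // which is why the same trick does not apply to Inflect entries.
    if (!surface.startsWith(morphemes[i], pos)) {
      throw new IllegalArgumentException(morphemes[i] + " is not a segment of " + surface);
    }
    lengths[i] = morphemes[i].length();
    pos += lengths[i];
  }
  if (pos != surface.length()) {
    throw new IllegalArgumentException("segmentation does not cover the surface form");
  }
  return lengths;
}
{code}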

 

In the patch I attached, the user dictionary is broken: I copied the one from 
Kuromoji, but we should probably change it to accept simple nouns (NNG or NNP) 
with no segmentation, and use the PREANALYSIS type to add custom 
segmentations (or COMPOUND for nouns only).

 

I talked to my Korean colleagues and they told me that godori has a negative 
meaning in Korea. It is linked with illegal gambling, and it also has an ancient 
meaning of "killed by king's order", which is bad :(. This is what happens when 
you pick a name without knowing the culture, so I apologize for this. I changed 
the name to "nori", which is the name they proposed; it means joy/play. It is 
a very generic name; in Japanese it refers to the seaweed used to wrap sushi 
and onigiri, which I find nice because it's also a reference to the Japanese 
analyzer that served as the starting point for this one.

 

 

> Godori, a Korean analyzer based on mecab-ko-dic
> ---
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Summary: Nori, a Korean analyzer based on mecab-ko-dic  (was: Godori, a 
Korean analyzer based on mecab-ko-dic)

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (LUCENE-8231) Godori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Godori, a Korean analyzer based on mecab-ko-dic
> ---
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called godori (a 
> popular card game in Korea). It is an adaptation of the Kuromoji module so 
> currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Description: 
There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
It is available under an Apache license here:
https://bitbucket.org/eunjeon/mecab-ko-dic

This dictionary was built with MeCab, it defines a format for the features 
adapted for the Korean language.
Since the Kuromoji tokenizer uses the same format for the morphological 
analysis (left cost + right cost + word cost) I tried to adapt the module to 
handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
Kuromoji module and adapts it for the mecab-ko-dic.
I used the same classes to build and read the dictionary but I had to make some 
modifications to handle the differences with the IPADIC and Japanese. 
The resulting binary dictionary takes 28MB on disk, it's bigger than the IPADIC 
but mainly because the source is bigger and there are a lot of
compound and inflect terms that define a group of terms and the segmentation 
that can be applied. 
I attached the patch that contains this new Korean module called -godori- nori. 
It is an adaptation of the Kuromoji module so currently
the two modules don't share any code. I wanted to validate the approach first 
and check the relevancy of the results. I don't speak Korean so I used the 
relevancy
tests that were added for another Korean tokenizer 
(https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
I had to simplify the JapaneseTokenizer, my version removes the nBest output 
and the decomposition of too long tokens. I also
modified the handling of whitespaces since they are important in Korean. 
Whitespaces that appear before a term are attached to that term and this
information is used to compute a penalty based on the Part of Speech of the 
token. The penalty cost is a feature added to mecab-ko to handle 
morphemes that should not appear after a morpheme and is described in the 
mecab-ko page:
https://bitbucket.org/eunjeon/mecab-ko
Ignoring whitespaces is also more in line with the official MeCab library, which 
attaches the whitespaces to the term that follows.
I also added a decompounder filter that expands the compounds and inflects 
defined in the dictionary, and a part-of-speech filter similar to the Japanese one
that removes the morphemes that are not useful for relevance (suffix, prefix, 
interjection, ...). These filters don't play well with the tokenizer if it can 
output multiple paths (nBest output for instance) so for simplicity I removed 
this ability and the Korean tokenizer only outputs the best path.
I compared the result with mecab-ko to confirm that the analyzer is working and 
ran the relevancy test that is defined in HantecRel.java included
in the patch (written by Robert for another Korean analyzer). Here are the 
results:

||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
|Standard|35s|131MB|.007|.1044|.1053|
|CJK|36s|164MB|.1418|.1924|.1916|
|Korean|212s|90MB|.1628|.2094|.2078|

I find the results very promising so I plan to continue to work on this 
project. I started to extract the part of the code that could be shared with the
Kuromoji module but I wanted to share the status and this POC first to confirm 
that this approach is viable. The advantages of using the same model as
the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
moment ;), the resulting dictionary is small compared to other libraries that
use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
lattice on the fly to select the best path efficiently.
The dictionary can be built directly from the godori module with the following 
command:
ant regenerate (you need to create the resource directory (mkdir 
lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) first 
since the dictionary is not included in the patch).
I've also added some minimal tests in the module to play with the analysis.


  was:
There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
It is available under an Apache license here:
https://bitbucket.org/eunjeon/mecab-ko-dic

This dictionary was built with MeCab, it defines a format for the features 
adapted for the Korean language.
Since the Kuromoji tokenizer uses the same format for the morphological 
analysis (left cost + right cost + word cost) I tried to adapt the module to 
handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
Kuromoji module and adapts it for the mecab-ko-dic.
I used the same classes to build and read the dictionary but I had to make some 
modifications to handle the differences with the IPADIC and Japanese. 
The resulting binary dictionary takes 28MB on disk, it's bigger than the IPADIC 
but mainly because 

[jira] [Commented] (LUCENE-8229) Add a method to Weight to retrieve matches for a single document

2018-03-29 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418708#comment-16418708
 ] 

Jim Ferenczi commented on LUCENE-8229:
--

I like the proposal here. For simple queries it makes the extraction of matched 
positions trivial. I wonder how complex queries would handle this, though. For 
instance, the AutomatonQuery cannot just return an enum over all matching terms; we 
already have special handling for this query in highlighters to avoid that explosion. 
What is your current plan to handle this query? Should it return null for simplicity, 
or should it try to expand the automaton with a limit on the number of terms? I prefer 
the former, which is safe; if users want to check the matching of a complex automaton, 
they can use a MemoryIndex for each top document and change the query to use the 
rewrite method that builds a boolean query.
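
For illustration, here is a minimal sketch of that MemoryIndex fallback: index the 
single top document in a MemoryIndex and run the (rewritten) query against it. The 
helper class, field name and analyzer are made up, and rewriting the automaton into a 
boolean query is left to the caller.

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

// Hypothetical helper: checks whether one retrieved document matches a query by
// indexing just that document in a MemoryIndex (field name and analyzer are examples).
public class SingleDocMatchCheck {
  public static boolean matches(Query rewrittenQuery, String field, String text) {
    Analyzer analyzer = new StandardAnalyzer();
    MemoryIndex index = new MemoryIndex();
    index.addField(field, text, analyzer);
    // MemoryIndex#search returns a score greater than 0 when the query matches.
    return index.search(rewrittenQuery) > 0.0f;
  }
}
{code}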

> Add a method to Weight to retrieve matches for a single document
> 
>
> Key: LUCENE-8229
> URL: https://issues.apache.org/jira/browse/LUCENE-8229
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The ability to find out exactly what a query has matched on is a fairly 
> frequent feature request, and would also make highlighters much easier to 
> implement.  There have been a few attempts at doing this, including adding 
> positions to Scorers, or re-writing queries as Spans, but these all either 
> compromise general performance or involve up-front knowledge of all queries.
> Instead, I propose adding a method to Weight that exposes an iterator over 
> matches in a particular document and field.  It should be used in a similar 
> manner to explain() - ie, just for TopDocs, not as part of the scoring loop, 
> which relieves some of the pressure on performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1641#comment-1641
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

{quote}

and looking more, you'd need the full byte range to do that. So a BYTE1 FST with 
raw bytes (no UTF-8), but it's doable. The root cache would just be 256 entries. 
Maybe you just have a separate BYTE2 FST for the other stuff such as hanja 
forms, but I think overall the performance may be faster. The 
decomposition/recomposition is not so bad from what I remember, it's some simple 
math on the unicode codepoint numbers.

{quote}

 

That's a good idea. I'll give it a try. FYI I reindexed the HantecRel corpus 
without the root arc caching and it took 300s to build instead of 200s, so 
caching helps, but I agree that caching the full syllable range is too much.
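
To make the BYTE1 idea concrete, here is a rough sketch (not the patch code) of 
building a BYTE1 FST whose inputs are the raw UTF-8 bytes of the surface forms and 
whose outputs are word ids; the entry map and the Long outputs are assumptions. Keys 
must be added in sorted byte order, which BytesRef's unsigned ordering provides.

{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// Rough sketch: a BYTE1 FST keyed by raw UTF-8 bytes, mapping each surface form to a
// (made-up) word id. Entries are added in sorted order from the SortedMap.
public class Byte1FstSketch {
  public static FST<Long> build(SortedMap<BytesRef, Long> sortedEntries) throws IOException {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRefBuilder scratch = new IntsRefBuilder();
    for (Map.Entry<BytesRef, Long> entry : sortedEntries.entrySet()) {
      builder.add(Util.toIntsRef(entry.getKey(), scratch), entry.getValue());
    }
    return builder.finish();
  }
}
{code}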

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly 

[jira] [Comment Edited] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-02 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421999#comment-16421999
 ] 

Jim Ferenczi edited comment on LUCENE-8231 at 4/2/18 7:33 AM:
--

Hi Robert, thanks for your testing and suggestions!

I pushed another patch that applies your suggestions. The connection costs are now 
loaded in a direct byte buffer on init (the matrix is still compressed on disk). I 
also changed the format of the binary dictionary to use 3 bits of the left id to 
encode the type (compound, inflect, morpheme or preanalysis), added a POS dict that 
maps the left id to the part-of-speech tag, and introduced a new flag to indicate 
entries with a single POS. The new size on disk is 25MB, with 10MB on the heap at 
startup since we don't load the connection costs into the heap anymore. 
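
For illustration only, a sketch of the direct-buffer idea (this is not the binary 
format of the patch; the file layout, endianness and sizes are assumptions): the whole 
cost matrix is read once into a direct ByteBuffer and costs are read as shorts, so no 
large short[][] array lives on the heap.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical loader: reads a (forwardSize x backwardSize) matrix of short costs into
// a direct ByteBuffer at init and serves lookups from it (layout is an assumption).
public class DirectConnectionCosts {
  private final ByteBuffer buffer;
  private final int forwardSize;

  public DirectConnectionCosts(Path path, int forwardSize, int backwardSize) throws IOException {
    this.forwardSize = forwardSize;
    ByteBuffer direct = ByteBuffer.allocateDirect(forwardSize * backwardSize * Short.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
      while (direct.hasRemaining() && channel.read(direct) != -1) {
        // keep reading until the whole matrix is in the buffer
      }
    }
    this.buffer = direct;
  }

  /** Connection cost between a morpheme with the given forward id and one with the given backward id. */
  public int get(int forwardId, int backwardId) {
    return buffer.getShort((backwardId * forwardSize + forwardId) * Short.BYTES);
  }
}
{code}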


was (Author: jim.ferenczi):
Hi Robert, thanks for your testings and suggestions !

I pushed another patch that applies your suggestions. The connection costs is 
now loaded in a direct byte buffer on init (the matrix is still compressed on 
disk). I also changed the format of the binary dictionary to 3 bits of the left 
id to encode the type (compound, inflect, morpheme or prenanalysis), added a 
POS dict that maps the left id and the part of speech tag and introduced a new 
flag to indicate entries with a single POS. The new size on disk is 25MB with 
10MB on the heap at start since we don't load the connection costs in the heap 
anymore. 

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-02 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421999#comment-16421999
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Hi Robert, thanks for your testing and suggestions!

I pushed another patch that applies your suggestions. The connection costs are now 
loaded in a direct byte buffer on init (the matrix is still compressed on disk). I 
also changed the format of the binary dictionary to use 3 bits of the left id to 
encode the type (compound, inflect, morpheme or preanalysis), added a POS dict that 
maps the left id to the part-of-speech tag, and introduced a new flag to indicate 
entries with a single POS. The new size on disk is 25MB, with 10MB on the heap at 
startup since we don't load the connection costs into the heap anymore. 

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes 

[jira] [Comment Edited] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-02 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421999#comment-16421999
 ] 

Jim Ferenczi edited comment on LUCENE-8231 at 4/2/18 8:32 AM:
--

Hi Robert, thanks for your testing and suggestions!

I pushed another patch that applies your suggestions. The connection costs are now 
loaded in a direct byte buffer on init (the matrix is still compressed on disk). I 
also changed the format of the binary dictionary to use 2 bits of the left id to 
encode the type (compound, inflect, morpheme or preanalysis), added a POS dict that 
maps the left id to the part-of-speech tag, and introduced a new flag to indicate 
entries with a single POS. The new size on disk is 25MB, with 10MB on the heap at 
startup since we don't load the connection costs into the heap anymore. 
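
To illustrate the packing (the actual bit layout of the patch may differ), a small 
sketch that reserves 2 bits of the stored 16-bit left id field for the entry type:

{code:java}
// Hypothetical bit layout: the low 2 bits of the stored value carry the entry type,
// the remaining bits carry the left id. Constants and layout are illustrative only.
public final class LeftIdPacking {
  public static final int TYPE_MORPHEME = 0;
  public static final int TYPE_COMPOUND = 1;
  public static final int TYPE_INFLECT = 2;
  public static final int TYPE_PREANALYSIS = 3;

  private static final int TYPE_BITS = 2;
  private static final int TYPE_MASK = (1 << TYPE_BITS) - 1;

  public static short pack(int leftId, int type) {
    return (short) ((leftId << TYPE_BITS) | (type & TYPE_MASK));
  }

  public static int leftId(short packed) {
    return (packed & 0xFFFF) >>> TYPE_BITS;
  }

  public static int type(short packed) {
    return packed & TYPE_MASK;
  }
}
{code}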


was (Author: jim.ferenczi):
Hi Robert, thanks for your testings and suggestions !

I pushed another patch that applies your suggestions. The connection costs is 
now loaded in a direct byte buffer on init (the matrix is still compressed on 
disk). I also changed the format of the binary dictionary to use 3 bits of the 
left id to encode the type (compound, inflect, morpheme or prenanalysis), added 
a POS dict that maps the left id and the part of speech tag and introduced a 
new flag to indicate entries with a single POS. The new size on disk is 25MB 
with 10MB on the heap at start since we don't load the connection costs in the 
heap anymore. 

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-02 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-02 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-02 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: (was: LUCENE-8231.patch)

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-02 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422791#comment-16422791
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I attached a new patch with lots of cleanups and fixes. I ran HantecRel again; here 
are the results:

||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
|Korean|178s|90MB|.1638|.2101|.2081|

I am not sure why it got faster; it could be the new compressed format. I'll dig.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231-remap-hangul.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419178#comment-16419178
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Sure, I attached a new patch (LUCENE-8231-remap-hangul.patch) that applies the 
remap at build and analyze time. I skipped all entries that are not Hangul or 
Latin-1 chars to make it easier to test. I must have missed something, so thanks 
for testing!
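
As a toy illustration of the remap idea (the mapping in LUCENE-8231-remap-hangul.patch 
may differ), the Hangul syllable block can be shifted into a compact label range right 
after Latin-1 at build time, with the same function applied at analyze time:

{code:java}
// Toy remap: move Hangul syllables (U+AC00..U+D7A3) into a contiguous range starting
// right after Latin-1. Base offset and the handling of other blocks are assumptions.
public final class HangulRemap {
  private static final int HANGUL_START = 0xAC00;
  private static final int HANGUL_END = 0xD7A3;
  private static final int REMAP_BASE = 0x0100; // first label after the Latin-1 block

  public static int remap(int codePoint) {
    if (codePoint >= HANGUL_START && codePoint <= HANGUL_END) {
      return REMAP_BASE + (codePoint - HANGUL_START);
    }
    return codePoint; // e.g. Latin-1 entries are kept as-is
  }
}
{code}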

 

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-03-29 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419121#comment-16419121
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

I tried this approach and generated a new FST with the remapped chars. The size 
of the FST after conversion is 4MB, plus 1MB for the separate Hanja FST, which 
is roughly the same size as the FST with the Hangul syllables and the Hanja 
together (5.4MB). I also ran the HantecRel indexation and it took approximately 
235s to build (I tried multiple times and the times were pretty consistent) 
with root arc caching for the first 255 arcs. That's surprising because it's 
slower than the FST with Hangul syllables and root caching (200s), so I wonder 
if this feature is worth the complexity? I checked the size of the root arc 
cache for the 11,171 Hangul syllables and it takes approximately 250kB, which 
is not bad considering that this version is faster.

 

I'll try the compression for compounds now.

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small 

[jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-04 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425218#comment-16425218
 ] 

Jim Ferenczi commented on LUCENE-8231:
--

Hi Robert,
I pushed another iteration that moves the decompound process and the POS 
filtering into the tokenizer. I think it's simpler to perform the decompound 
and the filtering directly in the tokenizer, and it also allows keeping the 
compound token (I added a decompound mode option that disables decompounding 
(none), discards the compound (discard), or performs the decompound and keeps 
the original token (mixed)). By default the compound token is discarded, but it 
can be kept using the mixed mode.
I also changed the normalization option when building the dictionary: instead 
of adding both the normalized form and the original form, the builder now 
replaces the original form with the normalized one. By default the 
normalization is not activated, but it can be useful for other Korean 
dictionaries that use a decomposed form for Hangul, like Handic for instance:
https://ja.osdn.net/projects/handic/
I added more tests and javadocs, I think it's getting closer ;)
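
For reference, here is roughly how the mixed mode would be used (just a sketch: 
it assumes the KoreanAnalyzer / KoreanTokenizer.DecompoundMode names as they 
would appear in the module, and the exact constructor arguments in the attached 
patch may differ):

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanAnalyzer;
import org.apache.lucene.analysis.ko.KoreanPartOfSpeechStopFilter;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecompoundModeDemo {
  public static void main(String[] args) throws IOException {
    // MIXED emits the parts of a compound and keeps the original compound token,
    // DISCARD (the default) only emits the parts, NONE keeps the compound as-is.
    try (Analyzer analyzer = new KoreanAnalyzer(null /* no user dictionary */,
             KoreanTokenizer.DecompoundMode.MIXED,
             KoreanPartOfSpeechStopFilter.DEFAULT_STOP_TAGS,
             false /* outputUnknownUnigrams */);
         TokenStream ts = analyzer.tokenStream("field", new StringReader("가거도항"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
{code}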


> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I 

[jira] [Updated] (LUCENE-8231) Nori, a Korean analyzer based on mecab-ko-dic

2018-04-04 Thread Jim Ferenczi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8231:
-
Attachment: LUCENE-8231.patch

> Nori, a Korean analyzer based on mecab-ko-dic
> -
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab, it defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk, it's bigger than the 
> IPADIC but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean so I used the 
> relevancy
> tests that was added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko which is the official fork of mecab to use the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer, my version removes the nBest output 
> and the decomposition of too long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more inlined with the official MeCab library 
> which attach the whitespaces to the term that follows.
> I also added a decompounder filter that expand the compounds and inflects 
> defined in the dictionary and a part of speech filter similar to the Japanese
> that removes the morpheme that are not useful for relevance (suffix, prefix, 
> interjection, ...). These filters don't play well with the tokenizer if it 
> can 
> output multiple paths (nBest output for instance) so for simplicity I removed 
> this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising so I plan to continue to work on this 
> project. I started to extract the part of the code that could be shared with 
> the
> Kuromoji module but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> than
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org

[jira] [Commented] (LUCENE-8202) Add a FixedShingleFilter

2018-03-22 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409390#comment-16409390
 ] 

Jim Ferenczi commented on LUCENE-8202:
--

Sure, +1 for the exception. I don't think that this limit should be 
configurable though, 1000 seems more than enough to handle normal cases?

> Add a FixedShingleFilter
> 
>
> Key: LUCENE-8202
> URL: https://issues.apache.org/jira/browse/LUCENE-8202
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8202.patch, LUCENE-8202.patch, LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and 
> emit arbitrary graphs, while duplicating all the functionality of the 
> existing ShingleFilter.  This ends up being extremely hairy, and doesn't play 
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be 
> used for index-time phrase tokenization only.  It will have a single fixed 
> shingle size, can deal with single-token synonyms, and won't emit unigrams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8202) Add a FixedShingleFilter

2018-03-22 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409382#comment-16409382
 ] 

Jim Ferenczi commented on LUCENE-8202:
--

+1 to set the position length to 1; this is a fixed-size shingle filter so 
there's no additional information in this attribute.
Regarding the explosion of the number of terms, can you track the total number 
of tokens that need to produce a shingle at the next position and ignore new 
tokens with posIncr=0 if the number is too high (1000?)?
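
Something along these lines is what I have in mind (a rough sketch, not the 
actual FixedShingleFilter code; the class name and the hard limit are made up 
for illustration):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Sketch: fail fast when too many tokens are stacked at the same position. */
final class CappedStackedTokensFilter extends TokenFilter {
  private static final int MAX_STACKED_TOKENS = 1000;

  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private int stackedTokens = 0;

  CappedStackedTokensFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken() == false) {
      return false;
    }
    if (posIncAtt.getPositionIncrement() == 0) {
      // another token stacked at the same position as the previous one
      if (++stackedTokens > MAX_STACKED_TOKENS) {
        throw new IllegalStateException(
            "Too many stacked tokens at the same position (> " + MAX_STACKED_TOKENS + ")");
      }
    } else {
      stackedTokens = 0; // new position, reset the counter
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    stackedTokens = 0;
  }
}
{code}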


> Add a FixedShingleFilter
> 
>
> Key: LUCENE-8202
> URL: https://issues.apache.org/jira/browse/LUCENE-8202
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8202.patch, LUCENE-8202.patch, LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and 
> emit arbitrary graphs, while duplicating all the functionality of the 
> existing ShingleFilter.  This ends up being extremely hairy, and doesn't play 
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be 
> used for index-time phrase tokenization only.  It will have a single fixed 
> shingle size, can deal with single-token synonyms, and won't emit unigrams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8202) Add a FixedShingleFilter

2018-03-21 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407723#comment-16407723
 ] 

Jim Ferenczi commented on LUCENE-8202:
--

+1, thanks Alan. 

> Add a FixedShingleFilter
> 
>
> Key: LUCENE-8202
> URL: https://issues.apache.org/jira/browse/LUCENE-8202
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8202.patch, LUCENE-8202.patch, LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and 
> emit arbitrary graphs, while duplicating all the functionality of the 
> existing ShingleFilter.  This ends up being extremely hairy, and doesn't play 
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be 
> used for index-time phrase tokenization only.  It will have a single fixed 
> shingle size, can deal with single-token synonyms, and won't emit unigrams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

2018-03-19 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404532#comment-16404532
 ] 

Jim Ferenczi commented on LUCENE-8196:
--

+1 too, there are some places where you could initialize the current interval 
with [-∞..-∞] in order to avoid the nullity check.

Most of the operators' algorithms seem good, though I don't understand why you 
changed the order of the disjunction? If you don't start with the smallest 
right interval from the queue you could miss a lot of minimum intervals that 
could be needed if the disjunction is used inside another operator?
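
To illustrate the first point, a sketch of the sentinel I have in mind 
(hypothetical Interval class, not the API of the patch):

{code:java}
/** Hypothetical sketch: a sentinel interval instead of a null current interval. */
final class Interval {
  // [-∞..-∞] sorts before every real interval, so callers never need a null check.
  static final Interval NO_INTERVAL =
      new Interval(Integer.MIN_VALUE, Integer.MIN_VALUE);

  final int start;
  final int end;

  Interval(int start, int end) {
    this.start = start;
    this.end = end;
  }
}

// Instead of:
//   if (current == null || candidate.end > current.end) { current = candidate; }
// an iterator can start with current = Interval.NO_INTERVAL and simply write:
//   if (candidate.end > current.end) { current = candidate; }
{code}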

> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> -
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8182) BoostingQuery applies the wrong boost to the query score

2018-03-01 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382845#comment-16382845
 ] 

Jim Ferenczi commented on LUCENE-8182:
--

Thanks [~hossman]. I pushed a commit to add the missing changes in master.

> BoostingQuery applies the wrong boost to the query score
> 
>
> Key: LUCENE-8182
> URL: https://issues.apache.org/jira/browse/LUCENE-8182
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.0, 7.1, 7.2
>Reporter: Jim Ferenczi
>Priority: Major
> Fix For: 7.3
>
> Attachments: LUCENE-8182.patch, LUCENE-8182.patch, LUCENE-8182.patch
>
>
> BoostingQuery applies the parent query boost instead of the boost set on the 
> query due to a name clash in the anonymous class created by the createWeight 
> method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8526) StandardTokenizer doesn't separate hangul characters from other non-CJK chars

2018-10-05 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8526:


 Summary: StandardTokenizer doesn't separate hangul characters from 
other non-CJK chars
 Key: LUCENE-8526
 URL: https://issues.apache.org/jira/browse/LUCENE-8526
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


It was first reported here 
https://github.com/elastic/elasticsearch/issues/34285.
I don't know if it's the expected behavior but the StandardTokenizer does not 
split words which are composed of a mix of non-CJK characters and Hangul 
syllables. For instance "한국2018" or "한국abc" is kept as-is by this tokenizer 
and marked as an alpha-numeric group. This breaks the CJKBigram token filter 
which will not build bigrams on such groups. The other CJK characters are 
correctly split when they are mixed with other alphabets, so I'd expect the 
same for Hangul.
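
For example, a minimal sketch that shows the behavior through the 
StandardAnalyzer (which wraps the StandardTokenizer):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class HangulSplitDemo {
  public static void main(String[] args) throws IOException {
    try (StandardAnalyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("field", "한국2018 한국abc")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // What I would expect, for consistency with other CJK scripts: 한국 / 2018 / 한국 / abc
        // What is emitted today: 한국2018 / 한국abc, each kept as a single alpha-numeric token
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
{code}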



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8526) StandardTokenizer doesn't separate hangul characters from other non-CJK chars

2018-10-05 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640245#comment-16640245
 ] 

Jim Ferenczi commented on LUCENE-8526:
--

Ok, thanks for explaining [~steve_rowe]. I thought that script boundary breaks 
were part of UAX#29 and that the ICUTokenizer and StandardTokenizer should 
behave the same regarding CJK splits. Maybe we can add a note to the CJKBigram 
filter regarding this behavior when the StandardTokenizer is used?

> StandardTokenizer doesn't separate hangul characters from other non-CJK chars
> -
>
> Key: LUCENE-8526
> URL: https://issues.apache.org/jira/browse/LUCENE-8526
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> It was first reported here 
> https://github.com/elastic/elasticsearch/issues/34285.
> I don't know if it's the expected behavior but the StandardTokenizer does not 
> split words
> which are composed of a mixed of non-CJK characters and hangul syllabs. For 
> instance "한국2018" or "한국abc" is kept as is by this tokenizer and mark as an 
> alpha-numeric group. This breaks the CJKBigram token filter which will not 
> build bigrams on such groups. The other CJK characters are correctly splitted 
> when they are mixed with other alphabet so I'd expect the same for hangul.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8526) StandardTokenizer doesn't separate hangul characters from other non-CJK chars

2018-10-05 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640282#comment-16640282
 ] 

Jim Ferenczi commented on LUCENE-8526:
--

Sounds great [~steve_rowe]. I'll prepare a patch.

> StandardTokenizer doesn't separate hangul characters from other non-CJK chars
> -
>
> Key: LUCENE-8526
> URL: https://issues.apache.org/jira/browse/LUCENE-8526
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> It was first reported here 
> https://github.com/elastic/elasticsearch/issues/34285.
> I don't know if it's the expected behavior but the StandardTokenizer does not 
> split words
> which are composed of a mixed of non-CJK characters and hangul syllabs. For 
> instance "한국2018" or "한국abc" is kept as is by this tokenizer and mark as an 
> alpha-numeric group. This breaks the CJKBigram token filter which will not 
> build bigrams on such groups. The other CJK characters are correctly splitted 
> when they are mixed with other alphabet so I'd expect the same for hangul.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8529) Use the completion key to tiebreak completion suggestion

2018-10-11 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8529:


 Summary: Use the completion key to tiebreak completion suggestion
 Key: LUCENE-8529
 URL: https://issues.apache.org/jira/browse/LUCENE-8529
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Today the completion suggester uses the document id to tiebreak completion 
suggestions with the same score. It would improve the stability of the sort to 
use the surface form of the suggestions as the first tiebreaker.
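
For instance, the proposed sort order could look like this (a sketch with a 
hypothetical Suggestion holder class, not the suggester's internal API):

{code:java}
import java.util.Comparator;

/** Sketch of the proposed tie-break order (hypothetical holder class). */
final class Suggestion {
  final float score;
  final CharSequence surfaceForm;
  final int docId;

  Suggestion(float score, CharSequence surfaceForm, int docId) {
    this.score = score;
    this.surfaceForm = surfaceForm;
    this.docId = docId;
  }

  // Higher score first, then the surface form (stable across segments and merges),
  // and only then the document id as a last resort.
  static final Comparator<Suggestion> ORDER =
      Comparator.comparingDouble((Suggestion s) -> s.score).reversed()
          .thenComparing(s -> s.surfaceForm.toString())
          .thenComparingInt(s -> s.docId);
}
{code}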



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8531) QueryBuilder hard-codes inOrder=true for generated sloppy span near queries

2018-10-15 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650813#comment-16650813
 ] 

Jim Ferenczi commented on LUCENE-8531:
--

(Multi)PhraseQuery-s allow some reordering but the semantics are different from 
an unordered span near query.
I don't think we can respect the slop correctly if we continue to use span 
queries here. We switched to span queries to avoid searching duplicate terms in 
multiple phrase queries, but I agree that the behavior is not consistent when a 
slop is used. Maybe we could switch back to the old method of building one 
phrase query per path when a slop is used? This way we could apply the slop to 
each phrase query independently. This is more costly than the span method but 
it would be semantically correct.
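
Roughly what I mean (a sketch that assumes a hypothetical helper produced the 
list of token paths of the graph; this is not the current QueryBuilder code):

{code:java}
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

final class GraphPhraseSketch {
  /**
   * Builds one sloppy PhraseQuery per path of the token graph and combines them
   * with SHOULD clauses, so the slop is interpreted per phrase (reordering edits
   * allowed) instead of by a single ordered SpanNearQuery.
   */
  static Query phrasePerPath(String field, List<List<String>> paths, int slop) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (List<String> path : paths) {
      PhraseQuery.Builder phrase = new PhraseQuery.Builder();
      phrase.setSlop(slop);
      int position = 0;
      for (String term : path) {
        phrase.add(new Term(field, term), position++);
      }
      builder.add(phrase.build(), BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }
}
{code}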

> QueryBuilder hard-codes inOrder=true for generated sloppy span near queries
> ---
>
> Key: LUCENE-8531
> URL: https://issues.apache.org/jira/browse/LUCENE-8531
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Steve Rowe
>Assignee: Steve Rowe
>Priority: Major
>
> QueryBuilder.analyzeGraphPhrase() generates SpanNearQuery-s with passed-in 
> phraseSlop, but hard-codes inOrder ctor param as true.
> Before multi-term synonym support and graph token streams introduced the 
> possibility of generating SpanNearQuery-s, QueryBuilder generated 
> (Multi)PhraseQuery-s, which always interpret slop as allowing reordering 
> edits.  Solr's eDismax query parser generates phrase queries when its 
> pf/pf2/pf3 params are specified, and when multi-term synonyms are used with a 
> graph-aware synonym filter, SpanNearQuery-s are generated that require 
> clauses to be in order; unlike with (Multi)PhraseQuery-s, reordering edits 
> are not allowed, so this is a kind of regression.  See SOLR-12243 for edismax 
> pf/pf2/pf3 context.  (Note that the patch on SOLR-12243 also addresses 
> another problem that blocks eDismax from generating queries *at all* under 
> the above-described circumstances.)
> I propose adding a new analyzeGraphPhrase() method that allows configuration 
> of inOrder, which would allow eDismax to specify inOrder=false.  The existing 
> analyzeGraphPhrase() method would remain with its hard-coded inOrder=true, so 
> existing client behavior would remain unchanged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8535) Should we drop support for highlighting block-join queris

2018-10-17 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654045#comment-16654045
 ] 

Jim Ferenczi commented on LUCENE-8535:
--

+1 to support this through the extension points. We can add javadocs explaining 
why we don't handle all queries in the relevant class 
(WeightedSpanTermExtractor).

> Should we drop support for highlighting block-join queris
> -
>
> Key: LUCENE-8535
> URL: https://issues.apache.org/jira/browse/LUCENE-8535
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: master (8.0)
>Reporter: Simon Willnauer
>Priority: Major
>
> This is a spin-off from LUCENE-6572. We currently depend on the block-join 
> module which is due to the fact that we try to highlight the queries wrapped 
> by the block join queries. The current discussion on LUCENE-6572 mentioned 
> that this doesn't make much sense from an highlighting perspecitve and if we 
> should drop support for it. Lucene 8.0 would be a good time to do so.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


