[jira] [Commented] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Christian Moen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918464#comment-16918464
 ] 

Christian Moen commented on LUCENE-8959:


Sounds like a good idea.  This is also a rather big rabbit hole...

Would it be useful to consider making the digit grouping separators 
configurable as part of a bigger scheme here?

In Japanese, if you're processing text with SI-formatted numbers, I believe the space is a 
valid digit grouping separator.
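
A minimal sketch of what a configurable separator set could look like; the class below is purely hypothetical (not an existing Kuromoji API) and only illustrates treating the comma, the regular space and the no-break space (SI-style grouping) uniformly:

{noformat}
import java.util.Set;

// Hypothetical sketch only -- not an existing Lucene/Kuromoji API.
final class DigitGroupingSeparators {
  private final Set<Character> separators;

  DigitGroupingSeparators(Set<Character> separators) {
    this.separators = separators;
  }

  boolean isGroupingSeparator(char c) {
    return separators.contains(c);
  }

  public static void main(String[] args) {
    // Comma plus regular space and no-break space, for SI-formatted numbers like "10 000".
    DigitGroupingSeparators si = new DigitGroupingSeparators(Set.of(',', ' ', '\u00A0'));
    System.out.println(si.isGroupingSeparator(' '));  // true
    System.out.println(si.isGroupingSeparator('.'));  // false
  }
}
{noformat}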

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --
>
> Key: LUCENE-8959
> URL: https://issues.apache.org/jira/browse/LUCENE-8959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespace. So for instance "10 100" is rewritten into "10100" 
> even if the tokenizer doesn't discard punctuation. In practice this is not 
> an issue, but it can lead to giant number tokens if there are a lot of 
> numbers separated by spaces. The number of concatenations should be 
> configurable with a sane default limit in order to avoid creating big tokens 
> that slow down the analysis.
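
A minimal sketch that reproduces the behaviour described above; the analyzer wiring is an assumption (any chain that keeps punctuation and applies {{JapaneseNumberFilter}} should show it), and only the "10 100" -> "10100" outcome is taken from the report:

{noformat}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NumberConcatenationDemo {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // discardPunctuation=false, so whitespace tokens reach the number filter.
        Tokenizer tokenizer =
            new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.NORMAL);
        return new TokenStreamComponents(tokenizer, new JapaneseNumberFilter(tokenizer));
      }
    };
    try (TokenStream ts = analyzer.tokenStream("f", "10 100")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // With the current filter this prints a single concatenated token, "10100".
        System.out.println(term);
      }
      ts.end();
    }
  }
}
{noformat}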






[jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-09 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859673#comment-16859673
 ] 

Christian Moen commented on LUCENE-8817:


Thanks, [~tomoko].  I don't think we should use "mecab" in the naming.  Please 
let me elaborate a bit.

Kuromoji can read MeCab format models, but Kuromoji isn't a port of MeCab.  
Kuromoji has been developed independently without inspecting or reviewing any 
MeCab source code.  This was an initial goal of the project to make sure we 
could use an Apache License.

The MeCab and Kuromoji feature sets are quite different and I think users will 
find it confusing if they expect MeCab and find that Kuromoji is much more 
limited.

I'm also unsure if Kudo-san will appreciate that we make an association by name 
like this.  It certainly doesn't give due credit to MeCab, in my opinion, which 
is a much more extensive project.

In terms of naming, what about using "statistical" instead of "mecab" for this 
class of analyzers?

I'm thinking "Viterbi" could be good to refer to in shared tokenizer code.

This said, I think it could be good to refer to "mecab" in the dictionary 
compiler code, documentation, etc. to make sure users understand that we can 
read this model format.

Any thoughts?

> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently Nori and Kuromoji Analyzer use the same dictionary structure. 
> (MeCab)
>  If we combine the DictionaryBuilders, we can reduce the code size.
>  But this task may have a dependency on the language.
>  (like HEADER string in BinaryDictionary and CharacterDefinition, methods in 
> BinaryDictionaryWriter, ...)
>  On the other hand, there are many overlapped classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is some work involved.
>  The work will be based on the latest master, and if LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.
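
A rough sketch of the kind of shared abstraction the issue is after; every name below is illustrative only, not an existing Lucene class:

{noformat}
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch only -- one builder interface, with per-language
// implementations that differ just in schema details (POS tag sets,
// HEADER strings, CharacterDefinition, ...).
interface SystemDictionaryBuilder {
  void build(Path mecabSourceDir, Path outputDir, String encoding) throws IOException;
}

final class JapaneseDictionaryBuilder implements SystemDictionaryBuilder {
  @Override
  public void build(Path mecabSourceDir, Path outputDir, String encoding) throws IOException {
    // Would parse the MeCab-format CSV/def files and write Kuromoji's binary dictionary.
    throw new UnsupportedOperationException("sketch only");
  }
}

final class KoreanDictionaryBuilder implements SystemDictionaryBuilder {
  @Override
  public void build(Path mecabSourceDir, Path outputDir, String encoding) throws IOException {
    // Would do the same for Nori, reusing the shared parsing and writing code.
    throw new UnsupportedOperationException("sketch only");
  }
}
{noformat}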






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852551#comment-16852551
 ] 

Christian Moen commented on LUCENE-8816:


Separating out the dictionaries is a great idea.

[~rcmuir] made great efforts making the original dictionary tiny and some 
assumptions were made based on the value ranges of the original source data.

To me it sounds like a good idea to keep the Japanese and Korean dictionaries 
separate initially and consider combining them later on when the implications of 
such a combination are clear.  I agree with [~jim.ferenczi].

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many years. 
> While it has slowly become obsolete, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, several 
> attempts/projects/efforts have been made in Japan.
> However, the current architecture - a dictionary bundled into the jar - is essentially 
> incompatible with the idea of switching the system dictionary, and developers 
> have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's a good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.
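
To make the goals a bit more concrete, a rough sketch of what a decoupled, dynamically loaded system dictionary could look like; every name below is hypothetical, not an existing Lucene API:

{noformat}
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch only -- illustrates the "decouple + dynamic load" goals above.

/** Marker for the loaded, encoded language model that the Viterbi code would consume. */
interface SystemDictionary {}

/** Loads TokenInfo/Unknown/ConnectionCosts data from an external location. */
interface SystemDictionaryProvider {
  SystemDictionary load() throws IOException;
}

final class DirectoryDictionaryProvider implements SystemDictionaryProvider {
  private final Path dictionaryDir;

  DirectoryDictionaryProvider(Path dictionaryDir) {
    this.dictionaryDir = dictionaryDir;
  }

  @Override
  public SystemDictionary load() throws IOException {
    // Would read pre-compiled binary files (built from e.g. UniDic or
    // mecab-ipadic-neologd) from dictionaryDir instead of the bundled jar.
    throw new UnsupportedOperationException("sketch only: " + dictionaryDir);
  }
}
{noformat}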






[jira] [Commented] (LUCENE-8752) Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' (REIWA)

2019-04-10 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814467#comment-16814467
 ] 

Christian Moen commented on LUCENE-8752:


Thanks a lot, [~Tomoko Uchida].

> Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' 
> (REIWA)
> -
>
> Key: LUCENE-8752
> URL: https://issues.apache.org/jira/browse/LUCENE-8752
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
>
> As of May 1st, 2019, Japanese era '元号' (Gengo) will be set to '令和' (Reiwa). 
> See this article for more details:
> [https://www.bbc.com/news/world-asia-47769566]
> Currently '令和' is split up into '令' and '和' by {{JapaneseTokenizer}}. It 
> should be tokenized as one word so that Japanese texts including era names 
> are searched as users expect. Because the default Kuromoji dictionary 
> (mecab-ipadic) has not been maintained since 2007, a one-line patch to the 
> source CSV file is needed for this era change.
> Era names are used in many official or formal documents in Japan, so it would 
> be desirable that search systems properly handle this without adding a user 
> dictionary or using a phrase query. :)
> FYI, JDK DateTime API will support the new era (in the next updates.)
> [https://blogs.oracle.com/java-platform-group/a-new-japanese-era-for-java]
> The patch is available here:
> [https://github.com/apache/lucene-solr/pull/632]
>  
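
Until a patched dictionary ships, a user-dictionary entry is the usual workaround (exactly what this issue aims to make unnecessary). A minimal sketch, assuming the standard Kuromoji user-dictionary CSV format and an arbitrary custom part-of-speech tag:

{noformat}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReiwaUserDictionaryDemo {
  public static void main(String[] args) throws IOException {
    // surface, segmentation, readings, part-of-speech (the POS tag here is arbitrary)
    String entry = "令和,令和,レイワ,カスタム名詞\n";
    UserDictionary userDict = UserDictionary.open(new StringReader(entry));

    Tokenizer tokenizer =
        new JapaneseTokenizer(userDict, true, JapaneseTokenizer.Mode.SEARCH);
    tokenizer.setReader(new StringReader("令和元年"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Expected to include 令和 as a single token instead of 令 + 和.
      System.out.println(term);
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{noformat}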






[jira] [Commented] (LUCENE-8752) Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' (REIWA)

2019-04-05 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811412#comment-16811412
 ] 

Christian Moen commented on LUCENE-8752:


Thanks for this, [~Tomoko Uchida].  I think it's a good idea to make this 
change.  I'll follow early next week.

> Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' 
> (REIWA)
> -
>
> Key: LUCENE-8752
> URL: https://issues.apache.org/jira/browse/LUCENE-8752
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
>
> As of May 1st, 2019, Japanese era '元号' (Gengo) will be set to '令和' (Reiwa). 
> See this article for more details:
> [https://www.bbc.com/news/world-asia-47769566]
> Currently '令和' is split up into '令' and '和' by {{JapaneseTokenizer}}. It 
> should be tokenized as one word so that Japanese texts including era names 
> are searched as users expect. Because the default Kuromoji dictionary 
> (mecab-ipadic) has not been maintained since 2007, a one-line patch to the 
> source CSV file is needed for this era change.
> Era names are used in many official or formal documents in Japan, so it would 
> be desirable that search systems properly handle this without adding a user 
> dictionary or using a phrase query. :)
> FYI, JDK DateTime API will support the new era (in the next updates.)
> [https://blogs.oracle.com/java-platform-group/a-new-japanese-era-for-java]
> The patch is available here:
> [https://github.com/apache/lucene-solr/pull/632]
>  






[jira] [Assigned] (LUCENE-7992) Kuromoji fails with UnsupportedOperationException in case of duplicate keys in the user dictionary

2017-10-12 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-7992:
--

Assignee: Christian Moen

> Kuromoji fails with UnsupportedOperationException in case of duplicate keys 
> in the user dictionary
> --
>
> Key: LUCENE-7992
> URL: https://issues.apache.org/jira/browse/LUCENE-7992
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Christian Moen
>Priority: Minor
>
> Failing is the right thing to do but the exception could clarify the source 
> of the problem. Today it just throws an UnsupportedOperationException with no 
> error message because of a call to PositiveIntOutputs.merge.
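
A minimal reproduction sketch, assuming the standard user-dictionary CSV format; the duplicated surface form is what triggers the bare {{UnsupportedOperationException}} from {{PositiveIntOutputs.merge}}:

{noformat}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class DuplicateUserDictionaryEntry {
  public static void main(String[] args) throws IOException {
    // Two entries with the same surface form.
    String entries =
        "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n"
            + "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n";
    // Currently fails with an UnsupportedOperationException that carries no message.
    UserDictionary.open(new StringReader(entries));
  }
}
{noformat}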






[jira] [Commented] (LUCENE-7992) Kuromoji fails with UnsupportedOperationException in case of duplicate keys in the user dictionary

2017-10-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202098#comment-16202098
 ] 

Christian Moen commented on LUCENE-7992:


Thanks, Adrien.  I'll have a look. 

> Kuromoji fails with UnsupportedOperationException in case of duplicate keys 
> in the user dictionary
> --
>
> Key: LUCENE-7992
> URL: https://issues.apache.org/jira/browse/LUCENE-7992
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> Failing is the right thing to do but the exception could clarify the source 
> of the problem. Today it just throws an UnsupportedOperationException with no 
> error message because of a call to PositiveIntOutputs.merge.






[jira] [Assigned] (LUCENE-7181) JapaneseTokenizer: Validate segmentation of User Dictionary entries on creation

2016-04-08 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-7181:
--

Assignee: Christian Moen

> JapaneseTokenizer: Validate segmentation of User Dictionary entries on 
> creation
> ---
>
> Key: LUCENE-7181
> URL: https://issues.apache.org/jira/browse/LUCENE-7181
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomás Fernández Löbbe
>Assignee: Christian Moen
> Attachments: LUCENE-7181.patch
>
>
> From the [conversation on the dev 
> list|http://mail-archives.apache.org/mod_mbox/lucene-dev/201604.mbox/%3CCAMJgJxR8gLnXi7WXkN3KFfxHu=posevxxarbbg+chce1tzh...@mail.gmail.com%3E]
> The user dictionary in the {{JapaneseTokenizer}} allows users to customize 
> how a stream is broken into tokens using a specific set of rules provided 
> like: 
> AABBBCC -> AA BBB CC
> It does not allow users to change any of the token characters like:
> (1) AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC", 
> seems to only care about positions) 
> It also doesn't let a character be part of more than one token, like:
> (2) AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> ..or make the output token bigger than the input text: 
> (3) AA -> AAA (Also AIOOBE)
> Currently there is no validation for those cases: case 1 doesn't fail but 
> provides unexpected tokens, while cases 2 and 3 fail when the input text is 
> analyzed. We should add validation to {{UserDictionary}} creation.
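
A minimal validation sketch covering the three cases above (illustrative only, not the attached patch): the concatenated segments must spell out exactly the surface form being matched.

{noformat}
// Illustrative only -- not the attached patch.
final class UserDictionaryValidation {
  static void validateEntry(String surface, String[] segments) {
    String joined = String.join("", segments);
    // One check catches case 1 (changed characters), case 2 (overlapping
    // segments) and case 3 (output longer than the input).
    if (!joined.equals(surface)) {
      throw new IllegalArgumentException(
          "Illegal user dictionary entry: segmentation \"" + joined
              + "\" does not match surface form \"" + surface + "\"");
    }
  }

  public static void main(String[] args) {
    validateEntry("AABBBCC", new String[] {"AA", "BBB", "CC"});   // ok
    validateEntry("AABBBCC", new String[] {"AAB", "BBB", "BCC"}); // rejected (case 2)
  }
}
{noformat}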






[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2016-01-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093096#comment-15093096
 ] 

Christian Moen commented on LUCENE-6837:


Hello Mike,

Yes, I'd like to backport this to 5.5.

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837 for 5.4.zip, LUCENE-6837.patch, 
> LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it increases the hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-27 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-6837:
---
Attachment: LUCENE-6837.patch

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, 
> LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-27 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029762#comment-15029762
 ] 

Christian Moen commented on LUCENE-6837:


Thanks a lot, Konno-san.  Things look good.  My apologies that I couldn't look 
into this earlier.

I've attached a new patch where I've included your fix and also renamed some 
methods.  I think it's getting ready...


> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch, 
> LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-18 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15010673#comment-15010673
 ] 

Christian Moen commented on LUCENE-6837:


Tokenizing Japanese Wikipedia seems fine with nBestCost set, but it seems like 
random-blasting doesn't pass.

Konno-san, I'm wondering if I can trouble you to look into why 
{{testRandomHugeStrings}} fails with the latest patch?

The test basically does random-blasting with nBestCost set to 2000.  I think 
it's a good idea that we fix this before we commit.  I believe it's easily 
reproducible, but I used

{noformat}
ant test  -Dtestcase=TestJapaneseTokenizer -Dtests.method=testRandomHugeStrings 
-Dtests.seed=99EB179B92E66345 -Dtests.slow=true -Dtests.locale=sr_CS 
-Dtests.timezone=PNT -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
{noformat}

in my environment.

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-18 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-6837:
---
Attachment: LUCENE-6837.patch

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Updated] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-08 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-6837:
---
Attachment: LUCENE-6837.patch

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-11-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995604#comment-14995604
 ] 

Christian Moen commented on LUCENE-6837:


I've attached a new patch with some minor changes:

* Made the {{System.out.printf}} calls subject to VERBOSE being true
* Introduced RuntimeException to deal with the initialization error cases
* Renamed the new parameters to {{nBestCost}} and {{nBestExamples}}
* Added additional javadoc here and there to document the new functionality

I'm planning on running some stability tests with the new tokenizer parameters 
next.
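
For reference, a minimal sketch of how the renamed parameters would be used through the factory; the parameter names come from the list above, and the rest is standard factory wiring:

{noformat}
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class NBestFactoryDemo {
  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put("mode", "search");
    // Parameter name from the patch above; a positive nBestCost widens the
    // lattice so that near-best segmentations are emitted as extra tokens.
    params.put("nBestCost", "2000");

    JapaneseTokenizerFactory factory = new JapaneseTokenizerFactory(params);
    Tokenizer tokenizer = factory.create(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY);
    System.out.println(tokenizer.getClass().getSimpleName());  // JapaneseTokenizer
  }
}
{noformat}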

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch, LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Assigned] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-10-28 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-6837:
--

Assignee: Christian Moen

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Assignee: Christian Moen
>Priority: Minor
> Attachments: LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-10-28 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978156#comment-14978156
 ] 

Christian Moen commented on LUCENE-6837:


Thanks a lot for this, Konno-san.  Very nice work!  I like the idea to 
calculate the n-best cost using examples.

Since search mode and extended mode solve a similar problem, I'm 
wondering if it makes sense to introduce n-best as a separate mode in itself.  
In your experience developing the feature, do you think it makes a lot of 
sense to use it with search and extended mode?

I think I'm in favour of supporting it for all the modes, even though it 
perhaps makes the most sense for normal mode.  The reason for this is to make 
sure that the entire API for {{JapaneseTokenizer}} is functional for all the 
tokenizer modes.

I'll add a few tests and I'd like to commit this soon.

> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Priority: Minor
> Attachments: LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Commented] (LUCENE-6837) Add N-best output capability to JapaneseTokenizer

2015-10-13 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954487#comment-14954487
 ] 

Christian Moen commented on LUCENE-6837:


Thanks.  I've had a very quick look at the code and have some comments and 
questions.  I'm happy to take care of this, Koji.


> Add N-best output capability to JapaneseTokenizer
> -
>
> Key: LUCENE-6837
> URL: https://issues.apache.org/jira/browse/LUCENE-6837
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 5.3
>Reporter: KONNO, Hiroharu
>Priority: Minor
> Attachments: LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search result. N-best output 
> is more meaningful than character N-gram, and it increases hit count too.
> If you use N-best output, you can get decompounded tokens (ex: 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overwrapped tokens (ex: "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.






[jira] [Commented] (LUCENE-6733) Incorrect URL causes build break - analysis/kuromoji

2015-08-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692824#comment-14692824
 ] 

Christian Moen commented on LUCENE-6733:


Thanks, I'll have a look.

 Incorrect URL causes build break - analysis/kuromoji
 

 Key: LUCENE-6733
 URL: https://issues.apache.org/jira/browse/LUCENE-6733
 Project: Lucene - Core
  Issue Type: Bug
  Components: general/build, modules/analysis
Affects Versions: 5.2.1
 Environment: n/a
Reporter: Susumu Fukuda
Priority: Minor
 Attachments: LUCENE-6733.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Ivy.xml contains dictionary URLs for both IPADIC and NAIST-JDIC.
 But they are already gone; they no longer exist. So the download-dict task 
 causes a build break.
 Google Code will be closed soon, and SourceForge (.jp, not .net) has moved to 
 osdn.jp.
 Hmm... not sure how I can attach a patch file; I can't find a field. Later?






[jira] [Resolved] (LUCENE-6468) Empty kuromoji user dictionary - NPE

2015-05-11 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen resolved LUCENE-6468.

   Resolution: Fixed
Fix Version/s: 5.x
   Trunk

 Empty kuromoji user dictionary - NPE
 -

 Key: LUCENE-6468
 URL: https://issues.apache.org/jira/browse/LUCENE-6468
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Christian Moen
 Fix For: Trunk, 5.x

 Attachments: LUCENE-6468.patch


 Kuromoji user dictionary takes a Reader and allows comments and other lines 
 to be ignored. But if it's empty in the sense of having no actual entries, the 
 returned FST will be null, and it will throw a confusing NPE.
 JapaneseTokenizer and JapaneseAnalyzer APIs already treat a null UserDictionary 
 as having none at all, so I think the best fix is to change the UserDictionary 
 API from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
 and return null if the FST is empty.
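
With the fix described above, usage would look roughly like this (a sketch; the only assumption beyond the description is the demo wiring):

{noformat}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class EmptyUserDictionaryDemo {
  public static void main(String[] args) throws IOException {
    // Comments and blank lines only -- no actual entries.
    String empty = "# just a comment\n\n";
    // After the fix, open() returns null instead of handing back an empty FST.
    UserDictionary userDict = UserDictionary.open(new StringReader(empty));
    System.out.println(userDict);  // null
    // JapaneseTokenizer already treats a null user dictionary as "none at all".
    Tokenizer tokenizer =
        new JapaneseTokenizer(userDict, true, JapaneseTokenizer.Mode.SEARCH);
    tokenizer.close();
  }
}
{noformat}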






[jira] [Commented] (LUCENE-6468) Empty kuromoji user dictionary - NPE

2015-05-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537767#comment-14537767
 ] 

Christian Moen commented on LUCENE-6468:


Thanks, Ohtani-san!

I added a {{final}} that is required for {{branch_5x}} on JDK 1.7 and also 
changed the empty user dictionary test to contain a user dictionary with a 
comment and some newlines (it's still empty, though).

I've committed your patch to {{trunk}} and {{branch_5x}}.


 Empty kuromoji user dictionary - NPE
 -

 Key: LUCENE-6468
 URL: https://issues.apache.org/jira/browse/LUCENE-6468
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Christian Moen
 Attachments: LUCENE-6468.patch


 Kuromoji user dictionary takes Reader and allows for comments and other lines 
 to be ignored. But if its empty in the sense of no actual entries, the 
 returned FST will be null, and it will throw a confusing NPE.
 JapaneseTokenizer and JapaneseAnalyzer apis already treat null UserDictionary 
 as having none at all, so I think the best fix is to fix the UserDictionary 
 api from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
 and return null if the FST is empty.






[jira] [Assigned] (LUCENE-6468) Empty kuromoji user dictionary - NPE

2015-05-10 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-6468:
--

Assignee: Christian Moen

 Empty kuromoji user dictionary - NPE
 -

 Key: LUCENE-6468
 URL: https://issues.apache.org/jira/browse/LUCENE-6468
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Christian Moen
 Attachments: LUCENE-6468.patch


 Kuromoji user dictionary takes Reader and allows for comments and other lines 
 to be ignored. But if its empty in the sense of no actual entries, the 
 returned FST will be null, and it will throw a confusing NPE.
 JapaneseTokenizer and JapaneseAnalyzer apis already treat null UserDictionary 
 as having none at all, so I think the best fix is to fix the UserDictionary 
 api from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
 and return null if the FST is empty.






[jira] [Commented] (LUCENE-6468) Empty kuromoji user dictionary - NPE

2015-05-07 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532226#comment-14532226
 ] 

Christian Moen commented on LUCENE-6468:


Good catch.  I can look into a patch for this.

 Empty kuromoji user dictionary - NPE
 -

 Key: LUCENE-6468
 URL: https://issues.apache.org/jira/browse/LUCENE-6468
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir

 Kuromoji user dictionary takes Reader and allows for comments and other lines 
 to be ignored. But if its empty in the sense of no actual entries, the 
 returned FST will be null, and it will throw a confusing NPE.
 JapaneseTokenizer and JapaneseAnalyzer apis already treat null UserDictionary 
 as having none at all, so I think the best fix is to fix the UserDictionary 
 api from UserDictionary(Reader) to UserDictionary.open(Reader) or similar, 
 and return null if the FST is empty.






[jira] [Created] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream

2015-02-03 Thread Christian Moen (JIRA)
Christian Moen created LUCENE-6216:
--

 Summary: Make it easier to modify Japanese token attributes 
downstream
 Key: LUCENE-6216
 URL: https://issues.apache.org/jira/browse/LUCENE-6216
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Christian Moen
Priority: Minor


Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
{{BaseFormAttribute}}, etc. get their values from a 
{{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  This 
makes it cumbersome to change these token attributes later on in the analysis 
chain since the {{Token}} instances are difficult to instantiate (sort of 
read-only objects).

I've run into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would 
be appropriate to update token attributes to also reflect Japanese number 
normalization.

I think it might be more practical to allow setting a specific value for these 
token attributes directly rather than through a {{Token}}, since it makes the 
APIs simpler, allows for easier changing of attributes downstream, and also 
makes supporting additional dictionaries easier.

The drawback of this approach that I can think of is a performance hit, as we 
will miss out on the inherent lazy retrieval of these token attributes from the 
{{Token}} object (and the underlying dictionary/buffer).

I'd like to do some testing to better understand the performance impact of this 
change. Happy to hear your thoughts on this.







[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-02-03 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

Minor updates to javadoc.

I'll leave reading attributes, etc. unchanged for now and get back to resolving 
this once we have better mechanisms in place for updating some of the Japanese 
token attributes downstream.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Fix For: 5.1

 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Commented] (LUCENE-6216) Make it easier to modify Japanese token attributes downstream

2015-02-03 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304342#comment-14304342
 ] 

Christian Moen commented on LUCENE-6216:


Thanks, Robert.

I had the same idea and I tried this out last night.  The advantage of the 
approach is that we only read the buffer data for the token attributes we use, 
but it leaves the API slightly awkward in my opinion, since we would have 
both a {{setToken()}} and a {{setPartOfSpeech()}}.  That said, this is still 
perhaps the best way to go, for performance reasons and because these APIs are 
very low-level and not commonly used.

For the sake of exploring an alternative idea: a different approach could be to 
have separate token filters set these attributes.  The tokenizer would set a 
{{CharTermAttribute}}, etc. and a {{JapaneseTokenAttribute}} (or something 
suitably named) that holds the {{Token}}.  A separate 
{{JapanesePartOfSpeechFilter}} would be responsible for setting the 
{{PartOfSpeechAttribute}} by getting the data from the 
{{JapaneseTokenAttribute}} using a {{getToken()}} method. We'd still need logic 
similar to the above to deal with {{setPartOfSpeech()}}, etc. so I don't think 
we gain anything by taking this approach, and it's a big change, too.
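
A sketch of what the first approach would look like on the attribute interface; {{getPartOfSpeech()}} and {{setToken()}} mirror the existing attribute, while {{setPartOfSpeech()}} is the proposed addition:

{noformat}
import org.apache.lucene.analysis.ja.Token;
import org.apache.lucene.util.Attribute;

// Sketch of the discussed API shape -- not a committed interface.
interface PartOfSpeechAttributeSketch extends Attribute {
  String getPartOfSpeech();

  /** Current style: the value is read lazily from the dictionary-backed Token. */
  void setToken(Token token);

  /** Proposed addition: lets downstream filters override the value directly. */
  void setPartOfSpeech(String partOfSpeech);
}
{noformat}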

 Make it easier to modify Japanese token attributes downstream
 -

 Key: LUCENE-6216
 URL: https://issues.apache.org/jira/browse/LUCENE-6216
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Christian Moen
Priority: Minor

 Japanese-specific token attributes such as {{PartOfSpeechAttribute}}, 
 {{BaseFormAttribute}}, etc. get their values from a 
 {{org.apache.lucene.analysis.ja.Token}} through a {{setToken()}} method.  
 This makes it cumbersome to change these token attributes later on in the 
 analysis chain since the {{Token}} instances are difficult to instantiate 
 (sort of read-only objects).
 I've ran into this issue in LUCENE-3922 (JapaneseNumberFilter) where it would 
 be appropriate to update token attributes to also reflect Japanese number 
 normalization.
 I think it might be more practical to allow setting a specific value for 
 these token attributes directly rather than through a {{Token}} since it 
 makes the APIs simpler, allows for easier changing attributes downstream, and 
 also supporting additional dictionaries easier.
 The drawback with the approach that I can think of is a performance hit as we 
 will miss out on the inherent lazy retrieval of these token attributes from 
 the {{Token}} object (and the underlying dictionary/buffer).
 I'd like to do some testing to better understand the performance impact of 
 this change. Happy to hear your thoughts on this.






[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-02-02 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Fix Version/s: 5.1

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Fix For: 5.1

 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-02-02 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

Updated patch with decimal number support and additional javadoc; the test code 
now makes precommit happy.

Token attributes such as part-of-speech, readings, etc. for the normalized 
token are currently inherited from the last token used when composing the 
normalized number. Since these values are likely to be wrong, I'm inclined to 
set these attributes to null or a reasonable default.

I'm very happy to hear your thoughts on this.



 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Fix For: 5.1

 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-01-28 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296379#comment-14296379
 ] 

Christian Moen commented on LUCENE-3922:


Please feel free to test it.  Feedback is very welcome.

The patch is against {{trunk}} and this should make it into 5.1.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-01-28 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

New patch with CHANGES.txt and services entry.

Will do some end-to-end testing next.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2015-01-21 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

Added factory and wrote javadoc.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-16 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-16 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173567#comment-14173567
 ] 

Christian Moen commented on LUCENE-3922:


Gaute and I have done testing on real-world data, and we've uncovered and 
fixed a couple of corner-case issues.

Our todo items are as follows:

# Do additional testing and possibly add more number formats
# Document some unsupported cases in unit-tests
# Add class-level javadoc
# Add a Solr factory



 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch, 
 LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-09 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---
Attachment: LUCENE-3922.patch

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-10-09 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164954#comment-14164954
 ] 

Christian Moen commented on LUCENE-3922:


I've attached a new patch.

The {{checkRandomData}} issues were caused by improper handling of token 
composition for graphs (bug found by [~gaute]). Tokens preceded by a token with 
position increment zero are left untouched, and so are stacked/synonym tokens.

We'll do some more testing and add some documentation before we move forward to 
commit this.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  






[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-08-05 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---

Attachment: LUCENE-3922.patch

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need a capability to normalize to Kanji 
 numerals).
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2014-08-05 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14085909#comment-14085909
 ] 

Christian Moen commented on LUCENE-3922:


Gaute and I have been doing some work on this, and we have rewritten it as a 
{{TokenFilter}}.

A few comments:

* We have added support for numbers such as 3.2兆円 as you requested, Kazu.
* We could potentially use a POS-tag attribute from Kuromoji to identify the numbers 
that we are composing, but perhaps not relying on POS tags also makes this filter 
useful in the case of n-gramming.
* We haven't implemented any of the anchoring logic discussed above, i.e. whether 
to restrict normalization to prices, etc. Is this useful to have?
* Input such as {{1,5}} becomes {{15}} after normalization, which could be 
undesired. Is this bad input or do we want anchoring to retain these numbers?

One thing, though: in order to support some of this number parsing, i.e. cases 
such as 3.2兆円, we need to use Kuromoji in a mode that retains punctuation 
characters.
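
For illustration, such a chain could be wired roughly like this (a sketch only, assuming a recent Lucene where the filter ended up as {{JapaneseNumberFilter}} and the tokenizer takes a {{discardPunctuation}} flag; constructor signatures differ between releases):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public final class KanjiNumberAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // discardPunctuation = false: the decimal point in 3.2兆円 must survive
    // tokenization so that the number filter can parse it.
    Tokenizer tokenizer =
        new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseNumberFilter(tokenizer);
    return new TokenStreamComponents(tokenizer, stream);
  }
}
{code}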

There's also an unresolved issue found by {{checkRandomData}} that we haven't 
tracked down and fixed, yet.

This is a work in progress and feedback is welcome.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch, LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho) and 
 十二月 (December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need a capability to normalize to Kanji 
 numerals).
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2014-02-13 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13901180#comment-13901180
 ] 

Christian Moen commented on SOLR-1301:
--

I've been reading through (pretty much all) the comments on this JIRA and I'd 
like to thank you all for the great effort you have put into this.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
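
To make the converter idea described above concrete, a hypothetical sketch (the real SolrDocumentConverter interface is defined by the patch and may differ; the field names are made up):

{code:java}
import java.util.Collection;
import java.util.Collections;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical converter: turns one reduce-side (key, value) pair -- here a line
// offset and a CSV line -- into the SolrInputDocument(s) that SolrRecordWriter
// would feed to the EmbeddedSolrServer.
public class CsvLineConverter {
  public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
    String[] fields = value.toString().split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", fields[0]);                              // made-up field
    doc.addField("title", fields.length > 1 ? fields[1] : "");  // made-up field
    return Collections.singletonList(doc);
  }
}
{code}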



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820093#comment-13820093
 ] 

Christian Moen commented on LUCENE-2899:


bq. Stuff like this NER should NOT be in the analysis chain. as i said, its 
more useful in the document build phase anyway.

+1

Benson, as far as I understand, ES doesn't have the concept by design.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposes capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-16 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13796545#comment-13796545
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung, I've committed the latest changes we merged in Seoul on Monday.  It would 
be great if you could fix the decompounding issue we came across, for which we 
disabled a test.

Uwe, +1 to use {{Class#getResourceAsStream}} and remove {{FileUtils}} and 
{{JarResources}}.  I'll make these changes and commit to the branch.
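
For example, something along these lines (class and resource names are only placeholders):

{code:java}
import java.io.IOException;
import java.io.InputStream;

public final class DictionaryResources {
  // Load a dictionary bundled inside the jar, resolved relative to this class's
  // package, instead of going through file-system helpers.
  public static InputStream open(String resourceName) throws IOException {
    InputStream in = DictionaryResources.class.getResourceAsStream(resourceName);
    if (in == null) {
      throw new IOException("Dictionary resource not found: " + resourceName);
    }
    return in;
  }
}
{code}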

Overall, I think there are a lot of things we can do to improve this code.  I would 
very much like to hear your opinion on what we should fix before committing to 
trunk, getting this onto the 4.x branch and improving from there.  My thinking 
is that it might be good to get this committed so we'll have Korean working 
even though the code needs some work.  SooMyung has a community in Korea that 
uses it, and it's serving their needs as far as I understand.

Happy to hear people's opinion on this.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
 LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794046#comment-13794046
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung and I met up in Seoul today, and we've merged his latest changes locally.  
I'll commit the changes to this branch when I'm back in Tokyo, and SooMyung will 
follow up with a fix for a known issue afterwards.  Hopefully we can commit this 
to trunk very soon.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
 LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789052#comment-13789052
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung,

The patch you uploaded on September 11th, was that made against the latest 
{{lucene4956}} branch?

The patch doesn't apply cleanly against {{lucene4956}} for me.  Could you 
clarify its origin and explain how it can be applied?  If you can make a 
patch against the code on {{lucene4956}}, that would be much appreciated.

Thanks!

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
 LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789944#comment-13789944
 ] 

Christian Moen commented on LUCENE-4956:


Thanks a lot.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
 LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-04 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13786905#comment-13786905
 ] 

Christian Moen commented on LUCENE-4956:


Thanks for pushing me on this.  I'll have a look at your recent changes and 
commit to trunk shortly if everything seems fine.  I hope to have this 
committed to trunk early next week.  Sorry that this has dragged out.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, 
 LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5244) NPE in Japanese Analyzer

2013-09-25 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777560#comment-13777560
 ] 

Christian Moen commented on LUCENE-5244:


Hello Benson,

In your code on GitHub, try calling {{tokenStream.reset()}} before consuming the stream.
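
For reference, the standard consumption pattern looks like this: {{reset()}} before the first {{incrementToken()}}, then {{end()}} and {{close()}} when done (on 4.x the analyzer constructor also takes a {{Version}}):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumeTokenStream {
  public static void main(String[] args) throws Exception {
    JapaneseAnalyzer analyzer = new JapaneseAnalyzer();  // pass a Version on 4.x
    TokenStream ts = analyzer.tokenStream("body", new StringReader("日本語のテキストです。"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();                       // the missing call that triggers the NPE
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
{code}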

 NPE in Japanese Analyzer
 

 Key: LUCENE-5244
 URL: https://issues.apache.org/jira/browse/LUCENE-5244
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 4.4
Reporter: Benson Margulies

 I've got a test case that shows an NPE with the Japanese analyzer.
 It's all available in https://github.com/benson-basis/kuromoji-npe, and I 
 explicitly grant a license to the Foundation.
 If anyone would prefer that I attach a tarball here, just let me know.
 {noformat}
 ---
  T E S T S
 ---
 Running com.basistech.testcase.JapaneseNpeTest
 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.298 sec  
 FAILURE! - in com.basistech.testcase.JapaneseNpeTest
 japaneseNpe(com.basistech.testcase.JapaneseNpeTest)  Time elapsed: 0.282 sec  
  ERROR!
 java.lang.NullPointerException: null
   at 
 org.apache.lucene.analysis.util.RollingCharBuffer.get(RollingCharBuffer.java:86)
   at 
 org.apache.lucene.analysis.ja.JapaneseTokenizer.parse(JapaneseTokenizer.java:618)
   at 
 org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:468)
   at 
 com.basistech.testcase.JapaneseNpeTest.japaneseNpe(JapaneseNpeTest.java:28)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739301#comment-13739301
 ] 

Christian Moen commented on LUCENE-4956:


I've now aligned the branch with {{trunk}}, updated the example {{schema.xml}} 
to use {{text_ko}} naming for the Korean field type.

I've also indexed Korean Wikipedia continuously for a few hours and the JVM 
heap looks fine.

There are several additional things that can be done with this code, including 
generating the parser using JFlex at build time, fixing some of the position 
issues found by random-blasting, cleanups and dead-code removal, etc.  This said, I 
believe the code we have is useful to Korean users as-is and I'm thinking it's 
a good idea to integrate it into {{trunk}} and iterate further from there.

Please share your thoughts.  Thanks.


 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-4956:
---

Attachment: LUCENE-4956.patch

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739387#comment-13739387
 ] 

Christian Moen commented on LUCENE-4956:


Attaching a patch against {{trunk}} (r1513348).


 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-08-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739392#comment-13739392
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung, let's sync up regarding your latest changes (the patch you attached). 
 I'm thinking perhaps we can merge to {{trunk}} first and iterate from there.  
Thanks.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch, LUCENE-4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-07-09 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704147#comment-13704147
 ] 

Christian Moen commented on LUCENE-4956:


Hello SooMyung,

I'm the one who hasn't followed up properly on this, as I've been too bogged 
down with other things.  I've set aside time next week to work on this and I 
hope to have Korean merged and integrated with {{trunk}} then.  I'm not sure we 
can make 4.4, but I'm willing to put in extra effort if there's a chance we can 
get it in in time.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken

2013-06-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13693894#comment-13693894
 ] 

Christian Moen commented on SOLR-4945:
--

Hello Shruthi,

Could you confirm if you see this problem when using 
{{JapaneseTokenizerFactory}}?  

{{SenTokenizerFactory}} isn't part of Solr and if you are seeing funny offsets 
there, that could be the root cause of this.  This is my speculation only -- I 
really don't know...

I believe {{JapaneseTokenizerFactory}} in normal mode gives a similar 
segmentation to {{SenTokenizer}} and it would be good to see if we can 
reproduce this using {{JapaneseTokenizerFactory}}.
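
For what it's worth, a minimal sketch for checking the offsets the tokenizer itself reports (this assumes the 4.x constructor taking a Reader, a UserDictionary, a discardPunctuation flag and a Mode; adjust for your version):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetCheck {
  public static void main(String[] args) throws Exception {
    String text = "この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。";
    JapaneseTokenizer tok = new JapaneseTokenizer(
        new StringReader(text), null, true, JapaneseTokenizer.Mode.NORMAL);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = tok.addAttribute(OffsetAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // Print each term with its start/end offsets to compare against what the
      // highlighter receives.
      System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
    }
    tok.end();
    tok.close();
  }
}
{code}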

Many thanks.

 Japanese Autocomplete and Highlighter broken
 

 Key: SOLR-4945
 URL: https://issues.apache.org/jira/browse/SOLR-4945
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Reporter: Shruthi Khatawkar

 Autocomplete is implemented with Highlighter functionality. This works fine 
 for most languages but breaks for Japanese.
 multiValued, termVectors, termPositions and termOffsets are set to true.
 Here is an example:
 Query: product classic.
 Result:
 Actual: 
 この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic 
 Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
 With Highlighter (<em> </em> tags being used):
 この商品の互換性の機種<em>にproduct</em> 1 <em>やclassic</em> Touch2 が記載が有りません。 
 USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
 Though the query terms product classic appear twice, highlighting 
 happens only on the first instance, as shown above.
 Solr returns only the first instance's offsets; the second instance is ignored.
 It is also observed that the highlighter repeats the first letter of the token 
 if the token contains a numeric character.
 For example, for the query product and the text product1, the highlighter 
 returns p<em>product</em>1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken

2013-06-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694148#comment-13694148
 ] 

Christian Moen commented on SOLR-4945:
--

No, it's not. {{JapaneseTokenizerFactory}} is available in 3.6 or newer.  
Kindly upgrade to the latest version of Solr (currently 4.3.1) and see if the 
problem persists.  If it does, please indicate how you reproduced it in detail 
so we can start investigating the cause.  Thanks.

 Japanese Autocomplete and Highlighter broken
 

 Key: SOLR-4945
 URL: https://issues.apache.org/jira/browse/SOLR-4945
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Reporter: Shruthi Khatawkar

 Autocomplete is implemented with Highlighter functionality. This works fine 
 for most languages but breaks for Japanese.
 multiValued, termVectors, termPositions and termOffsets are set to true.
 Here is an example:
 Query: product classic.
 Result:
 Actual: 
 この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic 
 Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
 With Highlighter (<em> </em> tags being used):
 この商品の互換性の機種<em>にproduct</em> 1 <em>やclassic</em> Touch2 が記載が有りません。 
 USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
 Though the query terms product classic appear twice, highlighting 
 happens only on the first instance, as shown above.
 Solr returns only the first instance's offsets; the second instance is ignored.
 It is also observed that the highlighter repeats the first letter of the token 
 if the token contains a numeric character.
 For example, for the query product and the text product1, the highlighter 
 returns p<em>product</em>1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4945) Japanese Autocomplete and Highlighter broken

2013-06-21 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690131#comment-13690131
 ] 

Christian Moen commented on SOLR-4945:
--

Hello Shruthi,

Does this have anything to do with autocomplete or is this solely a 
highlighting issue?  Which field type are you using?  Are you using 
JapaneseTokenizer as part of this field type with search mode turned on?  
Thanks.


 Japanese Autocomplete and Highlighter broken
 

 Key: SOLR-4945
 URL: https://issues.apache.org/jira/browse/SOLR-4945
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Reporter: Shruthi Khatawkar

 Autocomplete is implemented with Highlighter functionality. This works fine 
 for most languages but breaks for Japanese.
 multiValued, termVectors, termPositions and termOffsets are set to true.
 Here is an example:
 Query: product classic.
 Result:
 Actual: 
 この商品の互換性の機種にproduct 1 やclassic Touch2 が記載が有りません。 USB接続ケーブルをproduct 1 やclassic 
 Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
 With Highlighter (<em> </em> tags being used):
 この商品の互換性の機種<em>にproduct</em> 1 <em>やclassic</em> Touch2 が記載が有りません。 
 USB接続ケーブルをproduct 1 やclassic Touch2に付属の物を使えば利用出来ると思いますが 間違っていますか?
 Though the query terms product classic appear twice, highlighting 
 happens only on the first instance, as shown above.
 Solr returns only the first instance's offsets; the second instance is ignored.
 It is also observed that the highlighter repeats the first letter of the token 
 if the token contains a numeric character.
 For example, for the query product and the text product1, the highlighter 
 returns p<em>product</em>1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13664223#comment-13664223
 ] 

Christian Moen commented on LUCENE-4956:


I'm happy to take care of this unless you want to do it, Steve.  I can do this 
either tomorrow or on Friday.  Thanks.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13664228#comment-13664228
 ] 

Christian Moen commented on LUCENE-4956:


Thanks a lot!

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar, lucene4956.patch


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5013) ScandinavianInterintelligableASCIIFoldingFilter

2013-05-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13664383#comment-13664383
 ] 

Christian Moen commented on LUCENE-5013:


bq. Though maybe the name could be simpler, (ScandinavianNormalizationFilter?)

+1

 ScandinavianInterintelligableASCIIFoldingFilter
 ---

 Key: LUCENE-5013
 URL: https://issues.apache.org/jira/browse/LUCENE-5013
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.3
Reporter: Karl Wettin
Priority: Trivial
 Attachments: LUCENE-5013.txt


 This filter is an augmentation of the output from ASCIIFoldingFilter: 
 it discriminates against the double vowels aa, ae, ao, oe and oo, leaving just the 
 first one.
 blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
 räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
 Caveats:
 Since this is a filtering on top of ASCIIFoldingFilter, äöåøæ has already been 
 folded down to aoaoae by the time it is handled by this filter, which will cause 
 effects such as:
 bøen - boen - bon
 åene - aene - ane
 I find this to be a trivial problem compared to not finding anything at all.
 Background:
 Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus 
 interchangeable when used between these languages. They are, however, folded 
 differently when people type them on a keyboard lacking these characters, and 
 ASCIIFoldingFilter handles ä and æ differently.
 When a Swedish person is lacking umlauted characters on the keyboard they 
 consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, 
 a, o.
 In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use 
 a, a, o. I've also seen oo, ao, etc., and permutations. Not sure about Denmark, 
 but the pattern is probably the same.
 This filter solves that problem, but might also cause new ones.
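
For illustration, a toy version of the collapsing described above (not the attached patch), applied to already ASCII-folded terms:

{code:java}
public final class DoubleVowelCollapse {
  // Collapse the double vowels aa, ae, ao, oe and oo down to their first letter,
  // as described in the issue.
  static String collapse(String asciiFolded) {
    return asciiFolded
        .replace("aa", "a")
        .replace("ae", "a")
        .replace("ao", "a")
        .replace("oe", "o")
        .replace("oo", "o");
  }

  public static void main(String[] args) {
    System.out.println(collapse("blaabaarsyltetoej"));  // blabarsyltetoj
    System.out.println(collapse("raeksmoergaas"));      // raksmorgas
  }
}
{code}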

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-17 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13660528#comment-13660528
 ] 

Christian Moen commented on LUCENE-4956:


I've run {{KoreanAnalyzer}} on Korean Wikipedia and also had a look at 
memory/heap usage.  Things look okay overall.

I believe {{KoreanFilter}} uses wrong offsets for synonym tokens, which was 
discovered by random-blasting.  Looking into the issue...

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656901#comment-13656901
 ] 

Christian Moen commented on LUCENE-4956:


Thanks, Steve & co.!

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-14 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656919#comment-13656919
 ] 

Christian Moen commented on LUCENE-4956:


Hello SooMyung,

Thanks for the above regarding field type.  The general approach we have taken 
in Lucene is to do the same analysis at both index and query side.  For 
example, the Japanese analyzer also has functionality to do compound splitting 
and we've discussed doing this on the index side only by default for field 
type {{text_ja}}, but we decided against it.

I've included your field type in the latest code I've checked in just now, but 
it's likely that we will change this in the future.

I'm wondering if you could help me with a few sample sentences that illustrate 
the various options {{KoreanFilter}} has.  I'd like to add some test-cases for 
these to better understand the differences between them and to verify correct 
behaviour.  Test-cases are also a useful way to document functionality 
in general.  Thanks for any help with this!

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-08 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13651826#comment-13651826
 ] 

Christian Moen commented on LUCENE-4956:


Updates:

* Added {{text_kr}} field type to {{schema.xml}}
* Fixed Solr factories to load field type {{text_kr}} in the example
* Updated javadoc so that it compiles cleanly (mostly removed illegal javadoc)
* Updated various build things related to including Korean in the Solr 
distribution
* Added placeholder stopwords file
* Added services for arirang

Korean analysis using field type {{text_kr}} seems to be doing the right thing 
out-of-the-box now, but some configuration options in the factories aren't 
working yet.  There are several other things that need polishing up, but 
we're making progress.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13649833#comment-13649833
 ] 

Christian Moen commented on LUCENE-4956:


bq. I think we're ready for the incubator-general vote. [~cm], do you agree?

+1 

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-05 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13649296#comment-13649296
 ] 

Christian Moen commented on LUCENE-4956:


Thanks, Steve.  I've added the missing license header to 
{{TestKoreanAnalyzer.java}}.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-05 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13649379#comment-13649379
 ] 

Christian Moen commented on LUCENE-4956:


Good points, Uwe.  I'll look into this.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-04 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13649125#comment-13649125
 ] 

Christian Moen commented on LUCENE-4956:


A quick status update on my side is as follows:

I've put the code into a module called {{arirang}} on my local setup and 
made a few changes necessary to make things work on {{trunk}}. 
{{KoreanAnalyzer}} now produces Korean tokens, and some tests I've made pass 
when run from my IDE.

Loading the dictionaries as resources needs some work, and I'll spend time on 
this during the weekend.  I'll also address the headers, etc. to prepare for 
the incubator-general vote.

Hopefully, I'll have all this on a branch this weekend.  I'll keep you posted 
and we can take things from there.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-04 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-4956:
--

Assignee: Christian Moen

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristics. When developing a search service 
 with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, it is a 
 good idea to choose the Korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-04 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13649258#comment-13649258
 ] 

Christian Moen commented on LUCENE-4956:


Hello SooMyung,

Could you comment about the origins and authorship of 
{{org.apache.lucene.analysis.kr.utils.StringUtil}} in your tar file?

I'm seeing a lot of authors in this file. Is this from Apache Commons Lang?  
Thanks!

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 The Korean language has specific characteristics. When developing a search 
 service with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, the Korean 
 analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-05-01 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646960#comment-13646960
 ] 

Christian Moen commented on LUCENE-4956:


SooMyung, I don't think you need to do anything at this point.  I think a good 
next step is that we create a new branch and check the code you have submitted 
onto that branch.  We can then start looking into addressing the headers and 
other items that people have pointed out in comments.  (Thanks, Jack and 
Edward!)

Steve, will there be a vote after the code has been checked onto the branch?  
If you think the above is a good next step, I'm happy to start working on this 
either later this week or next week.  Kindly let me know how you prefer to 
proceed.  Thanks.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 The Korean language has specific characteristics. When developing a search 
 service with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, the Korean 
 analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-04-27 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643901#comment-13643901
 ] 

Christian Moen commented on LUCENE-4956:


The Korean analyzer should be named 
{{org.apache.lucene.analysis.kr.KoreanAnalyzer}} and we'll provide a 
ready-to-use field type {{text_kr}} in {{schema.xml}} for Solr users, which is 
consistent with what we do for other languages.
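
To make the naming concrete, here is a minimal usage sketch of how the analyzer would be 
consumed once it lands under that package.  {{KoreanAnalyzer}} and its no-argument 
constructor are the proposed names from this comment, not classes that exist in the tree 
yet; the rest is the usual Lucene {{TokenStream}} consumption pattern.

{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.kr.KoreanAnalyzer;  // proposed class, not yet in the tree
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KoreanAnalyzerSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical no-arg constructor; the final class may take a Version or a stopword set.
    Analyzer analyzer = new KoreanAnalyzer();
    TokenStream ts = analyzer.tokenStream("body", new StringReader("루씬은 뛰어난 검색 라이브러리입니다"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // print each Korean token
    }
    ts.end();
    ts.close();
  }
}
{noformat}

The {{text_kr}} field type would then simply wire the same tokenizer and filters up in 
{{schema.xml}}.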

As for where the analyzer code itself lives, I think it's fine to put it in 
{{lucene/analysis/arirang}}.  The file {{lucene/analysis/README.txt}} documents 
what these modules are and the code is easily and directly retrievable in IDEs 
by looking up {{KoreanAnalyzer}} (the source code paths will be set up by {{ant 
eclipse}} and {{ant idea}}).

One reason analyzers have not been put in {{lucene/analysis/common}} in the past 
is that they require dictionaries that are several megabytes in size.

Overall, I don't think the scheme we are using is all that problematic, but 
it's true that {{MorfologikAnalyzer}} and {{SmartChineseAnalyzer}} don't 
align with it.  The scheme doesn't easily lend itself to different 
implementations for one language, but that's not a common case today, although 
it might become more common in the future.

In the case of Norwegian (no), there are ISO language codes for both Bokmål 
(nb) and Nynorsk (nn), and one way of supporting this is also to consider these 
as options to {{NorwegianAnalyzer}} since both languages are Norwegian.  See 
SOLR-4565 for thoughts on how to extend support in 
{{NorwegianMinimalStemFilter}} for this.

A similar overall approach might make sense when there are multiple 
implementations of a language; end-users can use an analyzer named 
{{LanguageAnalyzer}} without having to study the differences in 
implementation before using it.  I also see problems with this, but it's just a 
thought...

I'm all for improving our scheme, but perhaps we can open up a separate JIRA 
for this and keep this one focused on Korean?





 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 The Korean language has specific characteristics. When developing a search 
 service with Lucene & Solr in Korean, there are some problems in searching and 
 indexing. The Korean analyzer solves these problems with a Korean morphological 
 analyzer. It consists of a Korean morphological analyzer, dictionaries, a 
 Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene 
 and Solr. If you develop a search service with Lucene in Korean, the Korean 
 analyzer is the best choice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

2013-04-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13640551#comment-13640551
 ] 

Christian Moen commented on LUCENE-4947:


Kevin,

I think it's best that you do the license change yourself and that we don't 
have any active role in making the change since you are the only person 
entitled to make the change.

This change can be done by using the below header on all the source code and 
other relevant text files:

{noformat}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
{noformat}

After this has been done, please make a tarball and attach it to this JIRA and 
indicate that this is the code you wish to grant and also inform us about the 
MD5 hash of the tarball.  (This will go into the IP-clearance document and will 
be used to identify the codebase.)

It's a good idea to also use this MD5 hash as part of Exhibit A in the 
[software-grant.txt|http://www.apache.org/licenses/software-grant.txt] 
agreement unless you have signed and submitted this already.  (If you donate 
the code yourself by attaching it to the JIRA as described above, I believe the 
hashes not being part of Exhibit A is acceptable.)
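
For reference, a minimal sketch of computing such an MD5 hash in Java (equivalent to 
running a command-line md5 tool over the tarball); the file name below is just a 
placeholder, not the actual attachment name:

{noformat}
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class Md5OfTarball {
  public static void main(String[] args) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    InputStream in = new FileInputStream("LevenshteinAutomaton.tar.gz");  // placeholder file name
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md5.update(buf, 0, n);  // feed the whole tarball through the digest
      }
    } finally {
      in.close();
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex);  // this is the hash to quote in the JIRA comment and Exhibit A
  }
}
{noformat}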

Please feel free to add your comments, Steve.


 Java implementation (and improvement) of Levenshtein & associated lexicon 
 automata
 --

 Key: LUCENE-4947
 URL: https://issues.apache.org/jira/browse/LUCENE-4947
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
Reporter: Kevin Lawson

 I was encouraged by Mike McCandless to open an issue concerning this after I 
 contacted him privately about it. Thanks Mike!
 I'd like to submit my Java implementation of the Levenshtein Automaton as a 
 homogenous replacement for the current heterogenous, multi-component 
 implementation in Lucene.
 Benefits of upgrading include 
 - Reduced code complexity
 - Better performance from components that were previously implemented in 
 Python
 - Support for on-the-fly dictionary-automaton manipulation (if you wish to 
 use my dictionary-automaton implementation)
 The code for all the components is well structured, easy to follow, and 
 extensively commented. It has also been fully tested for correct 
 functionality and performance.
 The levenshtein automaton implementation (along with the required MDAG 
 reference) can be found in my LevenshteinAutomaton Java library here: 
 https://github.com/klawson88/LevenshteinAutomaton.
 The minimalistic directed acyclic graph (MDAG) which the automaton code uses 
 to store and step through word sets can be found here: 
 https://github.com/klawson88/MDAG
 *Transpositions aren't currently implemented. I hope the comment-filled, 
 editing-friendly code combined with the fact that the section in the Mihov 
 paper detailing transpositions is only 2 pages makes adding the functionality 
 trivial.
 *As a result of support for on-the-fly manipulation, the MDAG 
 (dictionary-automaton) creation process incurs a slight speed penalty. In 
 order to have the best of both worlds, I'd recommend the addition of a 
 constructor which only takes sorted input. (The complete, easy-to-follow 
 pseudo-code for the simple procedure can be found in the first article I 
 linked under the references section in the MDAG repository.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-04-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641365#comment-13641365
 ] 

Christian Moen commented on LUCENE-4956:


Thanks again, SooMyung!

I'm seeing that Steven has informed you about the grant process on the mailing 
list.  I'm happy to also facilitate this process with Steven.

Looking forward to getting Korean supported.


 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
  Labels: newbie
 Attachments: kr.analyzer.4x.tar


 Korean language has specific characteristic. When developing search service 
 with lucene  solr in korean, there are some problems in searching and 
 indexing. The korean analyer solved the problems with a korean morphological 
 anlyzer. It consists of a korean morphological analyzer, dictionaries, a 
 korean tokenizer and a korean filter. The korean anlyzer is made for lucene 
 and solr. If you develop a search service with lucene in korean, It is the 
 best idea to choose the korean analyzer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

2013-04-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637943#comment-13637943
 ] 

Christian Moen commented on LUCENE-4947:


Thanks a lot for wishing to submit code!

It's not possible to include your code in Lucene if it has a GPL license.  
Quite frankly, I don't think Lucene committers can even have a look 
at it to consider it for inclusion while it has a GPL license.

If you have written all the code or otherwise own all copyrights, would you 
mind switching to Apache License 2.0?  That way, I at least think it would be 
possible to have a close look to see if this is a good fit for Lucene.

 Java implementation (and improvement) of Levenshtein & associated lexicon 
 automata
 --

 Key: LUCENE-4947
 URL: https://issues.apache.org/jira/browse/LUCENE-4947
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
Reporter: Kevin Lawson

 I was encouraged by Mike McCandless to open an issue concerning this after I 
 contacted him privately about it. Thanks Mike!
 I'd like to submit my Java implementation of the Levenshtein Automaton as a 
 homogenous replacement for the current heterogenous, multi-component 
 implementation in Lucene.
 Benefits of upgrading include 
 - Reduced code complexity
 - Better performance from components that were previously implemented in 
 Python
 - Support for on-the-fly dictionary-automaton manipulation (if you wish to 
 use my dictionary-automaton implementation)
 The code for all the components is well structured, easy to follow, and 
 extensively commented. It has also been fully tested for correct 
 functionality and performance.
 The levenshtein automaton implementation (along with the required MDAG 
 reference) can be found in my LevenshteinAutomaton Java library here: 
 https://github.com/klawson88/LevenshteinAutomaton.
 The minimalistic directed acyclic graph (MDAG) which the automaton code uses 
 to store and step through word sets can be found here: 
 https://github.com/klawson88/MDAG
 *Transpositions aren't currently implemented. I hope the comment-filled, 
 editing-friendly code combined with the fact that the section in the Mihov 
 paper detailing transpositions is only 2 pages makes adding the functionality 
 trivial.
 *As a result of support for on-the-fly manipulation, the MDAG 
 (dictionary-automaton) creation process incurs a slight speed penalty. In 
 order to have the best of both worlds, I'd recommend the addition of a 
 constructor which only takes sorted input. (The complete, easy-to-follow 
 pseudo-code for the simple procedure can be found in the first article I 
 linked under the references section in the MDAG repository.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4947) Java implementation (and improvement) of Levenshtein & associated lexicon automata

2013-04-22 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13638128#comment-13638128
 ] 

Christian Moen commented on LUCENE-4947:


It sounds proper to do a code grant also because the software currently has a 
GPL license.  Thanks for following up, Steve.

 Java implementation (and improvement) of Levenshtein & associated lexicon 
 automata
 --

 Key: LUCENE-4947
 URL: https://issues.apache.org/jira/browse/LUCENE-4947
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0, 4.1, 4.2, 4.2.1
Reporter: Kevin Lawson

 I was encouraged by Mike McCandless to open an issue concerning this after I 
 contacted him privately about it. Thanks Mike!
 I'd like to submit my Java implementation of the Levenshtein Automaton as a 
 homogenous replacement for the current heterogenous, multi-component 
 implementation in Lucene.
 Benefits of upgrading include 
 - Reduced code complexity
 - Better performance from components that were previously implemented in 
 Python
 - Support for on-the-fly dictionary-automaton manipulation (if you wish to 
 use my dictionary-automaton implementation)
 The code for all the components is well structured, easy to follow, and 
 extensively commented. It has also been fully tested for correct 
 functionality and performance.
 The levenshtein automaton implementation (along with the required MDAG 
 reference) can be found in my LevenshteinAutomaton Java library here: 
 https://github.com/klawson88/LevenshteinAutomaton.
 The minimalistic directed acyclic graph (MDAG) which the automaton code uses 
 to store and step through word sets can be found here: 
 https://github.com/klawson88/MDAG
 *Transpositions aren't currently implemented. I hope the comment-filled, 
 editing-friendly code combined with the fact that the section in the Mihov 
 paper detailing transpositions is only 2 pages makes adding the functionality 
 trivial.
 *As a result of support for on-the-fly manipulation, the MDAG 
 (dictionary-automaton) creation process incurs a slight speed penalty. In 
 order to have the best of both worlds, I'd recommend the addition of a 
 constructor which only takes sorted input. (The complete, easy-to-follow 
 pseudo-code for the simple procedure can be found in the first article I 
 linked under the references section in the MDAG repository.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3706) Ship setup to log with log4j.

2013-03-15 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13603516#comment-13603516
 ] 

Christian Moen commented on SOLR-3706:
--

{quote}
Mark, have you tried Logback? That's a good logging implementation; arguably a 
better one.
{quote}

David and Mark, I believe [Log4J 2|http://logging.apache.org/log4j/2.x/] 
addresses a lot of the weaknesses in Log4J 1.x also addressed by Logback.  
However, Log4J 2 hasn't been released yet.

To me it sounds like a good idea to use Log4J 1.x now and move to Log4J 2 in 
the future.

 Ship setup to log with log4j.
 -

 Key: SOLR-3706
 URL: https://issues.apache.org/jira/browse/SOLR-3706
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 4.3, 5.0

 Attachments: SOLR-3706-solr-log4j.patch


 Currently we default to java.util.logging and it's terrible in my opinion.
 *Its simple built-in logger is a 2-line logger.
 *You have to jump through hoops to use your own custom formatter with Jetty - 
 either putting your class in the start.jar or other pain-in-the-butt 
 solutions.
 *It can't roll files by date out of the box.
 I'm sure there are more issues, but those are the ones annoying me now. We 
 should switch to log4j - it's much nicer and it's easy to get a nice single 
 line format and roll by date, etc.
 If someone wants to use JUL they still can - but at least users could start 
 with something decent.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4407) SSL auth or basic auth in SolrCloud

2013-02-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13572507#comment-13572507
 ] 

Christian Moen commented on SOLR-4407:
--

I don't think this is a Solr issue, but it might be helpful to provide general 
information on how to secure Solr's interfaces.  However, how to set this up is 
Servlet container specific.  Could you clarify what you had in mind for this?  
Thanks.

 SSL auth or basic auth in SolrCloud
 ---

 Key: SOLR-4407
 URL: https://issues.apache.org/jira/browse/SOLR-4407
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Affects Versions: 4.1
Reporter: Sindre Fiskaa
  Labels: Authentication, Certificate, SSL

 I need to be able to secure sensitive information in Solr nodes running in a 
 SolrCloud with either SSL client/server certificates or HTTP basic auth.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4407) SSL auth or basic auth in SolrCloud

2013-02-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13572524#comment-13572524
 ] 

Christian Moen commented on SOLR-4407:
--

Thanks a lot for clarifying, Jan.  I wasn't aware of this limitation.

 SSL auth or basic auth in SolrCloud
 ---

 Key: SOLR-4407
 URL: https://issues.apache.org/jira/browse/SOLR-4407
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Affects Versions: 4.1
Reporter: Sindre Fiskaa
  Labels: Authentication, Certificate, SSL
 Fix For: 4.2, 5.0


 I need to be able to secure sensitive information in Solr nodes running in a 
 SolrCloud with either SSL client/server certificates or HTTP basic auth.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474224#comment-13474224
 ] 

Christian Moen commented on LUCENE-3922:


Thanks, Kazu.

I'm aware of the issue and the thinking is to rework this as a {{TokenFilter}} 
and use anchoring options with surrounding tokens to decide if normalisation 
should take place, i.e. if the preceding token is ¥ or the following token is 円 
in the case of normalising prices.

It might also be helpful to look into using POS-info for this to benefit from 
what we actually know about the token, i.e. to not apply normalisation if the 
POS tag is a person name.

Other suggestions and ideas are of course most welcome.
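
To illustrate the anchoring idea only (this is not the actual filter), here is a minimal 
sketch of the per-token decision such a {{TokenFilter}} would make.  The token and POS 
strings are hypothetical stand-ins for the attributes the real filter would inspect:

{noformat}
public class PriceAnchoringSketch {
  // Decide whether a kanji-number token should be normalized, based on its neighbours
  // and its POS tag.  All inputs are plain strings here purely for illustration.
  static boolean shouldNormalize(String prevToken, String nextToken, String pos) {
    if (pos != null && pos.startsWith("名詞-固有名詞-人名")) {
      return false;  // don't rewrite numbers that are part of a person name
    }
    // Anchor on the surrounding tokens: ¥/JPY before or 円 after suggests a price.
    return "¥".equals(prevToken) || "JPY".equals(prevToken) || "円".equals(nextToken);
  }

  public static void main(String[] args) {
    System.out.println(shouldNormalize("¥", null, "名詞-数"));   // true
    System.out.println(shouldNormalize(null, "円", "名詞-数"));  // true
    System.out.println(shouldNormalize(null, null, "名詞-数"));  // false
  }
}
{noformat}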


 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474287#comment-13474287
 ] 

Christian Moen commented on LUCENE-3922:


Ohtani-san,

I saw your tweet about this earlier and it sounds like a very good idea.  
Thanks.

I will try to set aside some time to work on this.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13470967#comment-13470967
 ] 

Christian Moen commented on LUCENE-3921:


Lance,

The idea I had in mind for Japanese uses language-specific characteristics for 
katakana terms and perhaps weights that are dictionary-specific as well.  
However, we are hacking our statistical model here and there are 
limitations as to how far we can go with this.

I don't know a whole lot about the Smart Chinese toolkit, but I believe the 
same approach to compound segmentation could work for Chinese as well.  
However, weights and implementation would likely be separate.  Note that the 
above is really about one specific kind of compound segmentation that applies 
to Japanese, so the thinking was to add additional heuristics for this specific 
type that is particularly tricky.

It might be a good idea to also approach this problem using the 
{{DictionaryCompoundWordTokenFilter}} and collectively build some lexical 
assets for compound splitting for the relevant languages rather than hacking 
our models.
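
For instance, a rough sketch of that direction, decompounding the katakana examples from 
this issue with a small hand-made dictionary.  This assumes the Lucene 4.x-era 
constructors that take a {{Version}} (and uses whitespace pre-tokenization purely to keep 
the example short); exact signatures differ between releases, so treat it as a sketch 
rather than a drop-in:

{noformat}
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class KatakanaDecompoundSketch {
  public static void main(String[] args) throws Exception {
    // A tiny hand-made "lexical asset"; in practice this would be a shared dictionary file.
    CharArraySet dict = new CharArraySet(Version.LUCENE_40,
        Arrays.asList("トート", "バッグ", "ショルダー"), false);
    // Whitespace pre-tokenization just for the sketch; a real setup would use JapaneseTokenizer.
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_40,
        new StringReader("トートバッグ ショルダーバッグ"));
    ts = new DictionaryCompoundWordTokenFilter(Version.LUCENE_40, ts, dict);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // original compounds plus トート, バッグ, ショルダー subwords
    }
    ts.end();
    ts.close();
  }
}
{noformat}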

 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: https://issues.apache.org/jira/browse/LUCENE-3921
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
 Environment: CentOS 5, IPA Dictionary, Run with Search mode
Reporter: Kazuaki Hiraga
  Labels: features

 The Japanese morphological analyzer Kuromoji doesn't have a capability to 
 decompose every Japanese Katakana compound token into sub-tokens. It seems 
 that some Katakana tokens can be decomposed, but it cannot be applied to every 
 Katakana compound token. For instance, トートバッグ(tote bag) and ショルダーバッグ 
 don't decompose into トート バッグ and ショルダー バッグ although the IPA dictionary 
 has バッグ as an entry.  I would like to apply the decompose feature to every 
 Katakana token if the sub-tokens are in the dictionary, or add the capability 
 to force-apply the decompose feature to every Katakana token.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-06 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13471132#comment-13471132
 ] 

Christian Moen commented on LUCENE-3922:


{quote}
Is it difficult to support numbers with period as the following?
3.2兆円
5.2億円
{quote}

Supporting this is no problem and a good idea.

{quote}
I think It would be helpful that this charfilter supports old Kanji numeric 
characters (KYU-KANJI or DAIJI) such as 壱, 壹 (One), 弌, 弐, 貳 (Two), 弍, 参,參 
(Three), or configureable.
{quote}

This is also easy to support.
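
As a sketch of what that support could look like, the variants quoted above could simply 
be folded onto the forms the normalizer already understands before the numeric conversion 
runs; the mapping below covers only the characters mentioned in this comment and is 
illustration, not the patch:

{noformat}
import java.util.HashMap;
import java.util.Map;

public class OldKanjiNumeralFolding {
  // Folds old/formal numerals (KYU-KANJI/DAIJI variants quoted above) onto their modern forms.
  private static final Map<Character, Character> FOLD = new HashMap<Character, Character>();
  static {
    for (char c : new char[] {'壱', '壹', '弌'}) FOLD.put(c, '一');  // variants of one
    for (char c : new char[] {'弐', '貳', '弍'}) FOLD.put(c, '二');  // variants of two
    for (char c : new char[] {'参', '參'})       FOLD.put(c, '三');  // variants of three
  }

  static String fold(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      sb.append(FOLD.containsKey(c) ? FOLD.get(c) : c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(fold("壱万弐千参百円"));  // 一万二千三百円, ready for the existing normalization
  }
}
{noformat}

Whether such a table is always on or an option is exactly the kind of configurability 
question raised here.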

As for making preserving zeros configurable, that's also possible, of course.

It's great to get more feedback on what sort of functionality we need and what 
should be configurable options. Hopefully, we can find a good balance without 
adding too much complexity.

Thanks for the feedback.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463595#comment-13463595
 ] 

Christian Moen commented on LUCENE-4433:


Thanks a lot for this.  I'll fix.

  kuromoji  ToStringUtil.getRomanization
 ---

 Key: LUCENE-4433
 URL: https://issues.apache.org/jira/browse/LUCENE-4433
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Wang Han

 case 'メ':
   builder.append("mi");
   break;
 -
 should be 
 case 'メ':
   builder.append("me");
   break;
 you can refer to http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463611#comment-13463611
 ] 

Christian Moen commented on LUCENE-4433:


Robert has already fixed this on {{trunk}} in {{r1339753}.

  kuromoji  ToStringUtil.getRomanization
 ---

 Key: LUCENE-4433
 URL: https://issues.apache.org/jira/browse/LUCENE-4433
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Wang Han

 case 'メ':
   builder.append("mi");
   break;
 -
 should be 
 case 'メ':
   builder.append("me");
   break;
 you can refer to http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-4433:
---

  Component/s: modules/analysis
Affects Version/s: 3.6.1

  kuromoji  ToStringUtil.getRomanization
 ---

 Key: LUCENE-4433
 URL: https://issues.apache.org/jira/browse/LUCENE-4433
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 3.6.1
Reporter: Wang Han

 case 'メ':
   builder.append("mi");
   break;
 -
 should be 
 case 'メ':
   builder.append("me");
   break;
 you can refer to http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463611#comment-13463611
 ] 

Christian Moen edited comment on LUCENE-4433 at 9/26/12 7:19 PM:
-

Robert has already fixed this on {{trunk}} in {{r1339753}}.

  was (Author: cm):
Robert has already fixed this on {{trunk}} in {{r1339753}.
  
  kuromoji  ToStringUtil.getRomanization
 ---

 Key: LUCENE-4433
 URL: https://issues.apache.org/jira/browse/LUCENE-4433
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 3.6.1
Reporter: Wang Han

 case 'メ':
   builder.append("mi");
   break;
 -
 should be 
 case 'メ':
   builder.append("me");
   break;
 you can refer to http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4433) kuromoji ToStringUtil.getRomanization

2012-09-26 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463637#comment-13463637
 ] 

Christian Moen commented on LUCENE-4433:


Any thoughts on whether we should backport this - or just a fix for the specific case 
mentioned - to the 3.6 branch, Robert?

I'm happy to do it, but I'm not sure if there will be a 3.6.2 with 4.0 being so 
close.


  kuromoji  ToStringUtil.getRomanization
 ---

 Key: LUCENE-4433
 URL: https://issues.apache.org/jira/browse/LUCENE-4433
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 3.6.1
Reporter: Wang Han

 case 'メ':
   builder.append("mi");
   break;
 -
 should be 
 case 'メ':
   builder.append("me");
   break;
 you can refer to http://en.wikipedia.org/wiki/Katakana 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461860#comment-13461860
 ] 

Christian Moen commented on SOLR-3876:
--

Thanks a lot for this, Jack.

I'm afraid I don't know the overall status or history of the 4.0 UI in IE9, 
but do you happen to know if this is a regression or if the UI has been 
generally broken for IE9 all along?

To me it sounds quite important to get this fixed for 4.0 and I can help 
working some on this.

 Solr Admin UI is completely dysfunctional on IE 9
 -

 Key: SOLR-3876
 URL: https://issues.apache.org/jira/browse/SOLR-3876
 Project: Solr
  Issue Type: Bug
  Components: web gui
Affects Versions: 4.0-BETA, 4.0
 Environment: Windows 7, IE 9
Reporter: Jack Krupansky
Priority: Critical
 Fix For: 4.0

 Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg


 The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
 shot. I don't even see a collection1 button. But Admin UI is working fine 
 in Google Chrome with same running instance of Solr.
 Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461860#comment-13461860
 ] 

Christian Moen edited comment on SOLR-3876 at 9/25/12 2:54 AM:
---

Thanks a lot for this, Jack.

I'm afraid I don't know the overall status or history of the 4.0 UI in IE9, 
but do you happen to know if this is a regression or if the UI has been 
generally broken for IE9 all along?

To me it sounds quite important to get this fixed for 4.0 if it's a regression. 
 I can help working some on this.

  was (Author: cm):
Thanks a lot for this, Jack.

I'm afraid I don't know the overall status or history of the 4.0 UI in IE9, 
but do you happen to know if this is a regression or if the UI has been 
generally broken for IE9 all along?

To me it sounds quite important to get this fixed for 4.0 and I can help 
working some on this.
  
 Solr Admin UI is completely dysfunctional on IE 9
 -

 Key: SOLR-3876
 URL: https://issues.apache.org/jira/browse/SOLR-3876
 Project: Solr
  Issue Type: Bug
  Components: web gui
Affects Versions: 4.0-BETA, 4.0
 Environment: Windows 7, IE 9
Reporter: Jack Krupansky
Priority: Critical
 Fix For: 4.0

 Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg


 The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
 shot. I don't even see a collection1 button. But Admin UI is working fine 
 in Google Chrome with same running instance of Solr.
 Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461889#comment-13461889
 ] 

Christian Moen commented on SOLR-3876:
--

The 4.0 UI wasn't developed with IE9 in mind so getting IE9 supported seems 
like a bigger effort.  SOLR-3841 seems related to this issue and has been 
deferred to 4.1 so I'm suggesting that we do the same with this one as well.

Please feel free to jump in with whatever comments you might have, steffkes.

 Solr Admin UI is completely dysfunctional on IE 9
 -

 Key: SOLR-3876
 URL: https://issues.apache.org/jira/browse/SOLR-3876
 Project: Solr
  Issue Type: Bug
  Components: web gui
Affects Versions: 4.0-BETA, 4.0
 Environment: Windows 7, IE 9
Reporter: Jack Krupansky
Priority: Critical
 Fix For: 4.1

 Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg


 The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
 shot. I don't even see a collection1 button. But Admin UI is working fine 
 in Google Chrome with same running instance of Solr.
 Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3876) Solr Admin UI is completely dysfunctional on IE 9

2012-09-24 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated SOLR-3876:
-

Fix Version/s: (was: 4.0)
   4.1

 Solr Admin UI is completely dysfunctional on IE 9
 -

 Key: SOLR-3876
 URL: https://issues.apache.org/jira/browse/SOLR-3876
 Project: Solr
  Issue Type: Bug
  Components: web gui
Affects Versions: 4.0-BETA, 4.0
 Environment: Windows 7, IE 9
Reporter: Jack Krupansky
Priority: Critical
 Fix For: 4.1

 Attachments: screenshot-1.jpg, screenshot-2.jpg, screenshot-3.jpg


 The Solr Admin UI is completely dysfunctional on IE 9. See attached screen 
 shot. I don't even see a collection1 button. But Admin UI is working fine 
 in Google Chrome with same running instance of Solr.
 Currently running 4.0 RC0, but problem existed with 4.0-BETA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4330) Add NAIST-jdic support to Kuromoji

2012-08-27 Thread Christian Moen (JIRA)
Christian Moen created LUCENE-4330:
--

 Summary: Add NAIST-jdic support to Kuromoji
 Key: LUCENE-4330
 URL: https://issues.apache.org/jira/browse/LUCENE-4330
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 5.0, 4.0
Reporter: Christian Moen


We should look into adding NAIST-jdic support to Kuromoji as this dictionary is 
better than the current IPADIC.  The NAIST-jdic license seems fine, but needs a 
formal check-off before any inclusion in Lucene.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-07-30 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated LUCENE-3922:
---

Attachment: LUCENE-3922.patch

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-07-30 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425488#comment-13425488
 ] 

Christian Moen commented on LUCENE-3922:


I've attached a work-in-progress patch for {{trunk}} that implements a 
{{CharFilter}} that normalizes Japanese numbers.

These are some TODOs and implementation considerations I have that I'd be 
thankful to get feedback on:

* Buffering the entire input on the first read should be avoided.  The primary 
reason this is done is because I was thinking to add some regexps before and 
after kanji numeric strings to qualify their normalization, i.e. to only 
normalize strings that start with ¥ or JPY, or end with 円, to only normalize 
monetary amounts in Japanese yen.  However, this probably isn't necessary as we 
can probably use {{Matcher.requireEnd()}} and {{Matcher.hitEnd()}} to 
decide if we need to read more input (see the small sketch after this list). 
(Thanks, Robert!)

* Is qualifying the numbers to be normalized with prefix and suffix regexps 
useful, i.e. to only normalize monetary amounts?

* How do we deal with leading zeros?  Currently, both 007 and ◯◯七 become 7.  
Do we want an option to preserve leading zeros?

* How large a number do we care about supporting?  Some of the kanji for larger 
numbers are surrogate pairs, which complicates the implementation, but they're 
certainly possible.  If we don't care about really large numbers, we can probably 
be fine working with {{long}} instead of {{BigInteger}}.

* Polite numbers and some other variants aren't supported, i.e. 壱, 弐, 参, etc., 
but they can easily be added.  We can also add the obsolete variants if that's 
useful somehow.  Are these useful?  Do we want them available via an option?

* Number formats such as 1億2,345万6,789 aren't supported - we don't deal with 
the comma today, but this can be added.  The same applies to 12 345 where 
there's a space that separates thousands like in French.  Numbers like 2・2兆 
aren't supported, but can be added.

* Only integers are supported today, so we can't parse 〇・一二三四, which becomes 
0 and 1234 as separate tokens instead of 0.1234
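
On the buffering point above, here is a minimal, standalone sketch of how 
{{Matcher.hitEnd()}} can tell us whether a kanji-number match touches the end of the 
current buffer and might therefore grow with more input, i.e. whether we need to read 
more before normalizing.  The character class is deliberately simplified and the helper 
is illustration only, not part of the patch.  ({{Matcher.requireEnd()}} answers the 
complementary question of whether more input could turn a match into a non-match.)

{noformat}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KanjiNumberBuffering {
  // Simplified character class; the real patch handles more characters.
  private static final Pattern KANJI_NUMBER = Pattern.compile("[〇一二三四五六七八九十百千万億兆]+");

  // Returns true if we should read more input before normalizing, because a match
  // ran into the end of the current buffer and could still grow.
  static boolean needMoreInput(CharSequence buffer) {
    Matcher m = KANJI_NUMBER.matcher(buffer);
    while (m.find()) {
      if (m.hitEnd()) {
        return true;  // the search hit the end of input; a longer match may be possible
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(needMoreInput("価格は一万二千"));  // true: the number runs to the end of the buffer
    System.out.println(needMoreInput("一万二千円です"));  // false: 円 terminates the number inside the buffer
  }
}
{noformat}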

There are probably other considerations, too, that don't immediately come 
to mind.

Numbers are fairly complicated and feedback on direction for further 
implementation is most appreciated.  Thanks.

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 price, address and so on. i.e 12万4800円(124,800JPY), 二番町三ノ二(3-2 Nibancho) and 
 十二月(December).  So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen updated SOLR-3524:
-

Attachment: SOLR-3524.patch

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt


 JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve 
 punctuation in Japanese text, although it has a parameter to change this 
 behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
 controls this behavior, to true to remove punctuation.
 I would like to have an option so I can configure this behavior in the field type 
 definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412627#comment-13412627
 ] 

Christian Moen commented on SOLR-3524:
--

Patch updated due to recent configuration changes.

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt


 JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve 
 punctuation in Japanese text, although it has a parameter to change this 
 behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
 controls this behavior, to true to remove punctuation.
 I would like to have an option so I can configure this behavior in the field type 
 definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412628#comment-13412628
 ] 

Christian Moen commented on SOLR-3524:
--

Committed revision 1360592 on {{trunk}}

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt


 JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve 
 punctuation in Japanese text, although it has a parameter to change this 
 behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
 controls this behavior, to true to remove punctuation.
 I would like to have an option so I can configure this behavior in the field type 
 definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412659#comment-13412659
 ] 

Christian Moen commented on SOLR-3524:
--

Committed revision 1360613 on {{branch_4x}}

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Fix For: 4.0, 5.0

 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt


 JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve 
 punctuation in Japanese text, although it has a parameter to change this 
 behavior.  JapaneseTokenizerFactory always sets the third parameter, which 
 controls this behavior, to true to remove punctuation.
 I would like to have an option so I can configure this behavior in the field type 
 definition in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen resolved SOLR-3524.
--

   Resolution: Fixed
Fix Version/s: 5.0
   4.0

Thanks, Kazu and Ohtani-san!

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Fix For: 4.0, 5.0

 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt


 JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve
 punctuation in Japanese text, although it has a parameter that controls this
 behavior.  JapaneseTokenizerFactory always sets that third parameter to true,
 so punctuation is removed.
 I would like an option to configure this behavior in the fieldType definition
 in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory

2012-07-12 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412685#comment-13412685
 ] 

Christian Moen commented on SOLR-3524:
--

{{CHANGES.txt}} for some reason didn't make it into {{branch_4x}}.  Fixed this 
in revision 1360622.

 Make discard-punctuation feature in Kuromoji configurable from 
 JapaneseTokenizerFactory
 ---

 Key: SOLR-3524
 URL: https://issues.apache.org/jira/browse/SOLR-3524
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 3.6
Reporter: Kazuaki Hiraga
Assignee: Christian Moen
Priority: Minor
 Fix For: 4.0, 5.0

 Attachments: SOLR-3524.patch, SOLR-3524.patch, 
 kuromoji_discard_punctuation.patch.txt


 JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve
 punctuation in Japanese text, although it has a parameter that controls this
 behavior.  JapaneseTokenizerFactory always sets that third parameter to true,
 so punctuation is removed.
 I would like an option to configure this behavior in the fieldType definition
 in schema.xml.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3615) JMX Mbeans disappear on core reload

2012-07-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411396#comment-13411396
 ] 

Christian Moen commented on SOLR-3615:
--

Emanuele, does this apply to 4.x as well or is this 3.{4,5,6} only?  I haven't 
checked the code nor am I very familiar with the Core Container, but this seems 
like an important issue to get fixed.  I'd like to help.

 JMX Mbeans disappear on core reload
 ---

 Key: SOLR-3615
 URL: https://issues.apache.org/jira/browse/SOLR-3615
 Project: Solr
  Issue Type: Bug
  Components: multicore
Affects Versions: 3.4, 3.5, 3.6
Reporter: Emanuele Lombardi
  Labels: CoreContainer, CoreReload, JMX
 Attachments: patch.txt


 https://issues.apache.org/jira/browse/SOLR-3616
 This fix solves the issue of MBeans disappearing after core reload
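
 Not the actual patch, but a plain-JMX sketch of the lifecycle at play here: once
 an MBean is unregistered it stays gone until something registers it again, which
 is what a core reload has to take care of. The class and object names below are
 made up for the example.

 {code:java}
 import java.lang.management.ManagementFactory;

 import javax.management.MBeanServer;
 import javax.management.ObjectName;

 public class MBeanReloadSketch {
   // Trivial standard MBean, for illustration only.
   public interface DemoMBean { int getValue(); }
   public static class Demo implements DemoMBean {
     @Override public int getValue() { return 42; }
   }

   public static void main(String[] args) throws Exception {
     MBeanServer server = ManagementFactory.getPlatformMBeanServer();
     ObjectName name = new ObjectName("example:type=Demo");

     server.registerMBean(new Demo(), name);          // startup registers the bean
     System.out.println(server.isRegistered(name));   // true

     server.unregisterMBean(name);                    // old core is torn down
     System.out.println(server.isRegistered(name));   // false -- gone until re-registered

     server.registerMBean(new Demo(), name);          // a reload must register again
     System.out.println(server.isRegistered(name));   // true
   }
 }
 {code}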

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4207) speed up our slowest tests

2012-07-11 Thread Christian Moen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411654#comment-13411654
 ] 

Christian Moen commented on LUCENE-4207:


Thanks a lot, Dawid.  I'll try this, have a look and report back.

Adrien, thanks for taking the time!

 speed up our slowest tests
 --

 Key: LUCENE-4207
 URL: https://issues.apache.org/jira/browse/LUCENE-4207
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir

 I was surprised to hear from Christian that the Lucene/Solr tests take him 40
 minutes on a modern Mac.
 This is too much. Let's look at the slowest tests and make them reasonable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-4201) Add Japanese character filter to normalize iteration marks

2012-07-10 Thread Christian Moen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Moen reassigned LUCENE-4201:
--

Assignee: Christian Moen

 Add Japanese character filter to normalize iteration marks
 --

 Key: LUCENE-4201
 URL: https://issues.apache.org/jira/browse/LUCENE-4201
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0, 5.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: LUCENE-4201.patch, LUCENE-4201.patch, LUCENE-4201.patch, 
 LUCENE-4201.patch, LUCENE-4201.patch


 For some applications it might be useful to normalize kanji and kana 
 iteration marks such as 々, ゞ, ゝ, ヽ and ヾ to make sure they are treated 
 uniformly.
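
 As a usage sketch only, assuming the class name {{JapaneseIterationMarkCharFilter}}
 and a single-argument constructor as in the attached patches (the committed API may
 differ), the char filter wraps a {{Reader}} and hands normalized text downstream:

 {code:java}
 import java.io.Reader;
 import java.io.StringReader;

 import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;

 public class IterationMarkSketch {
   public static void main(String[] args) throws Exception {
     // 時々 -> 時時, こゝろ -> こころ: each iteration mark is replaced by the
     // character it repeats before tokenization sees the text.
     Reader filtered =
         new JapaneseIterationMarkCharFilter(new StringReader("時々 こゝろ"));
     char[] buf = new char[64];
     int n;
     StringBuilder out = new StringBuilder();
     while ((n = filtered.read(buf)) != -1) {
       out.append(buf, 0, n);
     }
     System.out.println(out);   // expected: 時時 こころ
     filtered.close();
   }
 }
 {code}

 Separate flags for kanji-only or kana-only normalization would be a natural
 extension, but that is an assumption about the final API rather than something
 stated in this issue.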

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


