[ https://issues.apache.org/jira/browse/LUCENE-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828848#comment-15828848 ]
Hoss Man commented on LUCENE-7635:
----------------------------------
I'm not very familiar with Kuromoji, but I believe the lines you're deleting in
this patch are intended to catch comments at the _end_ of a line, not just
the beginning, i.e.:
{noformat}
# comment at start of line
朝青龍,朝青龍,アサショウリュウ,カスタム人名 # end line comment, has a comma in it
# spans more then one line
abcd,a b cd,foo1 foo2 foo3,bar # Another end line comment
{noformat}
Since the intent of the UserDict format seems to be "CSV with '#' comments",
the comment stripping should probably be moved to
o.a.l.analysis.ja.util.CSVUtil, where it can be done if-and-only-if the '#' is
not part of a quoted value:
{noformat}
朝青龍,朝青龍,アサショウリュウ,カスタム人名 # end line comment, has a comma in it
# spans more then one line
abcd,a b cd,foo1 foo2 foo3,bar # Another end line comment
"quoted#sharp",other,"quoted,stuff" # yet another end line comment
{noformat}
i.e., add an {{if(c == '#' && !insideQuote)}} block (similar to the existing
{{COMMA}} conditional) to CSVUtil.parse() that would (trim and) add the final
value to the result and break out of the for loop.
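
To make the suggestion concrete, here is a minimal stand-alone sketch of a quote-aware parse loop with such a '#' check. It is not the actual CSVUtil code: the class name {{CommentAwareCsv}}, the constants, and the handling of unbalanced quotes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; NOT the real o.a.l.analysis.ja.util.CSVUtil. */
final class CommentAwareCsv {
  private static final char QUOTE = '"';
  private static final char COMMA = ',';
  private static final char COMMENT = '#';

  /** Splits one line into trimmed values, ignoring everything after an unquoted '#'. */
  static String[] parse(String line) {
    List<String> result = new ArrayList<>();
    StringBuilder sb = new StringBuilder();
    boolean insideQuote = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == QUOTE) {
        // Toggle quoting state; the quote character itself is dropped.
        insideQuote = !insideQuote;
      } else if (c == COMMA && !insideQuote) {
        // An unquoted comma ends the current value.
        result.add(sb.toString().trim());
        sb.setLength(0);
      } else if (c == COMMENT && !insideQuote) {
        // An unquoted '#' starts a comment: stop reading the line here.
        break;
      } else {
        sb.append(c);
      }
    }
    if (insideQuote) {
      // Assumption: this sketch rejects unbalanced quotes outright.
      throw new IllegalArgumentException("Unbalanced quote in: " + line);
    }
    // Add the final value, unless the line held nothing but a comment.
    // Simplification: trimming also strips whitespace inside quoted values.
    String last = sb.toString().trim();
    if (!(result.isEmpty() && last.isEmpty())) {
      result.add(last);
    }
    return result.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String line = "\"quoted#sharp\",other,\"quoted,stuff\" # yet another end line comment";
    System.out.println(Arrays.toString(parse(line)));
  }
}
```

Note that under this rule an unquoted '#' inside a value (as in the {{withsharp#}} entry from the issue description) would still be read as the start of a comment, so such values would have to be quoted.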
> Kuromoji fails if user dictionary contains #
> --------------------------------------------
>
> Key: LUCENE-7635
> URL: https://issues.apache.org/jira/browse/LUCENE-7635
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Masaru Hasegawa
> Attachments: LUCENE-7635.patch
>
>
> If user dictionary contains entries like:
> {code}
> withsharp#,withsharp#,withsharp#,カスタム名詞
> {code}
> It fails to create the dictionary, throwing
> java.lang.ArrayIndexOutOfBoundsException.