Hi Marc, I wonder if there is a workaround for this issue: e.g., could
we have entries for both widths? I also wonder whether there is some
interaction with an analysis chain that does half-width -> full-width
conversion (or vice versa). I think the UserDictionary has to operate
on pre-analyzed tokens ... although maybe *after* char filtering,
which presumably could handle width conversions. That was a bunch of
rambling, but the point is: can you share some more information? What
is the full entry in the dictionary that causes the problem?
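
To help narrow this down, here is a rough, untested-against-Lucene sketch of
a pre-check you could run over the dictionary file. It assumes the usual
Kuromoji user dictionary CSV layout (surface, space-separated segmentation,
readings, part-of-speech) and uses NFKC normalization from the JDK as a
stand-in for width folding; NFKC folds more than just width, so treat a
"WIDTH_MISMATCH" result as a hint, not a diagnosis. The class and method
names are made up for illustration and are not part of Lucene.

```java
import java.text.Normalizer;

public class UserDictWidthCheck {

    /** Folds half-width/full-width variants using Unicode NFKC (an approximation of width folding). */
    static String foldWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /**
     * Classifies one dictionary line, assumed to look like:
     *   surface,segment1 segment2,reading1 reading2,part-of-speech
     * Returns "OK", "WIDTH_MISMATCH", "MISMATCH", or "MALFORMED".
     */
    static String classify(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length < 2) {
            return "MALFORMED";
        }
        String surface = cols[0].trim();
        // Concatenate the segmentation by dropping the whitespace between segments.
        String concatenated = cols[1].replaceAll("\\s+", "");
        if (surface.equals(concatenated)) {
            return "OK";
        }
        // The strings differ as written; see whether they agree once widths are folded.
        if (foldWidth(surface).equals(foldWidth(concatenated))) {
            return "WIDTH_MISMATCH";
        }
        return "MISMATCH";
    }

    public static void main(String[] args) {
        // Half-width katakana surface vs. full-width segmentation (hypothetical entry).
        String halfWidth = "\uFF9A\uFF7A\uFF70\uFF80\uFF9E\uFF70";
        String fullWidth = "\u30EC\u30B3\u30FC\u30C0\u30FC";
        String line = halfWidth + "," + fullWidth + "," + fullWidth + ",\u30AB\u30B9\u30BF\u30E0\u540D\u8A5E";
        System.out.println(classify(line));
    }
}
```

If an entry comes back as WIDTH_MISMATCH, making the surface form and the
segmentation use the same width in the file (or adding one entry per width)
might be enough to get past the check.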

On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello <marcd2...@gmail.com> wrote:
>
> Hi,
>
> I had a question about the Japanese user dictionary. We have a user
> dictionary that used to work but after attempting to upgrade Lucene, it
> fails with the following error:
>
> Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー - the concatenated segmentation (レコーダー) does not match the surface form (レコーダー)
>     at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
>
> The specific commit causing this error is here
> <https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.
> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive. Any suggestions for fixing this would be greatly appreciated!
> Thanks!

