Hi Marc,

I wonder if there is a workaround for this issue: e.g., could we have entries for both widths? I also wonder whether there is some interaction with an analysis chain that is doing half-width -> full-width conversion (or vice versa). I think the UserDictionary has to operate on pre-analyzed tokens, although maybe *after* char filtering, which presumably could handle width conversions.

A bunch of rambling, but maybe the point is: can you share some more information? What is the full entry in the dictionary that causes the problem?
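In case it helps narrow things down, here is a rough sketch of a standalone check for whether an entry's surface form and its concatenated segmentation differ only by character width. It is not Lucene's own validation: the class and method names are made up for illustration, it assumes the first two CSV columns are the surface form and a space-separated segmentation, and it uses NFKC from `java.text.Normalizer` as a stand-in for width folding (NFKC also folds other compatibility characters, so treat a match as a hint rather than proof).

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene.
public class UserDictWidthCheck {

    /** Folds half-width and full-width variants to a single form via NFKC. */
    static String foldWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Joins the space-separated segmentation column into one string. */
    static String concatSegmentation(String segmentation) {
        return segmentation.replaceAll("\\s+", "");
    }

    /** True when the two columns are unequal as written but equal after width folding. */
    static boolean differsOnlyByWidth(String surface, String segmentation) {
        String joined = concatSegmentation(segmentation);
        return !surface.equals(joined)
            && foldWidth(surface).equals(foldWidth(joined));
    }

    public static void main(String[] args) {
        // Assumed example: full-width surface form, half-width segmentation
        // (written with escapes so the width is unambiguous).
        String surface = "レコーダー";
        String segmentation = "\uFF9A\uFF7A\uFF70\uFF80\uFF9E\uFF70";
        System.out.println(differsOnlyByWidth(surface, segmentation));
    }
}
```

If this reports a width-only difference for the failing entry, making both columns use the same width in the dictionary file would be the first thing to try.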
On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello <marcd2...@gmail.com> wrote:
>
> Hi,
>
> I had a question about the Japanese user dictionary. We have a user
> dictionary that used to work but after attempting to upgrade Lucene, it
> fails with the following error:
>
> Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー
> - the concatenated segmentation (レコーダー) does not match the surface form
> (レコーダー)
> at
> org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
>
> The specific commit causing this error is here
> <https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.
> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive. Any suggestions for fixing this would be greatly appreciated!
>
> Thanks!