Re: Issue with Japanese User Dictionary

Tomoko Uchida Thu, 13 Jan 2022 18:59:04 -0800

Hi,

> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive


This is intended behavior. The first column of a user dictionary entry
must equal the concatenation of the segments in the second column,
compared by Unicode code point. No normalization, such as full-width or
half-width conversion, is applied (any normalization or tweak can cause
runtime bugs).
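
To make the rule concrete, here is a minimal Java sketch of the comparison.
It is not Lucene's actual implementation: the class name UserDictEntryCheck
and its methods are hypothetical, and it assumes the four-column layout
(surface, space-separated segmentation, readings, part of speech) seen in
the entry quoted below.

```java
// Hypothetical helper, NOT part of Lucene: a sketch of the rule described above.
class UserDictEntryCheck {

    /** Joins the whitespace-separated segments of the second column into one string. */
    static String concatenatedSegmentation(String line) {
        String[] columns = line.split(",");
        if (columns.length < 4) {
            throw new IllegalArgumentException("expected 4 comma-separated columns: " + line);
        }
        return String.join("", columns[1].trim().split("\\s+"));
    }

    /** True if the first column equals the concatenated second column. */
    static boolean surfaceMatchesSegmentation(String line) {
        String surface = line.split(",")[0];
        // String.equals compares the raw UTF-16 content, so two strings are
        // equal only if their code points are identical. No width folding or
        // other normalization happens here.
        return surface.equals(concatenatedSegmentation(line));
    }

    public static void main(String[] args) {
        String[] lines = {
            "ﾚｺｰﾀﾞｰ,レコーダー,レコーダー,JA名詞",          // half-width surface, full-width segmentation
            "レコーダー,レコーダー,レコーダー,JA名詞",       // full-width in both columns
            "ﾚｺｰﾀﾞｰ,ﾚｺｰﾀﾞｰ,レコーダー,JA名詞",              // half-width in both columns
            "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞"
        };
        for (String line : lines) {
            String verdict = surfaceMatchesSegmentation(line) ? "accepted" : "rejected";
            System.out.println(verdict + ": " + line);
        }
    }
}
```

Under this check, an entry passes when both of its first two columns use
the same width, and only the mixed-width entry is rejected. Whether keeping
one entry per width is the right fix depends on what your analysis chain
does to the text before tokenization.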

On Fri, Jan 14, 2022 at 5:45 Marc D'Mello <[email protected]> wrote:
>
> Hi Mike,
>
> Thanks for the response! I'm actually not super familiar with
> UserDictionaries, but looking at the code, it seems like a single line in
> the user provided user dictionary corresponds to a single entry? In that
> case, here is the line (or entry) that does have both widths that I believe
> is causing the problem:
>
> ﾚｺｰﾀﾞｰ,レコーダー,レコーダー,JA名詞
>
> I'm guessing here the surface is ﾚｺｰﾀﾞｰ and the concatenated segment is the
> first occurrence of レコーダー. I'm not sure what surface or concatenated
> segment means, though, or what it would mean semantically to replace the
> surface with the full-width version or the concatenated segment with the
> half-width version.
>
> Thanks,
> Marc
>
>
> On Thu, Jan 13, 2022 at 7:18 AM Michael Sokolov <[email protected]> wrote:
>
> > Hi Marc, I wonder if there is a workaround for this issue: e.g., could
> > we have entries for both widths? I wonder if there is some interaction
> > with an analysis chain that is doing half-width -> full-width
> > conversion (or vice versa)? I think the UserDictionary has to operate
> > on pre-analyzed tokens ... although maybe *after* char filtering,
> > which presumably could handle width conversions. A bunch of rambling,
> > but maybe the point is - can you share some more information -- what
> > is the full entry in the dictionary that causes the problem?
> >
> > On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I had a question about the Japanese user dictionary. We have a user
> > > dictionary that used to work but after attempting to upgrade Lucene, it
> > > fails with the following error:
> > >
> > > Caused by: java.lang.RuntimeException: Illegal user dictionary entry ﾚｺｰﾀﾞｰ
> > > - the concatenated segmentation (レコーダー) does not match the surface form (ﾚｺｰﾀﾞｰ)
> > >     at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
> > >
> > > The specific commit causing this error is here
> > > <https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.
> > > The only thing that seems to differ is that the characters are full-width
> > > vs half-width, so I was wondering if this is intended behavior or a bug/too
> > > restrictive. Any suggestions for fixing this would be greatly appreciated!
> > > Thanks!
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
