You just have to make sure that the language pack makes it easy to apply the
same pre-processing to test data that you applied at training time. That means
bundling the segmentation model with the language pack (or doing something
simpler, like treating each character as a word, which degrades performance but
is easier to reproduce). I typically use the Stanford segmenter, but I'm not
sure the choice matters that much.
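For illustration, here is a minimal sketch of the "single-character words" fallback mentioned above: split runs of Han characters into one token per character while leaving Latin words intact. The function name and details are made up for this example; it is not code from Joshua or the Stanford segmenter.

```python
def char_segment(text):
    """Naive fallback segmentation: emit each CJK character as its own
    token, keep runs of non-CJK characters (Latin words, numbers) whole,
    and split on whitespace. Trivially reproducible at train and test
    time, at some cost in translation quality."""
    tokens, buf = [], []
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':  # CJK Unified Ideographs (basic block)
            if buf:
                tokens.append(''.join(buf))
                buf = []
            tokens.append(ch)
        elif ch.isspace():
            if buf:
                tokens.append(''.join(buf))
                buf = []
        else:
            buf.append(ch)
    if buf:
        tokens.append(''.join(buf))
    return tokens

print(char_segment("我喜欢 machine translation"))
# ['我', '喜', '欢', 'machine', 'translation']
```

The point is not accuracy but determinism: any tool that applies exactly this rule reproduces the training-time tokenization with no external model dependency.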
> On Feb 19, 2018, at 1:45 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> Thanks, Matt.
> Would you be able to point out that additional step in a bit more detail
> when you have time?
> Not sure what you used for segmentation; perhaps we could use either Lucene's
> CJK or Kuromoji analyzers:
>
>   https://lucene.apache.org/core/7_0_0/analyzers-kuromoji/
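For context on the Lucene option: CJKAnalyzer does not do dictionary segmentation at all; it emits overlapping bigrams over runs of Han characters (Kuromoji, by contrast, does real morphological analysis for Japanese). A rough standalone sketch of the bigram strategy, purely as illustration (this is not Lucene's code, and the function name is invented):

```python
def cjk_bigrams(text):
    """Illustrate the overlapping-bigram strategy used by CJK-style
    analyzers: collect Han characters and emit each adjacent pair as
    a token. A lone Han character is emitted as a unigram."""
    han = [ch for ch in text if '\u4e00' <= ch <= '\u9fff']
    if len(han) <= 1:
        return han
    return [han[i] + han[i + 1] for i in range(len(han) - 1)]

print(cjk_bigrams("我喜欢你"))
# ['我喜', '喜欢', '欢你']
```

Bigrams are fine for search-style recall, but note they produce overlapping "words", which is a different tokenization contract than the word segmentation an MT language pack was trained on.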
> On Mon, Feb 19, 2018 at 12:12 PM Matt Post <p...@cs.jhu.edu> wrote:
>> I don’t think I ever built these. There is an additional step of properly
>> and consistently segmenting Chinese which complicates things and creates an
>> external dependency.
>> matt (from my phone)
>>> On Feb 19, 2018, at 10:46, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>>> Hi all,
>>> I am not sure if I am missing something, but I somewhat recalled that
>>> language packs for Chinese (but also Japanese / Korean) existed at ,
>>> however I can't find any.
>>> Reading through the comments it seems at least that was the plan.
>>> If that is a leftover from the recent LP migration we could try to fix it;
>>> otherwise it'd be nice to build and provide such CJK LPs.
>>> Can anyone help clarify?
>>>   https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs