Kohei Nozaki created ROL-2090:
---------------------------------

             Summary: Lucene integration doesn't work well for entries 
written in some languages
                 Key: ROL-2090
                 URL: https://issues.apache.org/jira/browse/ROL-2090
             Project: Apache Roller
          Issue Type: Improvement
          Components: Data Model & JPA Backend
    Affects Versions: 5.1.2
            Reporter: Kohei Nozaki
            Assignee: Roller Unassigned
            Priority: Minor


Reported in 
http://benzaiten.dyndns.org/roller/ugya/entry/roller_500_to_510_migration 
(Japanese). Summary in English:

h4. Japanese keywords don't match in the latter part of long entries

This is caused by the maximum token limit in the following code. The author 
said that typical Japanese text is not split by whitespace, so the limit does 
not work well with it.

{noformat}
// Limit to 1000 tokens.
LimitTokenCountAnalyzer analyzer = new LimitTokenCountAnalyzer(
        IndexManagerImpl.getAnalyzer(), 1000);
{noformat}
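
To illustrate the effect, here is a minimal, self-contained sketch that does not use Lucene at all. It assumes, purely for illustration, that every character of unsegmented text becomes its own token, while whitespace-separated text yields one token per word. The class and method names ({{TokenLimitSketch}}, {{perCharacterTokens}}, {{whitespaceTokens}}, {{capped}}) are hypothetical and exist in neither Roller nor Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Roller or Lucene code.
class TokenLimitSketch {

    // Assumption: each non-whitespace character becomes its own token,
    // a rough stand-in for text that has no spaces between words.
    static List<String> perCharacterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // One token per whitespace-separated word.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Mimics a token cap: only the first `limit` tokens are kept.
    static List<String> capped(List<String> tokens, int limit) {
        return new ArrayList<>(tokens.subList(0, Math.min(limit, tokens.size())));
    }

    // Helper that repeats a string n times.
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 1200 hiragana characters followed by the two-character word for "search".
        String japanese = repeat("\u3042", 1200) + "\u691c\u7d22";
        // 300 English words followed by "search".
        String english = repeat("word ", 300) + "search";

        List<String> ja = capped(perCharacterTokens(japanese), 1000);
        List<String> en = capped(whitespaceTokens(english), 1000);

        System.out.println("Japanese keyword kept: " + ja.contains("\u691c"));
        System.out.println("English keyword kept: " + en.contains("search"));
    }
}
```

Under these assumptions, the keyword at the end of the 1202-character Japanese text falls outside the first 1000 tokens, while the 301-word English text stays well inside the cap.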

h4. StandardAnalyzer doesn't work well with Japanese text

Roller uses {{StandardAnalyzer}}, but Lucene also provides language-specific 
analyzers such as {{CJKAnalyzer}} and {{JapaneseAnalyzer}}. The author said 
that these implementations improve accuracy for such languages. Because they 
are language-specific, we can't simply replace {{StandardAnalyzer}} with one 
of them, but we might want to make the analyzer switchable in a flexible 
manner, for example through a per-blog language setting.

{noformat}
public static final Analyzer getAnalyzer() {
    return new StandardAnalyzer(FieldConstants.LUCENE_VERSION);
}
{noformat}
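
One possible shape for such a switch, as a rough sketch only: map a blog's locale to the fully qualified name of an analyzer class and fall back to {{StandardAnalyzer}} otherwise. The class {{AnalyzerChoiceSketch}}, the method {{analyzerClassFor}}, and the idea that each blog exposes a {{Locale}} for this purpose are assumptions made for illustration. The language-to-analyzer mapping is also just an example, not a recommendation. The sketch returns class names rather than analyzer instances so that it stays independent of the Lucene version in use.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; Roller has no such class.
class AnalyzerChoiceSketch {

    // Fallback for languages without a dedicated entry.
    static final String DEFAULT_ANALYZER =
            "org.apache.lucene.analysis.standard.StandardAnalyzer";

    // Example mapping from ISO 639 language code to analyzer class name.
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("ja", "org.apache.lucene.analysis.ja.JapaneseAnalyzer");
        BY_LANGUAGE.put("zh", "org.apache.lucene.analysis.cjk.CJKAnalyzer");
        BY_LANGUAGE.put("ko", "org.apache.lucene.analysis.cjk.CJKAnalyzer");
    }

    // Returns the analyzer class name to use for a blog with the given locale.
    static String analyzerClassFor(Locale blogLocale) {
        if (blogLocale == null) {
            return DEFAULT_ANALYZER;
        }
        return BY_LANGUAGE.getOrDefault(blogLocale.getLanguage(), DEFAULT_ANALYZER);
    }

    public static void main(String[] args) {
        System.out.println(analyzerClassFor(Locale.JAPAN));
        System.out.println(analyzerClassFor(Locale.ENGLISH));
    }
}
```

A real implementation would still need to decide how to instantiate the chosen analyzer and what to do with blogs that mix several languages. Both questions are left open here.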

I'm still not sure what the proper solutions would be, but I believe we have 
room for some improvement here. Any advice would be appreciated.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
