[ https://issues.apache.org/jira/browse/LUCENE-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved LUCENE-3696.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6

> building a kuromoji dictionary is very slow and eventually fails if you use java 5
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-3696
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3696
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.6
>            Reporter: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3696.patch, LUCENE-3696.patch
>
>
> Note: This only affects you if you use Java 5 on 3.x, and only if you want to download/rebuild the dictionary.
> The analyzer itself works fine on 3.x with Java 5.
> With Java 6, building a kuromoji dictionary is quite fast:
> {noformat}
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java]   encode...
>      [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
>      [java] done
>      [java] building unknown word dict...done
>      [java] building connection costs...done
> BUILD SUCCESSFUL
> Total time: 6 seconds
> {noformat}
> However, if you use Java 5, it takes forever and eventually runs out of memory in the CSV parsing phase.
> So we might need to optimize the CSV parser (for example, precompile its patterns).
> {noformat}
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>      [java]     at java.util.regex.Pattern.newSlice(Pattern.java:2909)
>      [java]     at java.util.regex.Pattern.atom(Pattern.java:1898)
>      [java]     at java.util.regex.Pattern.sequence(Pattern.java:1794)
>      [java]     at java.util.regex.Pattern.expr(Pattern.java:1687)
>      [java]     at java.util.regex.Pattern.compile(Pattern.java:1397)
>      [java]     at java.util.regex.Pattern.<init>(Pattern.java:1124)
>      [java]     at java.util.regex.Pattern.compile(Pattern.java:817)
>      [java]     at java.lang.String.replaceAll(String.java:2000)
>      [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.unQuoteUnEscape(CSVUtil.java:84)
>      [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.parse(CSVUtil.java:55)
>      [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:96)
>      [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> BUILD FAILED
> /home/rmuir/workspace/lucene-branch3x2/lucene/contrib/analyzers/kuromoji/build.xml:75: Java returned: 1
> Total time: 2 minutes 4 seconds
> {noformat}
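
For readers following the stack trace in the issue description: it shows String.replaceAll calling Pattern.compile from inside CSVUtil.unQuoteUnEscape, so a pattern is compiled on every call. The sketch below illustrates the "precompile its patterns" idea from the description. It is not the code from the attached patch; the class name CsvUnquoteSketch, its method names, and the two regular expressions are assumptions made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the actual CSVUtil code.
final class CsvUnquoteSketch {
    // Assumed patterns: one strips a pair of surrounding quotes,
    // the other finds a doubled (escaped) quote inside the field.
    private static final String SURROUNDING_REGEX = "^\"(.*)\"$";
    private static final String ESCAPED_REGEX = "\"\"";

    // Compiled once when the class is loaded, then reused for every field.
    private static final Pattern SURROUNDING_QUOTES = Pattern.compile(SURROUNDING_REGEX);
    private static final Pattern ESCAPED_QUOTE = Pattern.compile(ESCAPED_REGEX);

    private CsvUnquoteSketch() {}

    // Per-call compilation: each String.replaceAll call compiles its regex again.
    static String unquoteSlow(String field) {
        return field.replaceAll(SURROUNDING_REGEX, "$1")
                    .replaceAll(ESCAPED_REGEX, "\"");
    }

    // Same result, but only Matcher objects are created per call.
    static String unquoteFast(String field) {
        String result = field;
        Matcher m = SURROUNDING_QUOTES.matcher(field);
        if (m.matches()) {
            result = m.group(1); // drop the surrounding quotes
        }
        // Collapse each doubled quote into a single quote.
        return ESCAPED_QUOTE.matcher(result).replaceAll("\"");
    }
}
```

Calling CsvUnquoteSketch.unquoteFast on each CSV field keeps the number of Pattern.compile calls constant no matter how many dictionary lines are parsed, whereas the replaceAll variant compiles two patterns per field.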