One thing that can be done is to move the n-gram language-detection calls into HTMLLanguageParser (an HtmlParseFilter plugin). Once the language has been detected there, store the result in a parse metadata field, and modify the LanguageIdentifier IndexingFilter plugin to read that field instead of running n-gram detection itself.
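Something like the two-phase split below is what I mean. This is a minimal, self-contained sketch, not the actual Nutch interfaces: the class, the detectLanguage helper, the plain-Map parse metadata, and the "lang" key are all stand-ins for illustration.

import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed split (names are hypothetical, not Nutch APIs):
// the parse filter detects the language once during fetching/parsing and
// stashes it in the parse metadata; the indexing filter only reads it.
public class LanguageDetectionSketch {

    static final String LANG_KEY = "lang";  // hypothetical metadata key

    // Stand-in for the n-gram detector; assume this is the expensive call.
    static String detectLanguage(String text) {
        return "en";  // placeholder result
    }

    // HtmlParseFilter role: detect once, record the result in parse metadata.
    static void parseFilter(String pageText, Map<String, String> parseMeta) {
        parseMeta.put(LANG_KEY, detectLanguage(pageText));
    }

    // IndexingFilter role: reuse the stored value instead of re-running
    // detection over the whole document at index time.
    static String indexFilter(Map<String, String> parseMeta) {
        return parseMeta.getOrDefault(LANG_KEY, "unknown");
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        parseFilter("Some fetched page text ...", parseMeta);
        System.out.println("indexed lang = " + indexFilter(parseMeta));
    }
}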
Because fetching is multi-threaded, this speeds up language detection dramatically. It does place more load on the machine during fetching, but it doesn't slow fetching down too badly.

On 4/16/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Sami, hi all,
>
> I like the language identifier very much, but we noticed that it slows
> down the indexing process by about 3x. In case people index very large
> segments, this may be a problem.
>
> I have a set of questions:
> + Can you tell me which corpus you used to generate the n-gram files?
> + Are there any plans to improve speed by fine-tuning the implementation?
> + Why use Vectors instead of ArrayLists?
> + Do you think it makes sense to use thresholds? For example, don't
>   generate a score for the complete profile, but use only the top 10
>   n-grams and check whether there is a clear best profile using a
>   threshold. In case the result isn't clear, use 10 more n-grams, etc.
>
> Thanks for any comments,
> Stefan
>
> -------------------------------------------------------------
> Hommingberger Gepardenforelle
> http://wiki.media-style.com/display/~hommingbergergepardenforelle
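As a rough illustration of the threshold idea Stefan describes above: score the profiles in chunks of 10 n-grams and stop early once one profile clearly leads the runner-up. Everything here (the chunk size, the margin, the simplified per-language profile representation) is an assumption for the sketch, not the current LanguageIdentifier code.

import java.util.*;

// Hypothetical early-exit scorer: add document n-grams to every profile's
// score in chunks, and stop as soon as the best profile leads by a margin.
public class EarlyExitScorer {

    static final int CHUNK = 10;        // n-grams added per round (assumed)
    static final double MARGIN = 0.2;   // "clear winner" threshold (assumed)

    // docNgrams: the document's n-grams, most frequent first.
    // profiles: per-language n-gram weights (simplified representation).
    static String identify(List<String> docNgrams,
                           Map<String, Map<String, Double>> profiles) {
        Map<String, Double> scores = new HashMap<>();
        for (String lang : profiles.keySet()) scores.put(lang, 0.0);

        for (int used = 0; used < docNgrams.size(); used += CHUNK) {
            int end = Math.min(used + CHUNK, docNgrams.size());
            // Add the next chunk of n-grams to every profile's score.
            for (String lang : profiles.keySet()) {
                Map<String, Double> profile = profiles.get(lang);
                double s = scores.get(lang);
                for (int i = used; i < end; i++) {
                    s += profile.getOrDefault(docNgrams.get(i), 0.0);
                }
                scores.put(lang, s);
            }
            // Stop early if the best profile is clearly ahead of the rest.
            List<Double> sorted = new ArrayList<>(scores.values());
            sorted.sort(Collections.reverseOrder());
            if (sorted.size() < 2 || sorted.get(0) - sorted.get(1) > MARGIN) {
                break;  // clear winner: skip the remaining n-grams
            }
        }
        return Collections.max(scores.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}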
