One thing that can be done is to move the n-gram language detection
calls into HTMLLanguageParser (an HtmlParseFilter plugin).  After
running the language detection there, set a parse metadata field with
the result.  Then modify the LanguageIdentifier IndexingFilter plugin
to look for this metadata field instead of running the n-gram
detection again at index time.
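
For concreteness, a rough sketch of that split is below.  It assumes
the Nutch 0.x plugin interfaces; the exact method signatures, the
LanguageIdentifier.getInstance().identify() call, and the "X-language"
metadata key are illustrative only, so treat it as pseudocode against
whatever version you're running (imports omitted, since package names
differ between versions).

// Parse-time side: run the n-gram detection once during fetch/parse
// (which is multi-threaded) and record the result in parse metadata.
public class HTMLLanguageParser implements HtmlParseFilter {
  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    String lang = LanguageIdentifier.getInstance().identify(parse.getText());
    if (lang != null) {
      // "X-language" is just an agreed-upon key, not an existing constant
      parse.getData().getMetadata().put("X-language", lang);
    }
    return parse;
  }
}

// Index-time side: read the stored field instead of re-running the
// expensive n-gram scoring for every document being indexed.
public class LanguageIndexingFilter implements IndexingFilter {
  public Document filter(Document doc, Parse parse, FetcherOutput fo) {
    String lang = parse.getData().getMetadata().getProperty("X-language");
    if (lang != null) {
      doc.add(Field.Keyword("lang", lang));
    }
    return doc;
  }
}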

Because fetching is multi-threaded, the detection work is spread
across the fetcher threads instead of running serially at index time,
which speeds up language detection dramatically.  This does place
more load on the machine during fetching, but it doesn't slow the
fetch down much.

On 4/16/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Sami, Hi all,
> 
> I like the language identifier very much, but we noticed that it
> slows down the indexing process by a factor of three.
> For people who index very large segments this may be a problem.
> 
> I have a set of questions:
> + Can you tell me which corpus you used to generate the ngram files?
> + Are there any plans to improve speed by fine-tuning the
> implementation?
> + Why use Vectors instead of ArrayLists?
> + Do you think it makes sense to use thresholds? For example, instead
> of generating a score against the complete profile, use only the top
> 10 ngrams and check whether there is a clear best profile using a
> threshold. If the result isn't clear, use 10 more ngrams, and so on.
> 
> Thanks for any comments,
> Stefan
> 
> -------------------------------------------------------------
> Hommingberger Gepardenforelle
> http://wiki.media-style.com/display/~hommingbergergepardenforelle
> 
>
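
Regarding the threshold question above: a rough sketch of that
incremental scoring idea is below.  This is not how LanguageIdentifier
currently works; the identify() method, the step size, the margin, and
the profile representation are all made up for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of incremental scoring: score only the document's top-N
// n-grams first, and stop early if one profile is clearly ahead.
public class IncrementalScorer {
  private static final int STEP = 10;        // n-grams added per round
  private static final double MARGIN = 0.2;  // "clear winner" cutoff (illustrative)

  // docNgrams: the document's n-grams, sorted by descending frequency
  // profiles:  language name -> (n-gram -> rank weight)
  public String identify(List<String> docNgrams,
                         Map<String, Map<String, Double>> profiles) {
    Map<String, Double> scores = new HashMap<String, Double>();
    for (String lang : profiles.keySet()) scores.put(lang, 0.0);

    for (int used = 0; used < docNgrams.size(); used += STEP) {
      int end = Math.min(used + STEP, docNgrams.size());
      // add the contribution of the next STEP n-grams to each profile
      for (String lang : profiles.keySet()) {
        Map<String, Double> profile = profiles.get(lang);
        double s = scores.get(lang);
        for (String ngram : docNgrams.subList(used, end)) {
          Double w = profile.get(ngram);
          if (w != null) s += w;
        }
        scores.put(lang, s);
      }
      // find the best and second-best profiles so far
      String best = null, second = null;
      for (String lang : scores.keySet()) {
        if (best == null || scores.get(lang) > scores.get(best)) {
          second = best;
          best = lang;
        } else if (second == null || scores.get(lang) > scores.get(second)) {
          second = lang;
        }
      }
      if (second == null
          || scores.get(best) - scores.get(second) > MARGIN * scores.get(best)) {
        return best;  // confident enough; skip the remaining n-grams
      }
    }
    return null;  // no clear winner even after all n-grams
  }
}

The early exit only pays off when one profile pulls ahead quickly,
which tends to happen for longer documents in well-separated
languages; closely related languages would fall through to the full
scoring anyway.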
