Move statistical language identification from indexing to parsing step
----------------------------------------------------------------------

                 Key: NUTCH-894
                 URL: https://issues.apache.org/jira/browse/NUTCH-894
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 2.0
            Reporter: Julien Nioche
            Assignee: Julien Nioche
             Fix For: 2.0


The statistical identification of language is currently done as part of the 
indexing step, whereas the detection based on the HTTP header and HTML code is 
done during parsing.
We could keep the same logic, i.e. run the statistical detection only if the 
previous methods found nothing, but do it as part of the parsing. This would be 
useful for ParseFilters which need the language information, or for 
ScoringFilters, e.g. to focus the crawl on a set of languages.
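
To make the proposed fallback order concrete, here is a minimal sketch of the 
logic in plain Java. It is not the actual Nutch code: the class name 
LanguageResolver, the method resolve() and the Function used as the statistical 
detector are all hypothetical, and trying the HTTP header before the HTML hint 
is an assumption, since the issue does not specify the order of those two.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch or Tika.
final class LanguageResolver {

    /**
     * Resolves a page language at parse time.
     * Cheap hints (HTTP header, HTML markup) are tried first; the statistical
     * detector is called only when neither hint yields a value.
     */
    static Optional<String> resolve(String httpHeaderLang,
                                    String htmlLang,
                                    String text,
                                    Function<String, Optional<String>> statisticalDetector) {
        // Assumed order: HTTP header first, then the HTML hint.
        Optional<String> fromHeader = clean(httpHeaderLang);
        if (fromHeader.isPresent()) {
            return fromHeader;
        }
        Optional<String> fromHtml = clean(htmlLang);
        if (fromHtml.isPresent()) {
            return fromHtml;
        }
        // Nothing found so far: fall back to statistical detection,
        // but only if there is some text to analyse.
        if (text == null || text.trim().isEmpty()) {
            return Optional.empty();
        }
        return statisticalDetector.apply(text);
    }

    // Treats null and blank strings as "no language found".
    private static Optional<String> clean(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(raw.trim());
    }

    public static void main(String[] args) {
        // Stand-in detector for the demo; a real one would wrap a
        // statistical model such as the one shipped with Tika.
        Function<String, Optional<String>> detector = t -> Optional.of("fr");
        System.out.println(resolve(null, "", "Bonjour tout le monde", detector));
    }
}
```

With the result available at parse time, a ParseFilter or ScoringFilter could 
read it from the parse metadata rather than waiting for the indexing step.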

Since the statistical models have been ported to Tika we should probably rely 
on them instead of maintaining our own.

Any thoughts on this?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
