Move statistical language identification from indexing to parsing step
----------------------------------------------------------------------
Key: NUTCH-894
URL: https://issues.apache.org/jira/browse/NUTCH-894
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0
The statistical identification of language is currently done as part of the
indexing step, whereas the detection based on the HTTP header and HTML code is
done during parsing.
We could keep the same logic, i.e. run the statistical detection only if the
previous methods found nothing, but do it as part of the parsing. This would be
useful for ParseFilters that need the language information, or for
ScoringFilters, e.g. to focus the crawl on a set of languages.
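
To make the proposed order concrete, here is a minimal sketch of the fallback
logic in plain Java. It is not the existing Nutch code: the class name, the
method names and the Function-based statistical detector are hypothetical
placeholders, and the detector could be backed by Tika or by any other
implementation.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Nutch or Tika.
final class LanguageFallbackSketch {

    /**
     * Returns the language from the HTTP header if present, else from the
     * HTML markup, and only then falls back to the statistical detector.
     */
    static Optional<String> detect(String contentLanguageHeader,
                                   String htmlLangAttribute,
                                   String text,
                                   Function<String, Optional<String>> statisticalDetector) {
        Optional<String> fromHeader = normalize(contentLanguageHeader);
        if (fromHeader.isPresent()) {
            return fromHeader;
        }
        Optional<String> fromHtml = normalize(htmlLangAttribute);
        if (fromHtml.isPresent()) {
            return fromHtml;
        }
        // Nothing declared: this is the only path that pays for the
        // statistical detection.
        return statisticalDetector.apply(text);
    }

    /** Reduces values such as "en-US, fr" to the primary subtag "en". */
    static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        // Keep only the first entry of a comma- or semicolon-separated list.
        String first = raw.trim().split("[,;]", 2)[0].trim();
        // Keep only the primary subtag, lower-cased.
        String primary = first.split("[-_]", 2)[0].toLowerCase(Locale.ROOT);
        return primary.isEmpty() ? Optional.empty() : Optional.of(primary);
    }

    public static void main(String[] args) {
        // Stub detector standing in for a real statistical model.
        Function<String, Optional<String>> stub = t -> Optional.of("fr");
        System.out.println(detect("en-US", null, "some text", stub));
        System.out.println(detect(null, null, "du texte", stub));
    }
}
```

Because the result is available at parse time, a ParseFilter or ScoringFilter
could read it without waiting for the indexing step.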
Since the statistical models have been ported to Tika we should probably rely
on them instead of maintaining our own.
Any thoughts on this?