[
https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doğacan Güney closed NUTCH-894.
-------------------------------
Assignee: Doğacan Güney (was: Julien Nioche)
Resolution: Fixed
Committed as of rev. 1003608.
> Move statistical language identification from indexing to parsing step
> ----------------------------------------------------------------------
>
> Key: NUTCH-894
> URL: https://issues.apache.org/jira/browse/NUTCH-894
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.0
> Reporter: Julien Nioche
> Assignee: Doğacan Güney
> Fix For: 2.0
>
> Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done as part of the
> indexing step, whereas the detection based on the HTTP header and HTML code is
> done during parsing.
> We could keep the same logic, i.e. run the statistical detection only if
> the previous methods found nothing, but do so as part of parsing.
> This would be useful for ParseFilters that need the language information, or
> for ScoringFilters, e.g. to focus the crawl on a set of languages.
> Since the statistical models have been ported to Tika, we should probably rely
> on them instead of maintaining our own.
> Any thoughts on this?
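
For illustration only, the fallback order described in the issue (HTTP header, then HTML markup, then statistical detection as a last resort) might be sketched roughly as below. This is not the committed patch: the class name, method names, and the detector passed in as a plain function are all hypothetical, and no real Nutch or Tika API is used.

```java
import java.util.Locale;
import java.util.function.Function;

/**
 * Hypothetical sketch of the detection order discussed in NUTCH-894.
 * None of these names come from Nutch or Tika.
 */
public class LanguageFallbackSketch {

    /**
     * Returns the language from the HTTP header if present, else from the
     * HTML markup, else from the statistical detector. The detector is only
     * invoked when both cheaper hints are missing.
     */
    static String resolveLanguage(String httpHeaderLang,
                                  String htmlLang,
                                  String text,
                                  Function<String, String> statisticalDetector) {
        String fromHeader = normalize(httpHeaderLang);
        if (fromHeader != null) {
            return fromHeader;
        }
        String fromHtml = normalize(htmlLang);
        if (fromHtml != null) {
            return fromHtml;
        }
        // Assumption: the detector returns a language code, or null if unsure.
        return normalize(statisticalDetector.apply(text));
    }

    /** Trims, lower-cases, and strips any region suffix ("en-US" becomes "en"). */
    static String normalize(String lang) {
        if (lang == null) {
            return null;
        }
        String trimmed = lang.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return null;
        }
        int dash = trimmed.indexOf('-');
        return dash > 0 ? trimmed.substring(0, dash) : trimmed;
    }

    public static void main(String[] args) {
        // Stand-in detector for the demo; a real one would analyse the text.
        Function<String, String> fakeDetector = text -> "fr";
        System.out.println(resolveLanguage("en-US", null, "some text", fakeDetector));
        System.out.println(resolveLanguage(null, null, "du texte", fakeDetector));
    }
}
```

Because the result is available at parse time in this arrangement, a ParseFilter or ScoringFilter could read it before indexing, which is the use case the issue mentions.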
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.