[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916915#action_12916915 ]
Julien Nioche commented on NUTCH-894:
-------------------------------------
Nice one, that's exactly what I had in mind.
+1 for committing
> Move statistical language identification from indexing to parsing step
> ----------------------------------------------------------------------
>
> Key: NUTCH-894
> URL: https://issues.apache.org/jira/browse/NUTCH-894
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.0
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done as part of the
> indexing step, whereas the detection based on HTTP header and HTML code is
> done during parsing.
> We could keep the same logic, i.e. do the statistical detection only if
> nothing has been found with the previous methods, but as part of the parsing
> step.
> This would be useful for ParseFilters which need the language information, or
> for use with ScoringFilters, e.g. to focus the crawl on a set of languages.
> Since the statistical models have been ported to Tika we should probably rely
> on them instead of maintaining our own.
> Any thoughts on this?
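
For readers of the archive, the fallback order described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in NUTCH-894.patch: the class name LanguageFallback, the method detect, and the order in which the HTTP header and the HTML hint are checked are all hypothetical, and the statistical detector is passed in as a plain function so that the sketch does not depend on any particular Tika or Nutch API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper; not part of Nutch, Tika, or the attached patch. */
final class LanguageFallback {

    private LanguageFallback() {}

    /**
     * Returns the first language found. The cheap metadata hints are checked
     * first; the statistical detector runs only if both of them are empty.
     * The header-before-HTML order is an assumption made for this sketch.
     */
    static Optional<String> detect(String httpHeaderLang,
                                   String htmlLang,
                                   String text,
                                   Function<String, String> statisticalDetector) {
        String lang = normalize(httpHeaderLang);
        if (lang == null) {
            lang = normalize(htmlLang);
        }
        if (lang == null && text != null && !text.isEmpty()) {
            // Fallback: only now pay the cost of statistical identification.
            lang = normalize(statisticalDetector.apply(text));
        }
        return Optional.ofNullable(lang);
    }

    /** Trims, lowercases, and keeps the primary subtag ("en-US" becomes "en"). */
    private static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        String s = raw.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            return null;
        }
        int dash = s.indexOf('-');
        return dash > 0 ? s.substring(0, dash) : s;
    }
}
```

With the result available at parse time, a scoring or parse filter could compare it against a configured set of languages (for example a Set<String> of accepted codes) to decide whether a page's outlinks are worth following.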
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.