[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step

Julien Nioche (JIRA) Fri, 01 Oct 2010 08:34:02 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916915#action_12916915
 ]


Julien Nioche commented on NUTCH-894:
-------------------------------------

Nice one, that's exactly what I had in mind.
+1 for committing

> Move statistical language identification from indexing to parsing step
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-894
>                 URL: https://issues.apache.org/jira/browse/NUTCH-894
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 2.0
>
>         Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done as part of the 
> indexing step, whereas the detection based on HTTP header and HTML code is 
> done during the parsing.
> We could keep the same logic, i.e. do the statistical detection only if 
> nothing has been found with the previous methods, but as part of the parsing. 
> This would be useful for ParseFilters which need the language information, or 
> for use with ScoringFilters, e.g. to focus the crawl on a set of languages.
> Since the statistical models have been ported to Tika we should probably rely 
> on them instead of maintaining our own.
> Any thoughts on this?
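
The fallback order described in the issue can be sketched roughly as below. This is only an illustration of the idea, not the attached patch: the class LanguageFallback, its methods, and the stub detector are hypothetical names, the preference of the HTTP header over the HTML hint is an assumption, and the statistical step is passed in as a plain function rather than calling any real Nutch or Tika API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Nutch or Tika.
public class LanguageFallback {

    /** Returns the first non-blank hint, normalised to lower case. */
    public static Optional<String> fromHints(String httpHeaderLang, String htmlLang) {
        // Assumption: the HTTP header is consulted before the HTML hint.
        for (String hint : new String[] { httpHeaderLang, htmlLang }) {
            if (hint != null && !hint.trim().isEmpty()) {
                return Optional.of(hint.trim().toLowerCase(Locale.ROOT));
            }
        }
        return Optional.empty();
    }

    /**
     * Uses the cheap hints first and only runs the statistical detector
     * (supplied by the caller) when neither hint yields a language.
     */
    public static String identify(String httpHeaderLang, String htmlLang,
                                  String text,
                                  Function<String, String> statisticalDetector) {
        return fromHints(httpHeaderLang, htmlLang)
                .orElseGet(() -> statisticalDetector.apply(text));
    }

    public static void main(String[] args) {
        // Stand-in for a real statistical model; always answers "fr".
        Function<String, String> stub = text -> "fr";

        System.out.println(identify("en", null, "some page text", stub));
        System.out.println(identify(null, " DE ", "some page text", stub));
        System.out.println(identify(null, "", "some page text", stub));
    }
}
```

Because the detector is only invoked through orElseGet, the statistical step is skipped entirely whenever a header or HTML hint is present, which is the behaviour the description asks to preserve when moving it into the parsing step.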

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
