[
https://issues.apache.org/jira/browse/NUTCH-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813805#comment-13813805
]
İlhami KALKAN edited comment on NUTCH-1663 at 11/5/13 1:35 PM:
---------------------------------------------------------------
I added a language-filter plugin that filters pages by language while crawling.
To use it, the language-identifier plugin must run before this plugin. The
language-filter plugin reads the language of a URL from its metadata and removes
or accepts its outlinks according to "language.filter.type" in nutch-site.xml.
If this parameter is set to accept, the plugin allows only the languages listed
in "language.filter.languages", which must be ISO-639 language codes, and
removes outlinks to pages in other languages. If it is set to filter, the plugin
removes the outlinks of pages whose language matches one of the
"language.filter.languages" entries.
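
The accept/filter decision described above can be sketched roughly as follows.
This is only an illustration and not the code in language-filter.patch: the
class and method names are hypothetical, the comma-separated value format for
"language.filter.languages" is an assumption, and so is the handling of pages
with no detected language or of an unrecognized filter type.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the accept/filter decision; not the patch itself. */
public class LanguageFilterSketch {

    /** Parses a comma-separated list of ISO-639 codes (format is an assumption). */
    static Set<String> parseLanguages(String value) {
        Set<String> codes = new HashSet<>();
        for (String code : value.split(",")) {
            String trimmed = code.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                codes.add(trimmed);
            }
        }
        return codes;
    }

    /**
     * Returns true if the outlinks of a page should be kept.
     * "accept": keep outlinks only when the page language is listed.
     * "filter": drop outlinks when the page language is listed.
     */
    static boolean keepOutlinks(String filterType, Set<String> languages, String pageLang) {
        // Assumption: a page with no detected language counts as "not listed".
        boolean listed = pageLang != null
                && languages.contains(pageLang.toLowerCase(Locale.ROOT));
        if ("accept".equals(filterType)) {
            return listed;
        }
        if ("filter".equals(filterType)) {
            return !listed;
        }
        // Assumption: an unrecognized type leaves the outlinks untouched.
        return true;
    }

    public static void main(String[] args) {
        Set<String> languages = parseLanguages("tr");
        System.out.println(keepOutlinks("accept", languages, "tr")); // true
        System.out.println(keepOutlinks("accept", languages, "en")); // false
        System.out.println(keepOutlinks("filter", languages, "tr")); // false
        System.out.println(keepOutlinks("filter", languages, "en")); // true
    }
}
```

With "tr" as the only entry, accept mode keeps the outlinks of Turkish pages
only, while filter mode keeps the outlinks of every page except Turkish ones.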
was (Author: ilhamikalkan):
I added a language-filter plugin that filters pages by language while crawling.
To use it, the language-identifier plugin must run before this plugin. The
language-identifier plugin reads the language of a URL from its metadata and
removes or accepts its outlinks according to "language.filter.type" in
nutch-site.xml. If this parameter is set to accept, the plugin allows only the
languages listed in "language.filter.languages", which must be ISO-639 language
codes, and removes outlinks to pages in other languages. If it is set to filter,
the plugin removes the outlinks of pages whose language matches one of the
"language.filter.languages" entries.
> Crawl page with specified language
> ----------------------------------
>
> Key: NUTCH-1663
> URL: https://issues.apache.org/jira/browse/NUTCH-1663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.2.1
> Reporter: İlhami KALKAN
> Priority: Minor
> Fix For: 2.3
>
> Attachments: README.txt, language-filter.patch
>
>
> Users can crawl pages in a specified language. For example, we may want to
> crawl only pages whose language is Turkish.