[
https://issues.apache.org/jira/browse/NUTCH-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
İlhami KALKAN updated NUTCH-1663:
---------------------------------
Attachment: README.txt
language-filter.patch
I added a language-filter plugin that filters pages by language while crawling.
To use it, the language-identifier plugin must run before this plugin.
The language-filter plugin reads the language of a URL from its metadata and
removes or accepts its outlinks according to "language.filter.type" in
nutch-site.xml. If this parameter is set to accept, the plugin allows only the
languages listed in "language.filter.languages", which must be ISO-639 language
codes, and removes outlinks to pages in other languages. If it is set to filter,
the plugin removes the outlinks of pages whose language matches one of the
"language.filter.languages" entries.
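
The accept/filter behaviour described above can be sketched roughly as follows.
This is not the code from language-filter.patch; it is a minimal standalone
illustration of one reading of the description. The class and method names are
hypothetical, and it assumes that "language.filter.languages" is a
comma-separated list, that codes are compared case-insensitively, and that the
decision is made per page for all of its outlinks.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; not the classes shipped in language-filter.patch.
class LanguageFilterSketch {

    // Parse the value of "language.filter.languages".
    // Assumption: entries are separated by commas.
    static Set<String> parseLanguages(String value) {
        return Arrays.stream(value.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // Decide whether the outlinks of a page should be kept.
    // type     : value of "language.filter.type" ("accept" or "filter")
    // pageLang : language found in the page metadata, or null if unknown
    static boolean keepOutlinks(String type, Set<String> languages, String pageLang) {
        boolean listed = pageLang != null
                && languages.contains(pageLang.toLowerCase(Locale.ROOT));
        if ("accept".equalsIgnoreCase(type)) {
            return listed;   // keep only pages in a listed language
        }
        if ("filter".equalsIgnoreCase(type)) {
            return !listed;  // drop pages in a listed language
        }
        return true;         // assumption: an unknown type disables filtering
    }

    public static void main(String[] args) {
        Set<String> langs = parseLanguages("tr, en");
        System.out.println(keepOutlinks("accept", langs, "tr")); // true
        System.out.println(keepOutlinks("accept", langs, "de")); // false
        System.out.println(keepOutlinks("filter", langs, "tr")); // false
        System.out.println(keepOutlinks("filter", langs, "de")); // true
    }
}
```

In this sketch a page with no detected language is dropped in accept mode and
kept in filter mode; whether the patch behaves the same way is not stated here.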
> Crawl page with specified language
> ----------------------------------
>
> Key: NUTCH-1663
> URL: https://issues.apache.org/jira/browse/NUTCH-1663
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.2.1
> Reporter: İlhami KALKAN
> Priority: Minor
> Fix For: 2.3
>
> Attachments: README.txt, language-filter.patch
>
>
> Users can crawl only pages in a specified language. For example, we may want
> to crawl only pages whose language is Turkish.
--
This message was sent by Atlassian JIRA
(v6.1#6144)