[
https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961376#comment-13961376
]
Sebastian Nagel commented on NUTCH-1615:
----------------------------------------
No question, reading an entire [Wikimedia
dump|http://dumps.wikimedia.org/backup-index.html] into the web table would
provide a nice playground to test content extraction, link rank algorithms,
etc. Crawling Wikipedia is not an alternative, both because of its size and
because you are explicitly asked [not to do
so|http://en.wikipedia.org/wiki/Wikipedia:Download#Please_do_not_use_a_web_crawler].
There are already tools to process Wikipedia dumps via Hadoop (e.g., search
for "[hadoop process wikipedia
dump|https://www.google.com/search?q=hadoop%20process%20wikipedia%20dump]").
But wiki markup is quite complex, and to convert it properly to HTML there is
hardly any choice other than setting up your own MediaWiki server and
importing the Wikipedia dumps. The situation for other content management
systems is no better: dumps can usually be generated, but the format isn't
standardized. Consequently, there will probably be no way to implement a
generalized tool that "fetches from website dumps".
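For illustration only, here is a minimal sketch (independent of any
particular Hadoop tool) of how a pages-articles XML dump could be streamed
with the standard StAX API to pull out page titles and raw wiki markup. The
class name and the dump path are placeholders, and the extracted wikitext
would still need the MediaWiki-based conversion to HTML mentioned above.

{code:java}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/**
 * Illustrative only: streams an (uncompressed) pages-articles dump and
 * prints each page title together with the length of its raw wikitext.
 */
public class WikiDumpReader {

  public static void main(String[] args) throws Exception {
    // args[0]: path to a pages-articles XML dump (placeholder).
    try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
      XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
      String title = null;
      while (xml.hasNext()) {
        if (xml.next() == XMLStreamConstants.START_ELEMENT) {
          String name = xml.getLocalName();
          if ("title".equals(name)) {
            title = xml.getElementText();
          } else if ("text".equals(name) && title != null) {
            // Raw wiki markup; converting it properly to HTML is the hard part.
            String wikitext = xml.getElementText();
            System.out.println(title + "\t" + wikitext.length());
            title = null;
          }
        }
      }
      xml.close();
    }
  }
}
{code}

Feeding such records into the web table (or into a Hadoop job) would then be
a separate step, and nothing in this sketch addresses the markup-conversion
problem.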
> Implementing A Feature for Fetching From Websites Dump
> ------------------------------------------------------
>
> Key: NUTCH-1615
> URL: https://issues.apache.org/jira/browse/NUTCH-1615
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 2.1
> Reporter: cihad güzel
> Priority: Minor
>
> Some web sites provide dumps (such as http://dumps.wikimedia.org/enwiki/ for
> wikipedia.org). We should fetch from the dumps for such web sites, so that
> fetching will be quicker.