[ https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961376#comment-13961376 ]

Sebastian Nagel commented on NUTCH-1615:
----------------------------------------

No question, reading an entire [Wikimedia
dump|http://dumps.wikimedia.org/backup-index.html] into the web table would
provide a nice playground to test content extraction, link rank algorithms,
etc. Crawling Wikipedia is no alternative because of its size and because you
are encouraged [not to do
so|http://en.wikipedia.org/wiki/Wikipedia:Download#Please_do_not_use_a_web_crawler].
There are already tools to process Wikipedia dumps via Hadoop (e.g., search
for "[hadoop process wikipedia
dump|https://www.google.com/search?q=hadoop%20process%20wikipedia%20dump]").
But wiki markup is quite complex, and to convert it properly to HTML there is
hardly any choice other than setting up your own MediaWiki server and
importing the Wikipedia dumps. The situation for other content management
systems isn't any better: dumps can usually be generated, but the format isn't
standardized. Consequently, there will probably be no way to implement a
generalized tool that allows one to "fetch from website dumps".
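
For illustration only, here is a minimal sketch of the "read the dump" step
(plain JDK StAX, not Nutch or Hadoop code; the class name and the assumption
of a locally decompressed pages-articles XML file are mine). It streams over
the dump and pulls out page titles and raw wikitext, which shows that the
dump hands you wiki markup, not HTML; converting that markup is exactly the
hard part mentioned above:

{code:java}
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example class, not part of Nutch. */
public class WikiDumpReader {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a decompressed *-pages-articles.xml dump file
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream(args[0]));
        String title = null;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("title".equals(name)) {
                    title = reader.getElementText();
                } else if ("text".equals(name)) {
                    // <text> holds raw wiki markup, not rendered HTML
                    String wikitext = reader.getElementText();
                    System.out.println(title + "\t"
                        + wikitext.length() + " chars of wikitext");
                }
            }
        }
        reader.close();
    }
}
{code}

Getting from that wikitext to the HTML a parser plugin expects (templates,
transclusions, etc.) is where a real MediaWiki installation becomes hard to
avoid.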

> Implementing A Feature for Fetching From Websites Dump
> ------------------------------------------------------
>
>                 Key: NUTCH-1615
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1615
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 2.1
>            Reporter: cihad güzel
>            Priority: Minor
>
> Some web sites provide dumps (such as http://dumps.wikimedia.org/enwiki/ for 
> wikipedia.org). We should fetch from dumps for such web sites. Thus fetching 
> will be quicker.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
