[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]

Stefan Groschupf updated NUTCH-357:
-----------------------------------

    Attachment: protocol-simulation-pluginV1.patch

A very first preview of a plugin that helps to simulate crawls. This protocol 
plugin can replace the http protocol plugin and return predefined 
content during a fetch. To simulate custom scenarios, an interface named 
Simulator can be implemented; it has just one method. 
The plugin comes with a very simple basic Simulator implementation; however, 
it already allows simulating the currently known Nutch scoring problems, such as 
pages pointing to themselves or link chains. 
For more details see the Javadoc; I plan to improve the Javadoc together with 
a native speaker. 
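
To illustrate the idea, here is a minimal, self-contained sketch of what a 
one-method simulator could look like. It does not reproduce the code in the 
attached patch: the method name simulate(String), the class names 
SimulationSketch, SelfLinkSimulator and LinkChainSimulator, the page() helper 
and the simulated.example URLs are all assumptions made up for this example, 
and the real Simulator interface in the patch may have a different signature. 

```java
public class SimulationSketch {

    /** Hypothetical stand-in for the plugin's Simulator interface (signature assumed). */
    interface Simulator {
        /** Returns the HTML that the simulated "web server" serves for the given URL. */
        String simulate(String url);
    }

    /** Scenario 1: every page contains exactly one link, pointing to itself. */
    static class SelfLinkSimulator implements Simulator {
        public String simulate(String url) {
            return page(url, url);
        }
    }

    /** Scenario 2: a link chain prefix+0 -> prefix+1 -> ... -> prefix+(length-1). */
    static class LinkChainSimulator implements Simulator {
        private final String prefix;
        private final int length;

        LinkChainSimulator(String prefix, int length) {
            this.prefix = prefix;
            this.length = length;
        }

        public String simulate(String url) {
            if (!url.startsWith(prefix)) {
                throw new IllegalArgumentException("URL is not part of the chain: " + url);
            }
            // The numeric suffix of the URL is the position in the chain.
            int index = Integer.parseInt(url.substring(prefix.length()));
            // The last page of the chain has no outgoing link.
            String next = (index + 1 < length) ? prefix + (index + 1) : null;
            return page(url, next);
        }
    }

    /** Builds a tiny HTML page with at most one outgoing link (helper for this sketch only). */
    static String page(String url, String target) {
        String link = (target == null) ? "" : "<a href=\"" + target + "\">next</a>";
        return "<html><head><title>" + url + "</title></head><body>" + link + "</body></html>";
    }

    public static void main(String[] args) {
        String prefix = "http://simulated.example/page/";
        Simulator chain = new LinkChainSimulator(prefix, 3);
        for (int i = 0; i < 3; i++) {
            System.out.println(chain.simulate(prefix + i));
        }
    }
}
```

In a setup like this, the protocol plugin would ask the configured simulator 
for the content of each URL instead of opening a network connection, so the 
same crawl can be repeated as often as needed without touching a real server. 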

Feedback is welcome. 

> crawling simulation
> -------------------
>
>                 Key: NUTCH-357
>                 URL: http://issues.apache.org/jira/browse/NUTCH-357
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>             Fix For: 0.9.0
>
>         Attachments: protocol-simulation-pluginV1.patch
>
>
> We recently discovered some serious issues related to crawling and scoring. 
> Reproducing these problems is difficult: first, it is 
> not polite to re-crawl a set of pages again and again; second, it is 
> difficult to catch the page that causes a problem. 
> Therefore it would be very useful to have a testbed to simulate crawls where 
> we can control the responses of the "web servers". 
> To begin with, simulating very basic situations, such as a page that points to 
> itself, link chains, or internal links, would already be very useful. 
> Later on, however, simulating crawls against existing data collections like TREC 
> or a webgraph would be much more interesting, for instance to calculate the 
> quality of the Nutch OPIC implementation against the PageRank scores of the 
> webgraph, or to evaluate crawling strategies. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
