Lewis John McGibbney created NUTCH-2020:
-------------------------------------------

             Summary: Establish Butch - the Continuous Benchmarking Evaluation 
for Nutch
                 Key: NUTCH-2020
                 URL: https://issues.apache.org/jira/browse/NUTCH-2020
             Project: Nutch
          Issue Type: Bug
          Components: deployment
    Affects Versions: 2.4, 1.11
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 2.4, 1.11


I would like to initiate something I've provisionally called BUTCH with the aim 
of providing a continuous benchmarking evaluation for Nutch. 
I wrote a utility script called 
[nipt](https://github.com/lewismc/nipt/blob/master/bootstrap.sh) which 
essentially pulls the top 1M URLs from Alexa, does some simple reformatting 
using sed and provides us with a flat file of those URLs.
Loads of these are obviously porn related (and god knows what else), so I 
would not advise injecting this garbage into any crawldb that you own or 
administer.
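
The reformatting step described above can be sketched roughly as follows. This is not the nipt script itself (that one uses sed); it is a minimal Python sketch that assumes the Alexa download is a CSV with one `rank,domain` pair per line. The function name `to_seed_urls` and the default `http` scheme are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator


def to_seed_urls(csv_lines: Iterable[str], scheme: str = "http") -> Iterator[str]:
    """Turn 'rank,domain' lines into one seed URL per entry.

    Assumption: each input line looks like '1,example.com'. Lines
    without a comma or with an empty host are skipped.
    """
    for line in csv_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only: rank on the left, host on the right.
        _rank, _sep, host = line.partition(",")
        if host:
            yield f"{scheme}://{host}/"


# Tiny made-up sample; a real run would iterate over the downloaded file.
sample = ["1,example.com", "2,example.org", ""]
seeds = list(to_seed_urls(sample))
print("\n".join(seeds))
```

Writing the resulting lines to a flat file would give a seed list in the one-URL-per-line shape described above.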
I want to augment the [Benchmark 
tool](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/Benchmark.java)
 to simulate injecting the URLs produced by the script and fetching them. 
Essentially this could run continuously, with us sending results to the dev@ 
list or making them available via some GUI.
The first step is for me to code this up. The second step is for me to get 
Apache Infra to provide us with some nice machines (courtesy of Rackspace) 
which can host this for us. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
