[
https://issues.apache.org/jira/browse/NUTCH-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-2020:
----------------------------------------
Issue Type: New Feature (was: Bug)
> Establish Butch - the Continuous Benchmarking Evaluation for Nutch
> ------------------------------------------------------------------
>
> Key: NUTCH-2020
> URL: https://issues.apache.org/jira/browse/NUTCH-2020
> Project: Nutch
> Issue Type: New Feature
> Components: deployment
> Affects Versions: 2.4, 1.11
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 2.4, 1.11
>
>
> I would like to initiate something I've provisionally called BUTCH with the
> aim of providing a continuous benchmarking evaluation for Nutch.
> I wrote a utility script called
> [nipt](https://github.com/lewismc/nipt/blob/master/bootstrap.sh) which
> essentially pulls the top 1M URLs from Alexa, does some simple reformatting
> using sed and provides us with a flat file containing the top 1M URLs.
> Loads of these are obviously porn (and god knows whatever else) related so I
> would not advise injecting this garbage into any crawldb that you own or
> administer.
> I want to augment the [Benchmark
> tool](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/Benchmark.java)
> to inject the URL list produced by the script and fetch those URLs. Essentially this
> could run continuously with us sending results to the dev@ list or making
> them available via some GUI.
> The first step is for me to code this up. The second step is for me to get
> Apache Infra to provide us with some nice machines (courtesy of Rackspace)
> which can host this for us.
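
The reformatting step in the description (ranked Alexa list in, flat file of URLs out) could look roughly like the sketch below. This is not the nipt script, which uses sed; it is a Python stand-in. It assumes the Alexa download is a CSV of `rank,domain` rows, and the function name `alexa_csv_to_seed_urls`, the `http` scheme, and the sample rows are all illustrative.

```python
import csv
import io


def alexa_csv_to_seed_urls(csv_text, scheme="http"):
    """Turn 'rank,domain' rows into a list of seed URLs.

    Assumption: each data row has exactly two fields, rank then domain.
    Rows that do not match that shape (including blank lines) are skipped.
    """
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue
        domain = row[1].strip()
        if domain:
            urls.append(f"{scheme}://{domain}/")
    return urls


# Illustrative sample rows, not real benchmark input.
sample = "1,google.com\n2,facebook.com\n\n3,youtube.com\n"
seed_urls = alexa_csv_to_seed_urls(sample)

# One URL per line, i.e. the flat-file layout a seed list would use.
print("\n".join(seed_urls))
```

Written to a file, one URL per line, the result would be the kind of flat list the description talks about injecting.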
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)