Re: ScoringFilter always increasing a fetched site's score

2011-02-02 Thread Tim Pease
On Feb 2, 2011, at 5:18 AM, David Saile wrote: Hi all, I have a question concerning updating a site's score in Nutch 1.2. In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a call to scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);

Re: Solr 3.1

2011-05-05 Thread Tim Pease
On May 5, 2011, at 11:50 AM, Julien Nioche wrote: Tim, We're about to commit the upgrade of SOLR in the trunk and this should be released as 1.3 shortly. See https://issues.apache.org/jira/browse/NUTCH-983 Thanks for the update. I'll wait for the next release then and hold off on the

Solr 3.3

2011-07-06 Thread Tim Pease
Seems like I just finished upgrading to Solr 3.2 and a new version is released! Anyway, is the Solrj client shipping with Nutch 1.3 compatible with the new Solr 3.3 release? Is there any reason from the Nutch end to hold off on upgrading Solr? Apologies for the fairly simple question, but if

Re: optimizing crawl

2011-07-06 Thread Tim Pease
On Jul 6, 2011, at 10:59 AM, Cam Bazz wrote: Hello, I am crawling multiple sites, in the range of hundreds, with 256 concurrent threads and 4 connections per site at a time. It seems that if a site is having a bad day, all the threads slow down, and this site basically clogs all the threads.

Re: Solr 3.3

2011-07-06 Thread Tim Pease
On Jul 6, 2011, at 11:09 AM, Markus Jelsma wrote: Javabin version hasn't changed. You can use it. Thanks for the quick answer. Solr 3.3 is working flawlessly with our Nutch 1.3 install. Blessings, TwP On Wednesday 06 July 2011 18:59:55 Tim Pease wrote: Seems like I just finished

meta robots directive

2011-07-11 Thread Tim Pease
Currently Nutch supports the <meta name="robots" content="noindex"> directive in the head of individual pages. I would like to extend this feature to accept the http.agent.name value as a valid meta name in addition to robots. For example, in your nutch-site.xml file, if you have the property
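To make the proposal concrete, here is a minimal stand-alone sketch of the matching rule described above: a noindex directive applies when the meta tag's name is either robots or the configured agent name. The class and method names are hypothetical and are not part of Nutch, the case-insensitive comparison is an assumption, and the agent name "MyCrawler" in the usage note is only a placeholder for whatever http.agent.name is set to.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not Nutch code: decides whether a meta tag's
// noindex directive should be honoured by this crawler.
final class MetaRobotsCheck {
    private MetaRobotsCheck() {}

    // True when the meta name is "robots" or matches the configured agent
    // name. Case-insensitive matching is an assumption of this sketch.
    static boolean appliesTo(String metaName, String agentName) {
        if (metaName == null) {
            return false;
        }
        String name = metaName.trim();
        if (name.equalsIgnoreCase("robots")) {
            return true;
        }
        return agentName != null
            && !agentName.trim().isEmpty()
            && name.equalsIgnoreCase(agentName.trim());
    }

    // True when the comma-separated content attribute lists "noindex".
    static boolean hasNoIndex(String content) {
        if (content == null) {
            return false;
        }
        return Arrays.stream(content.split(","))
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            .anyMatch("noindex"::equals);
    }

    // Combines both checks for a single meta tag.
    static boolean shouldSkipIndexing(String metaName, String content, String agentName) {
        return appliesTo(metaName, agentName) && hasNoIndex(content);
    }
}
```

With the agent name set to "MyCrawler", a tag named robots or mycrawler with content noindex would be skipped, while a tag addressed to some other bot would be ignored.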

running tests from the command line

2011-07-12 Thread Tim Pease
At the root of the Nutch 1.3 project, what is the magic ant incantation to run only the tests for the plugin I'm currently hacking away on? I'm looking for the command line syntax. Blessings, TwP

Re: Giving priority to seeds

2011-10-04 Thread Tim Pease
On Oct 4, 2011, at 4:03 AM, Danicela nutch wrote: Hi, I want to make a ScoringFilter plugin which will give priority to the seeds file. I mean, I have a crawldb and a seeds file with links, I set topN=5 to test, and I want my seed links to be fetched first, before what I have in the
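The idea in the question can be illustrated without touching Nutch at all: give every seed URL a large boost to its sort value, then take the top N. The sketch below is a stand-alone illustration only. It does not use the ScoringFilter API, and the class name, the method names, and the SEED_BOOST constant are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone illustration; it does not use the Nutch API.
final class SeedPrioritySketch {
    // Assumed boost, chosen to outrank any non-seed score in this example.
    static final float SEED_BOOST = 1000.0f;

    private SeedPrioritySketch() {}

    // Sort value for a URL: its score, plus the boost when it is a seed.
    static float sortValue(String url, float score, Set<String> seeds) {
        return seeds.contains(url) ? score + SEED_BOOST : score;
    }

    // Returns up to topN URLs ordered by descending sort value.
    static List<String> selectTopN(Map<String, Float> scores, Set<String> seeds, int topN) {
        List<String> urls = new ArrayList<>(scores.keySet());
        urls.sort(Comparator.comparingDouble(
            (String url) -> sortValue(url, scores.get(url), seeds)).reversed());
        return new ArrayList<>(urls.subList(0, Math.min(topN, urls.size())));
    }
}
```

A fixed additive boost only works while ordinary scores stay below it, so a real plugin would need to choose the boost (or a separate seed flag) with the actual score range in mind.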

detailed test output?

2011-11-28 Thread Tim Pease
I've made some modifications to Nutch to suit some requirements at work. However, my changes have caused one of the JUnit tests to fail. The output from running `ant test` is none too helpful. All it tells me is BUILD FAILED - good luck scrolling through a thousand lines of output to find that

Re: detailed test output?

2011-11-28 Thread Tim Pease
On Nov 28, 2011, at 10:38 PM, Tim Pease wrote: I've made some modifications to Nutch to suit some requirements at work. However, my changes have caused one of the JUnit tests to fail. The output from running `ant test` is none too helpful. All it tells me is BUILD FAILED - good luck

Download older versions of Nutch?

2011-11-28 Thread Tim Pease
I've noticed that the mirrors only contain downloadable assets for Nutch 1.4. Is there a location where older versions of Nutch can be downloaded? Blessings, TwP

new nutch tool

2011-12-05 Thread Tim Pease
I am in the process of writing a new Nutch tool that will index documents into the ElasticSearch [http://www.elasticsearch.org/] search engine. Can and should this tool be created as a plugin? Are there any examples of tools being created as plugins? More generally, how should a new tool be

Re: Trouble running solrindexer from Nutch 1.4

2011-12-07 Thread Tim Pease
On Dec 7, 2011, at 3:17 PM, Chip Calhoun wrote: This is probably just down to my not waiting for a 1.4 tutorial, but here goes. I've always used the following two commands to run my crawl and then index to Solr: # bin/nutch crawl urls -dir crawl -depth 1 -topN 50 # bin/nutch solrindex