Limit Nutch Crawl to Seed URLs

2009-03-13 Thread MyD
Hi all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance. Regards, MyD

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread Stevan Kovacevic
Hi, you can avoid going to other domains by editing the urlfilter file, but this is not too practical when you have a lot of seed URLs, which you do. In the nutch-default.xml file there is a property, db.ignore.external.links, which is set to false by default. Set this to true and you will only crawl
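A minimal sketch of the property change described above, written in Python so it can be tried outside a Nutch install. It assumes the override goes into a Hadoop-style `<configuration>` file such as conf/nutch-site.xml (the usual place for local overrides, though the message itself only mentions nutch-default.xml). The helper name `set_property` is hypothetical and not part of Nutch.

```python
import xml.etree.ElementTree as ET


def set_property(site_xml: str, name: str, value: str) -> str:
    """Return the config XML with property `name` set to `value`.

    Hypothetical helper: updates the property if it exists, appends it otherwise.
    """
    root = ET.fromstring(site_xml)  # expects a <configuration> root element
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No existing property with that name: add a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Start from an empty override file and switch external links off.
empty_site = "<configuration></configuration>"
print(set_property(empty_site, "db.ignore.external.links", "true"))
```

The same `<property>` element can of course be typed into the file by hand; the script only shows which name and value are involved.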

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread Jack Yu
Good point. I used a long urlfilter only a long time ago. On Fri, Mar 13, 2009 at 9:19 PM, Stevan Kovacevic skovacevi...@gmail.com wrote: Hi, you can avoid going to other domains by editing the urlfilter file, but this is not too practical when you have a lot of seed URLs, which you do. In
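For readers who do want the long urlfilter approach mentioned in this thread, the rules can be generated from the seed list rather than typed by hand. The sketch below assumes the regex urlfilter file format of one rule per line, where `+` accepts and `-` rejects a matching URL, and a final `-.` rejects everything else. The function name `urlfilter_rules` and the example hosts are illustrative only.

```python
from urllib.parse import urlparse


def urlfilter_rules(seed_urls):
    """Build one '+' rule per distinct seed host, then reject everything else."""
    hosts = set()
    for url in seed_urls:
        host = urlparse(url).hostname  # lowercased; None if the URL has no scheme/host
        if host:
            hosts.add(host)
    rules = []
    for host in sorted(hosts):
        escaped = host.replace(".", "\\.")  # dots are the only regex-special chars in hostnames
        # Accept the host itself and any subdomain of it.
        rules.append("+^https?://([a-z0-9-]+\\.)*" + escaped + "/")
    rules.append("-.")  # skip everything that matched no rule above
    return rules


seeds = ["http://www.example.com/start", "http://example.org/"]
print("\n".join(urlfilter_rules(seeds)))
```

With 1000 seeds this produces roughly 1000 rules, which is the practicality concern raised above; the db.ignore.external.links setting avoids maintaining such a file.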

error after adding indexes manually

2009-03-13 Thread alxsss
Hello, I used lukeall-0.9.1.jar to manually add a new record to the index produced by nutch-0.9. I added only the url and title fields, since I was not sure what to put in the other fields. Now a search for any word gives me this error: HTTP Status 500 - type Exception report message

Re: error after adding indexes manually

2009-03-13 Thread Lyndon Maydwell
What versions of Lucene are Nutch and Luke using? When you play with the index you should ensure that the version of Lucene being used is the same as what Nutch is using. On Sat, Mar 14, 2009 at 8:41 AM, alx...@aim.com wrote: Hello, I used lukeall-0.9.1.jar to manually add a new record

The Future of Nutch

2009-03-13 Thread Dennis Kubes
With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this. Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread Dennis Kubes
There is a domain-urlfilter that should help do what you are looking for. Dennis MyD wrote: Hi all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance. Regards, MyD

Re: error after adding indexes manually

2009-03-13 Thread alxsss
Hi, I use nutch-0.9. I downloaded the lukeall-0.9.1.jar file from http://www.getopt.org/luke/ and double-clicked it in Windows. That website says: It uses the official Lucene 2.4.0 release JARs. Thanks. Alex. -Original Message- From: Lyndon Maydwell maydw...@gmail.com To:

Index Disaster Recovery

2009-03-13 Thread Eric J. Christeson
What do people do when 'something goes wrong' with a crawl? First, some background: we are a small-ish university using Nutch to crawl 60,000 - 100,000 pages across 50 or so domains. This probably puts us in a different category than most Nutch users. Our crawl cycle consists of a script to crawl

Re: The Future of Nutch

2009-03-13 Thread John Martyniak
Dennis, I am with you. I am building a large-scale www search engine, but might also build a vertical search as well. Aren't the requirements for building a large-scale www search the same as for building a vertical www search? The only thing that seems to change is the scope. I like

Re: error after adding indexes manually

2009-03-13 Thread alxsss
Btw, which version of Lucene is in nutch-0.9? Thanks. Alex. -Original Message- From: Lyndon Maydwell maydw...@gmail.com To: nutch-user@lucene.apache.org Sent: Fri, 13 Mar 2009 5:14 pm Subject: Re: error after adding indexes manually What versions of Lucene are Nutch

Re: error after adding indexes manually

2009-03-13 Thread Lyndon Maydwell
I just checked. (I usually just have the trunk source.) Nutch 0.9 used lucene-core-2.1.0.jar. When I've encountered this problem I've solved it in one of two ways: either upgrading the version of Lucene in Nutch (which will probably require rebuilding the indexes), or replacing the version in Luke, etc.

Re: error after adding indexes manually

2009-03-13 Thread alxsss
I opened the lukeall-0.9.1.jar file, replaced org/apache/lucene with the org/apache/lucene from the lucene-core-2.1.0.jar file, and built a new lukeall-0.9.2.jar. Now, when I double-click it, it says: Failed to load Main-Class manifest attribute from lukeall-0.9.2.jar Thanks. Alex. -Original
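The "Failed to load Main-Class manifest attribute" message is what the Java launcher prints when a jar's META-INF/MANIFEST.MF has no Main-Class entry, so one plausible cause here (an assumption, since the thread does not confirm it) is that the repackaged jar lost the original manifest. A rough Python check, with the hypothetical helper names `main_class` and `make_jar` and a made-up class name `example.Main`:

```python
import io
import zipfile


def main_class(jar):
    """Return the Main-Class value from a jar's manifest, or None if absent.

    `jar` may be a file path or a file-like object. Values wrapped onto
    manifest continuation lines are not handled in this sketch.
    """
    with zipfile.ZipFile(jar) as zf:
        try:
            manifest = zf.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:  # the jar has no manifest at all
            return None
    for line in manifest.splitlines():
        if line.startswith("Main-Class:"):
            return line.split(":", 1)[1].strip()
    return None


def make_jar(manifest_text):
    """Build a tiny in-memory jar for demonstration purposes only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        if manifest_text is not None:
            zf.writestr("META-INF/MANIFEST.MF", manifest_text)
        zf.writestr("placeholder.txt", "")
    buf.seek(0)
    return buf


good = make_jar("Manifest-Version: 1.0\r\nMain-Class: example.Main\r\n")
bad = make_jar("Manifest-Version: 1.0\r\n")
print(main_class(good), main_class(bad))
```

Running `main_class` on both the original and the rebuilt jar would show whether the entry went missing. If it did, rebuilding with the `jar` tool's `m` option (for example `jar cfm`) and the original manifest file is the usual way to carry it over.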

Re: error after adding indexes manually

2009-03-13 Thread Lyndon Maydwell
Well, if your index has already been converted to 2.4 format you will either have to rebuild your index, or use 2.4 in Nutch. That's what I've found anyway. Correct me if I'm on the wrong track guys :) On Sat, Mar 14, 2009 at 1:19 PM, alx...@aim.com wrote: I get the same error message when I

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread MyD
Where can I find the domain urlfilter? I'm using the branch 0.9... Cheers, Markus Dennis Kubes-2 wrote: There is a domain-urlfilter that should help do what you are looking for. Dennis MyD wrote: Hi all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I