Hi all,
is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
have 1000 seed URLs and I want to crawl just these domains. Thanks in
advance.
Regards,
MyD
--
View this message in context:
http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22493314.html
Hi,
you can avoid going to other domains by editing the urlfilter file,
but this is not very practical when you have a lot of seed URLs, which
you do. The nutch-default.xml file has a property,
db.ignore.external.links, which is set to false by default. Set this to
true and you will only crawl
Good point. I only used a long urlfilter a long time ago.
On Fri, Mar 13, 2009 at 9:19 PM, Stevan Kovacevic skovacevi...@gmail.com wrote:
Hi,
you can avoid going to other domains by editing the urlfilter file,
but this is not too practical when you have a lot of seed urls, which
you do. In
Hello,
I used lukeall-0.9.1.jar to manually add a new record to the index produced
by nutch-0.9. I added only the url and title fields, since I was not sure what
to put in the other fields. Now a search for any word gives this error:
HTTP Status 500 -
type Exception report
message
What versions of Lucene are Nutch and Luke using? When you play with
the index you should ensure that the version of Lucene being used is
the same as what Nutch is using.
On Sat, Mar 14, 2009 at 8:41 AM, alx...@aim.com wrote:
Hello,
I used lukeall-0.9.1.jar to manually add a new record
With the release of Nutch 1.0 I think it is a good time to begin a
discussion about the future of Nutch. Here are some things to consider;
I would love to hear everyone's views on this.
Nutch's original intention was as a large-scale www search engine. That
is a very specific goal. Only a
There is a domain-urlfilter that should help do what you are looking for.
Dennis
MyD wrote:
Hi all,
is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
have 1000 seed URLs and I want to crawl just these domains. Thanks in
advance.
Regards,
MyD
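For the seed-URL question quoted above, the host list that a domain-based filter needs can be derived from the seed file rather than typed by hand. The Python sketch below is only an illustration: the function names are hypothetical, the one-host-per-line list is an assumption about what a domain filter file expects, and the `+`/`-` rules follow the style of Nutch's regex urlfilter file. Check both against the Nutch version in use.

```python
from urllib.parse import urlparse


def hosts_from_seeds(seed_urls):
    """Return the sorted, de-duplicated host names found in the seed URLs."""
    hosts = set()
    for url in seed_urls:
        url = url.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comment lines
        host = urlparse(url).hostname  # None when the URL has no scheme/host
        if host:
            hosts.add(host.lower())
    return sorted(hosts)


def regex_filter_lines(hosts):
    """Build one accept rule per host, followed by a final reject-all rule."""
    lines = []
    for host in hosts:
        escaped = host.replace(".", r"\.")  # dots are regex metacharacters
        lines.append("+^https?://" + escaped + "/")
    lines.append("-.")  # reject every URL not accepted above
    return lines


if __name__ == "__main__":
    # Hypothetical seed list; in practice, read the lines of the seed file.
    seeds = ["http://www.example.com/index.html", "http://example.org/"]
    hosts = hosts_from_seeds(seeds)
    print("\n".join(hosts))                      # candidate domain filter entries
    print("\n".join(regex_filter_lines(hosts)))  # candidate regex filter rules
```

Generating the rules from the seed file keeps the filter in step with the seeds when URLs are added or removed.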
Hi,
I use nutch-0.9. I downloaded the lukeall-0.9.1.jar file from
http://www.getopt.org/luke/ and double-clicked it in Windows. That website says
it uses the official Lucene 2.4.0 release JARs.
Thanks.
Alex.
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To:
What do people do when 'something goes wrong' with a crawl?
First, some background: we are a small-ish university using Nutch to
crawl 60,000 - 100,000 pages across 50 or so domains. This probably
puts us in a different category than most nutch users. Our crawl cycle
consists of a script to crawl
Dennis,
I am with you. I am building a large-scale WWW search engine, but
might build a vertical search as well. Aren't the requirements for
building a large-scale WWW search the same as for building a
vertical WWW search? The only thing that seems to change is the scope.
I like
btw, which version of lucene is in nutch-0.9?
Thanks.
Alex.
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Fri, 13 Mar 2009 5:14 pm
Subject: Re: error after adding indexes manually
What versions of Lucene are Nutch
I just checked. (I usually just have the trunk source.)
Nutch 0.9 used lucene-core-2.1.0.jar.
When I've encountered this problem I've solved it in one of two ways:
either upgrading the version of Lucene in Nutch (which will probably require
rebuilding the indexes), or replacing the version in Luke, etc.
I opened the lukeall-0.9.1.jar file, replaced org/apache/lucene with the
org/apache/lucene from the lucene-core-2.1.0.jar file, and built a new
lukeall-0.9.2.jar. Now, when I double-click it, it says: Failed to load
Main-Class manifest attribute from lukeall-0.9.2.jar
Thanks.
Alex.
-Original
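The "Failed to load Main-Class manifest attribute" error typically means that the jar has no Main-Class entry in its META-INF/MANIFEST.MF, which can happen when a jar is repacked without its original manifest. A minimal Python sketch for checking this follows. The function names are hypothetical, and the script is not part of Luke or Nutch; it only reads the manifest out of the archive.

```python
import zipfile


def main_class_from_manifest(text):
    """Return the Main-Class value in manifest text, or None if absent."""
    # Normalise line endings, then unfold wrapped values: a continuation
    # line in a manifest starts with a single space.
    unfolded = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n ", "")
    for line in unfolded.split("\n"):
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None


def main_class_of(jar_path):
    """Return the Main-Class declared by the jar at jar_path, or None."""
    with zipfile.ZipFile(jar_path) as jar:
        try:
            raw = jar.read("META-INF/MANIFEST.MF")
        except KeyError:
            return None  # the archive contains no manifest at all
    return main_class_from_manifest(raw.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    import sys

    # Usage: python check_manifest.py original.jar rebuilt.jar
    for path in sys.argv[1:]:
        print(path, "->", main_class_of(path))
```

Running it on both the original and the rebuilt jar shows whether the Main-Class entry was lost during repacking; if it was, rebuilding the jar with the original manifest is the likely fix.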
Well if your index has already been converted to 2.4 format you will
either have to rebuild your index, or use 2.4 in Nutch.
That's what I've found anyway. Correct me if I'm on the wrong track guys :)
On Sat, Mar 14, 2009 at 1:19 PM, alx...@aim.com wrote:
I get the same error message when I
Where can I find the domain urlfilter? I'm using the branch 0.9...
Cheers,
Markus
Dennis Kubes-2 wrote:
There is a domain-urlfilter that should help do what you are looking for.
Dennis
MyD wrote:
Hi all,
is it possible to limit Nutch's crawling process to the seed URLs? E.g. I