2009/11/16 Mark Kerzner :
> Hi,
>
> I want to politely crawl a site with 1-2 million pages. With the speed of
> about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
> and can I coordinate the crawlers so as not to cause a DOS attack?
Nutch basically uses Hadoop - or an older
e your emailer hasn't
tied the two together.
Alex
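On the politeness side, Nutch 1.0 throttles fetching per host through nutch-site.xml. A minimal sketch, placed inside the `<configuration>` element; the values here are illustrative, not recommendations:

```xml
<!-- nutch-site.xml fragment: per-host fetcher politeness (illustrative values) -->
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
  <description>Seconds to wait between successive requests to the same server.</description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum concurrent fetches against any one host.</description>
</property>
```

With a per-server delay and one thread per host, adding Hadoop nodes raises overall throughput without hammering any single site.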
2009/8/11 Alex McLintock :
> Further information to this
>
> I'm running on a single machine in fake clustering mode.
>
> A tmp directory gets created, with nothing but another empty directory
> inside of it.
>
>
Try looking at how the indexers work. They *do* iterate through all
the documents in the crawl (or rather one segment at a time). However
they do it in a Hadoop way...
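If you want to walk the crawl from the shell rather than writing a Hadoop job, Nutch 1.0 also ships a SegmentReader tool (`bin/nutch readseg`). A minimal sketch, assuming the usual `crawl/segments/<timestamp>` layout; the function only prints the per-segment commands so you can inspect them before running anything:

```shell
# Print a "readseg -dump" command for each segment in a crawl directory.
# The crawl layout (crawl/segments/<timestamp>) is an assumption; adjust to yours.
dump_segments() {
  local crawl_dir="$1"
  for seg in "$crawl_dir"/segments/*/; do
    [ -d "$seg" ] || continue   # no segments yet: print nothing
    echo bin/nutch readseg -dump "$seg" "dump/$(basename "$seg")"
  done
}
```

Piping the output through `sh` from the Nutch install directory would actually run the dumps; each segment's plain-text dump then lands under `dump/<segment-name>`.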
2009/8/11 Paul Tomblin :
> I want to iterate through all the documents that are in the crawl,
> programmatically. The only code
adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
Is Solr output a plugin, and is it not set up above?
2009/8/11 Alex McLintock :
> I'm trying to send my Nutch crawl to Solr. I've "generated, fetched,
> updated", several times. I've done an invertlinks.
> But
I'm trying to send my Nutch crawl to Solr. I've "generated, fetched,
updated" several times. I've done an invertlinks.
But when I try to do the solrindex it just sits there for ages and
doesn't seem to stress the Solr server at all.
I'm using Nutch 1.0, Sun Java 1.6, Ubuntu Linux 9.04.
/local/app
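For comparison, the Nutch 1.0 `solrindex` job takes the Solr URL, the crawldb, the linkdb, and one or more segments. A sketch of the invocation with assumed paths ("crawl" stands in for whatever directory your crawl wrote into):

```shell
# Build the command as a string so it can be inspected first; the segments
# glob is left unexpanded here and would match your real segments when run.
SOLR_URL="http://localhost:8983/solr"
CMD="bin/nutch solrindex $SOLR_URL crawl/crawldb crawl/linkdb crawl/segments/*"
echo "$CMD"
```

If the job just sits there, it is worth confirming the Solr URL is reachable (e.g. with curl) before blaming the indexer.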
I've been using a Perl-based focussed web crawler with a MySQL back
end, but am now looking at Nutch instead. It seems like a few other
people have done something similar. I'm wondering whether we could
pool our resources and work together on this?
It seems to me that we would be building a few ex
2009/7/25 Andrzej Bialecki :
> I solved this once by implementing a check in Fetcher.run() for a marker
> file on HDFS. If the presence of this file was detected, the FetcherThreads
> would be stopped one by one (again, by setting a flag in their run() methods
> to terminate the loop).
>
Personally, I am not sure if it solves your problem, but you might do something
like disconnecting your machines from the internet - preferably by making
your DNS server return "don't know that domain".
This will relatively quickly cause the remaining part of the fetch to fail.
Just a suggestion...
Alex
2009/7/2
Why does your example say both monster.crawl and test.crawl ?
Are you perhaps entering the command wrong or is this just an error in
the email?
Alex
2009/7/18 Beats :
>
> hi,
>
> I'm getting this weird error (at least for me):
>
> I'm trying to crawl some web pages..
> with the normal crawl comm
2009/7/14 :
> BUT, I think that I may have just gotten an idea about why this was not
> working.
>
> It looks like when I run the nutch crawl, the "index" and "indexes"
> directories are not being created until the crawl is completely done.
>
> [Is this normal Nutch behavior???]
I believe that
2009/7/14 Jake Jacobson :
> I did attach it.
>
I am afraid that I can't see anything either. Can you perhaps upload it
somewhere and link to it?
I'd like to say thank you for your effort. We could do with more
tutorials which look at it in different ways.
Alex
> but I get a number of messages in crawl.log, like:
>
> Error parsing: http://lucene.apache.org/skin/getMenu.js:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/javascript
> url=http://lucene.apache.org/skin/getMenu.js
> at org.apache.nutch.parse.P
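That error means no parser plugin claimed contentType=application/javascript. If you actually want JavaScript fetched and parsed, make sure parse-js is among the enabled plugins in nutch-site.xml. A sketch only - the value below mirrors a common 1.0 setup with parse-js enabled; keep whatever other plugins you already rely on:

```xml
<!-- nutch-site.xml fragment (sketch): ensure parse-js is enabled -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Alternatively, if you don't care about .js content, excluding those URLs with a urlfilter pattern avoids the error entirely.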
Hello Zaihan,
So you have your servlet container running, providing a web application,
but it doesn't know where your crawled data is.
Find the nutch-site.xml file, something like
/var/lib/tomcat6/webapps/ROOT/WEB-INF/classes/nutch-site.xml
And make sure it contains something like
se
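The snippet is cut off above; the property the web app needs is searcher.dir, pointing at the crawl output. A sketch, where the path /local/crawl is an assumption - use wherever your crawl actually wrote its data:

```xml
<!-- nutch-site.xml fragment (sketch): point the web app at the crawl data -->
<property>
  <name>searcher.dir</name>
  <value>/local/crawl</value>
  <description>Directory containing crawldb, linkdb, index, and segments.</description>
</property>
```

After editing the file, restart the servlet container so the web app picks up the change.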
I've looked at a lot of tutorials for linking Nutch and Solr but it
seems that this has been improved a lot in version 1.0.
Can anyone point me at documentation which takes this into account?
Cheers
Alex
Can anyone point me at some Nutch Plugin documentation which goes into
more detail than
http://wiki.apache.org/nutch/WritingPluginExample-0.9
I want to understand all the different places where you might put a
plugin and why/how you might do so.
Basically I am trying to extract some information ou
OK, here is how I fixed this in my Ubuntu 9.04 setup using the normal
tomcat6 Ubuntu package.
I added this permission line to /etc/tomcat6/policy.d/04webapps.policy:
grant {
  // Attempt to get Nutch working
  permission java.security.AllPermission;
};
Now I can get the Nutch web app working.
2009/7/4 xiao yang :
> I have downloaded Tomcat6 from the official site and reinstalled it
> manually. It works!
> Maybe there's something wrong with Tomcat6 in the ubuntu mirror site.
Oops - I just posted exactly the same problem with the same stack trace,
same Ubuntu version, same Tomcat version.
I'm trying out the nutch-1.0 release on Ubuntu 9.04, following the
0.8 version tutorial found at
http://lucene.apache.org/nutch/tutorial8.html
Now I seem to be able to do the web crawling and create indexes. I can
do searches from the command line using
eg
/local/apps/software/nutch$ bin
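For reference, the command-line search in that tutorial invokes NutchBean directly from the install directory. A sketch, where the query term "apache" is just an example:

```shell
# Run from the Nutch install directory; "apache" is an example query term.
# searcher.dir in nutch-site.xml (or the current directory) must point at the crawl.
CMD="bin/nutch org.apache.nutch.searcher.NutchBean apache"
echo "$CMD"
```

The tool prints the number of hits followed by the top matching URLs.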