Re: Scalability for one site

2009-11-16 Thread Alex McLintock
2009/11/16 Mark Kerzner : > Hi, > > I want to politely crawl a site with 1-2 million pages. With the speed of > about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, > and can I coordinate the crawlers so as not to cause a DoS attack? Nutch basically uses Hadoop - or an older
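
The "it will take weeks" estimate in the quoted question can be checked with a little arithmetic. This is only a sketch: the function name crawl_duration_days is made up for illustration, and it assumes a single site fetched strictly one page at a time, with the per-fetch time already including any politeness delay.

```python
SECONDS_PER_DAY = 86_400


def crawl_duration_days(pages: int, seconds_per_fetch: float) -> float:
    """Estimate the wall-clock days needed to fetch `pages` pages sequentially.

    Assumes one request at a time against a single host, which is what a
    polite crawl of one site amounts to.
    """
    if pages < 0 or seconds_per_fetch < 0:
        raise ValueError("pages and seconds_per_fetch must be non-negative")
    return pages * seconds_per_fetch / SECONDS_PER_DAY


# The two ends of the range mentioned in the question.
best_case = crawl_duration_days(1_000_000, 1.0)   # 1 million pages, 1 s each
worst_case = crawl_duration_days(2_000_000, 2.0)  # 2 million pages, 2 s each

print(f"best case:  {best_case:.1f} days")
print(f"worst case: {worst_case:.1f} days")
```

Under these assumptions the range is roughly 12 to 46 days. If the limit is a per-host delay, adding more machines would probably not shorten this for a single site, since the site itself is the bottleneck.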

Re: Nutch to SolR. First steps

2009-08-12 Thread Alex McLintock
e your emailer hasn't tied the two together. Alex 2009/8/11 Alex McLintock : > Further information to this > > I'm running on a single machine in fake clustering mode. > > A tmp directory gets created, with nothing but another empty directory > inside of it. > >

Re: How do I get all the documents in the index without searching?

2009-08-12 Thread Alex McLintock
Try looking at how the indexers work. They *do* iterate through all the documents in the crawl (or rather one segment at a time). However, they do it in a Hadoop way... 2009/8/11 Paul Tomblin : > I want to iterate through all the documents that are in the crawl, > programmatically. The only code

Re: Nutch to SolR. First steps

2009-08-11 Thread Alex McLintock
dding org.apache.nutch.indexer.anchor.AnchorIndexingFilter Is Solr output a plugin, and is it not set up above? 2009/8/11 Alex McLintock : > I'm trying to send my Nutch crawl to SolR. I've "generated, fetched, > updated", several times. I've done an invertlinks. > But

Nutch to SolR. First steps

2009-08-11 Thread Alex McLintock
I'm trying to send my Nutch crawl to SolR. I've "generated, fetched, updated", several times. I've done an invertlinks. But when I try to do the solrindex it just sits there for ages and doesn't seem to stress the Solr server at all. I'm using Nutch 1.0, Sun Java 1.6, Ubuntu Linux 9.04. /local/app

Focussed Web Crawling with Nutch

2009-07-31 Thread Alex McLintock
I've been using a Perl-based focussed web crawler with a MySQL back end, but am now looking at Nutch instead. It seems like a few other people have done something similar. I'm wondering whether we could pool our resources and work together on this? It seems to me that we would be building a few ex

Re: Gracefull stop in the middle of a fetch phase ?

2009-07-25 Thread Alex McLintock
2009/7/25 Andrzej Bialecki : > I solved this once by implementing a check in Fetcher.run() for a marker > file on HDFS. If the presence of this file was detected, the FetcherThreads > would be stopped one by one (again, by setting a flag in their run() methods > to terminate the loop). > Personal
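
The marker-file approach described in the quoted message can be sketched roughly as follows. This is not Nutch code: the class MarkerFileFetcher and all its attributes are hypothetical, the marker is looked for on the local filesystem rather than on HDFS, and the "fetch" is a short sleep standing in for real work.

```python
import threading
import time
from pathlib import Path


class MarkerFileFetcher:
    """Toy fetcher that shuts down cleanly once a marker file appears."""

    def __init__(self, marker: Path, num_threads: int = 3,
                 poll_interval: float = 0.05) -> None:
        self.marker = marker
        self.poll_interval = poll_interval
        self.fetched = 0
        self._lock = threading.Lock()
        # One stop flag per worker, so workers can be stopped one by one.
        self.stop_flags = [threading.Event() for _ in range(num_threads)]
        self.threads = [
            threading.Thread(target=self._worker, args=(flag,))
            for flag in self.stop_flags
        ]

    def _worker(self, stop_flag: threading.Event) -> None:
        # The loop ends after the current "fetch" once the flag is set.
        while not stop_flag.is_set():
            time.sleep(0.01)  # stand-in for fetching one URL
            with self._lock:
                self.fetched += 1

    def run(self) -> None:
        for thread in self.threads:
            thread.start()
        # Poll for the marker file; its presence means "please stop".
        while not self.marker.exists():
            time.sleep(self.poll_interval)
        # Stop the workers one at a time, waiting for each to finish.
        for flag, thread in zip(self.stop_flags, self.threads):
            flag.set()
            thread.join()
```

To use it, call run() (for example from a background thread) and create the marker file when the fetch should wind down; each worker finishes the item it is on and then exits, so nothing is cut off mid-fetch.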

Re: Gracefull stop in the middle of a fetch phase ?

2009-07-25 Thread Alex McLintock
I am not sure if it solves your problem, but you might do something like disconnect your machines from the internet - preferably by making your DNS server return "don't know that domain". This will relatively quickly cause the remaining part of the fetch to fail. Just a suggestion... Alex 2009/7/2

Re: error in using generate command

2009-07-23 Thread Alex McLintock
Why does your example say both monster.crawl and test.crawl ? Are you perhaps entering the command wrong or is this just an error in the email? Alex 2009/7/18 Beats : > > hi, > > i m getting this weird error ( at least for me): > > i m trying to crawl a some web pages.. > with normal crawl comm

Re: Tutorial followup - Nutch webapp not seeing stuff?

2009-07-15 Thread Alex McLintock
2009/7/14 : > BUT, I think that I may have just gotten an idea about why this was not > working. > > It looks like when I run the nutch crawl, the "index" and "indexes" > directories are not being created until the crawl is completely done. > > [Is this normal nutch behavior???] I believe that

Re: Nutch Tutorial 1.0 based off of the French Version

2009-07-14 Thread Alex McLintock
2009/7/14 Jake Jacobson : > I did attach it. > I am afraid that I can't see anything either. Can you perhaps upload it somewhere and link to it? I'd like to say thank you for your effort. We could do with more tutorials which look at it in different ways. Alex

Re: Just getting started w/tutorial- errors in crawl.log

2009-07-14 Thread Alex McLintock
> but I get a number of messages in crawl.log, like: > > Error parsing: http://lucene.apache.org/skin/getMenu.js: > org.apache.nutch.parse.ParseException: parser not found for > contentType=application/javascript > url=http://lucene.apache.org/skin/getMenu.js >        at org.apache.nutch.parse.P

Re: Integrating Nutch frontend with Backend.

2009-07-13 Thread Alex McLintock
Hello Zaihan, So you have your servlet container running and providing a web application, but it doesn't know where your crawled data is. Find the nutch-site file, something like /var/lib/tomcat6/webapps/ROOT/WEB-INF/classes/nutch-site.xml, and make sure it contains something like se

Solr Integration since v1.0 ?

2009-07-07 Thread Alex McLintock
I've looked at a lot of tutorials for linking Nutch and Solr but it seems that this has been improved a lot in version 1.0. Can anyone point me at documentation which takes this into account? Cheers Alex

Writing Plugins - Documentation?

2009-07-06 Thread Alex McLintock
Can anyone point me at some Nutch plugin documentation which goes into more detail than http://wiki.apache.org/nutch/WritingPluginExample-0.9 ? I want to understand all the different places where you might put a plugin and why/how you might do so. Basically I am trying to extract some information ou

Re: Problems when deploy nutch-1.0.war

2009-07-04 Thread Alex McLintock
OK, here is how I fixed this in my Ubuntu 9.04 setup using the normal tomcat6 Ubuntu package. I added this permission line to /etc/tomcat6/policy.d/04webapps.policy: grant { // Attempt to get Nutch working permission java.security.AllPermission; Now I can get the Nutch web app worki

Re: Problems when deploy nutch-1.0.war

2009-07-04 Thread Alex McLintock
2009/7/4 xiao yang : > I have downloaded Tomcat6 from the official site and reinstalled it > manually. It works! > Maybe there's something wrong with Tomcat6 in the ubuntu mirror site. Oops - I just posted exactly the same problem with the same stack trace, same Ubuntu version, same Tomcat versio

Getting Nutch1.0 example working in tomcat 6 (on ubuntu)

2009-07-04 Thread Alex McLintock
I'm trying out nutch-1.0 release on Ubuntu 9.04. I'm trying to follow the 0.8 version tutorial found at http://lucene.apache.org/nutch/tutorial8.html Now I seem to be able to do the web crawling and create indexes. I can do searches from the command line using eg /local/apps/software/nutch$ bin