repeatedly refetching the same site, without consent

2007-07-22 Thread Lyndon Maydwell
Hey guys, I'm running a crawl over my lighttpd server, and nutch is repeatedly fetching the pages specified in the urls directory, and making no progress from there. It is crawling the tomcat server fine, just not the lighttpd one. Has anyone come across this problem before? I've run using 0.8

Snippet contents.

2007-08-10 Thread Lyndon Maydwell
I've noticed that the snippets returned in nutch's search seem to have the formatting added to them, and are then escaped into xml strings. How would I go about changing the process so that the content was escaped, then formatting added, then the snippet escaped? The reason I want this is so that

Re: IRC channel for Nutch?

2007-08-21 Thread Lyndon Maydwell
Sounds like a good idea to me.

Slow search

2007-09-05 Thread Lyndon Maydwell
Hello, I'm using nutch through the opensearch interface, and am noticing very slow search speeds, i.e. 3-4 seconds. I really need to find some way to speed the search up significantly. During the search 'top' indicates that it is using close to 100% CPU and around 40M of RAM line from top when

Re: fetch errors?

2007-09-05 Thread Lyndon Maydwell
That did fix the problem, thank you. On 7/13/07, Karol Rybak [EMAIL PROTECTED] wrote: Make sure that you configured the proper file; if you are using the crawl tool, crawl-urlfilter is used. If you use fetch or fetch2, regex-urlfilter is used. On 7/13/07, Lyndon Maydwell [EMAIL PROTECTED] wrote: Hi

maintain crawl script is failing

2007-09-16 Thread Lyndon Maydwell
Hi people, I'm running a script to update the crawl database twice a week, and today it failed. readdb crawl/crawldb -stats still works fine, but searching using the web application is returning 0 results. I'm looking for some problem in the crawl script, but I'd also like to be able to rebuild

free disk space

2007-09-17 Thread Lyndon Maydwell
Hi again, a bit of an urgent question: Can I delete the temporary files in /tmp/hadoop-ceims/ without any bad consequences? I have run out of system space and cannot reboot at the moment.

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Lyndon Maydwell
I've been having problems with the merge portion of the script too. My solution was to check the exit status of the merge ($?) and, if it failed, try again or wait until next time.

nutch_bin/nutch mergesegs $merged_segment -dir $segments
if [ $? -ne 0 ]
then
    echo merging segments
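A fuller sketch of that retry idea, with the nutch call mocked as a shell function so the loop can be demonstrated without a Nutch install (paths, variable names, and the retry count are illustrative):

```shell
#!/bin/sh
# Retry a failed segment merge a bounded number of times.
# "nutch" is mocked here: it fails on the first call and succeeds on the
# second, standing in for "$nutch_bin/nutch mergesegs ...".
attempts=0
nutch() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 2 ]  # non-zero exit on the first call only
}

max_retries=3
tries=0
until nutch mergesegs crawl/MERGEDsegments -dir crawl/segments; do
  tries=$((tries + 1))
  if [ "$tries" -ge "$max_retries" ]; then
    echo "merging segments failed after $max_retries attempts" >&2
    exit 1
  fi
  echo "merge failed, retrying ($tries/$max_retries)"
done
echo "merge succeeded on attempt $attempts"
```

Bounding the retries matters: an unconditional "try again" loop can spin forever if the merge fails for a persistent reason such as a full /tmp partition.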

Re: Fetch failed due to space problems on /tmp (?)

2007-10-23 Thread Lyndon Maydwell
From what I have read, this has been solved in recent revisions, so downloading a new build or checking out the latest source should solve the problem. I am still using a version that has this problem, but should be switching shortly. My solution in the meantime has been to delete the temporary

Re: No space left on device

2007-11-21 Thread Lyndon Maydwell
Hi list. I've recently moved the location of the temp-files for my crawls to a 20GB partition specifically for this purpose. I'm once again getting this warning. Is there something that can be done to make Nutch more efficient with its temp files? -- stuff: Nutch version Release 0.9 -

url normalization

2007-12-05 Thread Lyndon Maydwell
Is there a way to apply regex normalization on the urls currently in the database? e.g. I would like to make www.asdf.com equivalent to asdf.com
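A sketch of how such a rule might look in conf/regex-normalize.xml (the rule mirrors ones quoted elsewhere in this archive; treat the exact file name and element names as something to verify against your Nutch version):

```xml
<regex-normalize>
  <!-- collapse www.asdf.com and asdf.com into one URL by stripping "www." -->
  <regex>
    <pattern>(https?://)www\.(.*)</pattern>
    <substitution>$1$2</substitution>
  </regex>
</regex-normalize>
```

Note that normalization rules apply to URLs as they are processed; URLs already stored in the crawl database are a separate problem, which is what the later messages in this archive wrestle with.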

Re: Missing pages.

2007-12-12 Thread Lyndon Maydwell
Ah, yes, of course. I was a bit hasty with my question. I was really referring to the results returned from the Nutch web-application. I'm also getting a lot of requests to change some of the configuration options relating to addresses Nutch considers equivalent. Is it possible to alter the

Re: Missing pages.

2007-12-12 Thread Lyndon Maydwell
For example, on the sites that I'm crawling, all addresses starting with www.x are simply redirects to x.

filter / normalize from command line on existing db

2007-12-13 Thread Lyndon Maydwell
I'm attempting to run some new regex-normalize and regex-urlfilter rules on my existing crawl directory. For example:

<regex>
  <pattern>(https?://)www\.(.*)</pattern>
  <substitution>$1$2</substitution>
</regex>

I tried the updatedb command, and the mergedb command, but neither of these seem

re-fetching pages

2007-12-19 Thread Lyndon Maydwell
I'm getting search results for many pages which are now 404s. I have set

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between re-fetches of a page.</description>
</property>

in my nutch-site.xml, but when I look at the fetch part of the logs

strange page rank

2008-01-30 Thread Lyndon Maydwell
Hi list. I'm having trouble figuring out why certain pages are being ranked much higher than others on my Nutch installation. For example, not long ago, the department of computing's homepage was ranked #1 for the query computing department. However, recently it has dropped in the rankings

Re: strange page rank

2008-02-06 Thread Lyndon Maydwell
Sorry to bump this, but I just noticed that the scores for my recent crawler are very high.

-- old crawler (sensible results) --
min score: 0.0
avg score: 0.505
max score: 7736.152

-- new crawler (poor results) --
min score: 0.0
avg score: 9.4379096E7
max score:

Re: strange page rank

2008-02-07 Thread Lyndon Maydwell
Thanks for your help Dennis. I'm not sure that the problem is coming from the link.internal boost. Some pages with very high scores have relatively few inbound links, yet pages that seem to match more criteria for boosts, and have a far greater number of inbound links receive a much lower score.

Re: strange page rank

2008-02-10 Thread Lyndon Maydwell
Hi list. I've reached a dead end with my page rankings. I dumped my crawldb and extracted the urls, which I used to recrawl from scratch. The score problem now seems to have resolved itself, with the stats:

min score: 0.01
avg score: 1.456
max score: 1769.588

However, my rankings

Re: strange page rank

2008-02-11 Thread Lyndon Maydwell
I'll give it a shot with a very low internal boost. Thanks a lot for your assistance.

Re: strange page rank

2008-02-11 Thread Lyndon Maydwell
Thanks guys. Problem solved. It was the ignore property that was really throwing me, as dumping urls from the linkdb wasn't showing them to me. Setting the internal link boost to 0.01 seems to have solved my problem completely.
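For reference, the internal-link boost discussed here is, to the best of my knowledge, the db.score.link.internal property; a sketch of the nutch-site.xml override (check the property name and description against your own nutch-default.xml):

```xml
<property>
  <name>db.score.link.internal</name>
  <value>0.01</value>
  <description>Score contribution taken from links within the same host.</description>
</property>
```

Lowering this value stops pages on a site from inflating each other's scores through dense internal cross-linking.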

Re: Spell checker or did you mean...? plugin

2008-02-22 Thread Lyndon Maydwell
It would be great if there were a spell checker plugin that played nicely with Nutch's rss output, as I've just written a PL/SQL spell checker to use in conjunction with Nutch, and while fun, it might not be as good as it could be.

Re: How To Fetch for '?' URLs

2008-03-12 Thread Lyndon Maydwell
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
http://www.giantfood.com/corporate/company_press_display.htm?press_id=380 has '=' in it. Nutch bases the inclusion / exclusion upon the first regex that matches the url.
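Because the first matching rule wins, ordering in regex-urlfilter.txt decides a URL's fate. A sketch of a filter file that admits the page above despite its '=' (the '-[?*!@=]' line is the stock query-character rule; the first '+' line is an illustrative override placed above it so it matches first):

```
# allow this specific page even though it contains '='
+^http://www\.giantfood\.com/corporate/company_press_display\.htm

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
```

If the '+' override were placed below the '-' rule instead, the URL would be excluded before the override was ever consulted.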

Stemming plugin problem.

2008-04-01 Thread Lyndon Maydwell
Hi list, I'm trying to get stemming working on nutch-1.0-dev using the instructions found on the wiki for version 0.8 ( http://wiki.apache.org/nutch/Stemming ). I've set up everything pretty much how it was outlined in the walkthrough, but I'm getting errors when I try to use the plugin.

Re: Can two different urls be configured as same ?

2008-04-21 Thread Lyndon Maydwell
regex-normalize.xml. This allows you to transform urls based on regular expressions, so you could make one appear to be the other, or vice versa, or both appear to be a third. Rules are written like so:

<regex-normalize>
  <regex>
    <pattern>(https?://)www\.(.*)</pattern>
    <substitution>$1$2</substitution>
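A pattern/substitution pair like this can be tried outside Nutch with sed (illustrative only; the pattern has two capture groups, written $1$2 in Nutch's rule notation and \1\2 in sed's):

```shell
# Strip a leading "www." the same way the normalizer rule would.
echo "http://www.asdf.com/page" | sed -E 's#(https?://)www\.(.*)#\1\2#'
# prints: http://asdf.com/page
```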

Disk consumption.

2008-05-11 Thread Lyndon Maydwell
Hi Nutch-user. I've been crawling our internal sites for a while now and the index is growing rapidly. I have filled the 100G partition I have been allotted and am looking at finding a sustainable way to maintain the indexes at around this size rather than continuously expanding them. I read in

Re: Nutch index vs Lucene index

2008-06-25 Thread Lyndon Maydwell
I've had no problems using it as a pure lucene index.

Re: Problems with highlighter

2008-09-12 Thread Lyndon Maydwell
I'd be happy to try trawling through the code for you :) I've been looking for stemming code that will run on 1.0 for ages now!

Re: Does Nutch support the boolean OR operator in a search query?

2009-01-19 Thread Lyndon Maydwell
Lucene has support for OR queries, so it should be possible to do it, but support for this in nutch isn't available as far as I know. I'd also be interested if anyone has managed to implement this. On Tue, Jan 20, 2009 at 1:50 AM, M S Ram ms...@cse.iitk.ac.in wrote: Oh! That's sad! :( What is the

Re: error after adding indexes manually

2009-03-13 Thread Lyndon Maydwell
What versions of Lucene are Nutch and Luke using? When you play with the index you should ensure that the version of Lucene being used is the same as what Nutch is using. On Sat, Mar 14, 2009 at 8:41 AM, alx...@aim.com wrote: Hello, I used lukeall-0.9.1.jar to manually add a new record

Re: error after adding indexes manually

2009-03-13 Thread Lyndon Maydwell
You're probably safer replacing it in the index editing utility. On Sat, Mar 14, 2009 at 10:33 AM, alx...@aim.com wrote: btw, which version of lucene is in nutch-0.9? Thanks. Alex. -Original Message- From: Lyndon Maydwell maydw...@gmail.com To: nutch-user@lucene.apache.org

Re: error after adding indexes manually

2009-03-13 Thread Lyndon Maydwell
click lucene-core-2.1.0.jar -Original Message- From: Lyndon Maydwell maydw...@gmail.com To: nutch-user@lucene.apache.org Sent: Fri, 13 Mar 2009 8:20 pm Subject: Re: error after adding indexes manually I just checked. (I usually just have the trunk source). Nutch 0.9

Re: lukeall-0.9.1 to manually add indexes

2009-04-01 Thread Lyndon Maydwell
I've noticed that you need to optimize the index for nutch to pick up changes. Have you tried this? On Wed, Apr 1, 2009 at 12:42 PM, alx...@aim.com wrote: Thanks for your response. In luke there is also an option to commit. I opened the new index again, and there is the document I created. But the

Re: Crawl depth problem

2010-01-23 Thread Lyndon Maydwell
Have you set up your regex-urlfilter.txt correctly? I've been caught out by this before. On Sat, Jan 23, 2010 at 4:31 PM, zud praveenmotur...@gmail.com wrote: hi   i am running nutch crawl and i have specified the depth as 200 but in the console it is showing  Stopping at depth=1 - no more