Hey guys, I'm running a crawl over my lighttpd server, and nutch is
repeatedly fetching the pages specified in the urls directory, and
making no progress from there. It is crawling the tomcat server fine,
just not the lighttpd one. Has anyone come across this problem before?
I'm running version 0.8.
I've noticed that the snippets returned in nutch's search seem to have
the formatting added to them, and are then escaped into xml strings.
How would I go about changing the process so that the content was
escaped, then formatting added, then the snippet escaped?
the reason I want this is so that
Sounds like a good idea to me.
Hello,
I'm using Nutch through the OpenSearch interface, and am noticing very
slow search speeds, i.e. 3-4 seconds.
I really need to find some way to speed the search up significantly.
During the search, 'top' indicates that it is using close to 100% CPU
and around 40MB of RAM.
line from top when
That did fix the problem thank you.
On 7/13/07, Karol Rybak [EMAIL PROTECTED] wrote:
Make sure that you have configured the proper file: if you are using the crawl
tool, crawl-urlfilter is used. If you use fetch or fetch2, regex-urlfilter is used.
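For example, conf/crawl-urlfilter.txt for the crawl tool would need
something like this (the hostname is a placeholder for your server):

# accept pages on the lighttpd host (placeholder hostname)
+^http://lighttpd\.example\.com/
# reject everything else
-.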
On 7/13/07, Lyndon Maydwell [EMAIL PROTECTED] wrote:
Hi
Hi people,
I'm running a script to update the crawl database twice a week, and
today it failed.
readdb crawl/crawldb -stats still works fine, but searching using
the web application is returning 0 results.
I'm looking for some problem in the crawl script, but I'd also like to
be able to rebuild
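For the rebuild, a minimal sketch assuming the standard crawl layout
(crawl/crawldb, crawl/linkdb, crawl/segments; adjust paths to yours, and
note the indexes directory must not already exist):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*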
Hi again, a bit of an urgent question,
Can I delete the temporary files in /tmp/hadoop-ceims/ without any bad
consequences?
I have run out of system space and cannot reboot at the moment.
I've been having problems with the merge portion of the script too.
My solution was to check the exit status of the merge ($?) and, if it
failed, try again, or wait until the next run.
echo "merging segments"
nutch_bin/nutch mergesegs $merged_segment -dir $segments
if [ $? -ne 0 ]
then
    # merge failed; retry once, or just leave it for the next scheduled run
    nutch_bin/nutch mergesegs $merged_segment -dir $segments
fi
From what I have read, this has been solved in recent revisions, so
downloading a new build or checking out the latest source should solve
the problem. I am still using a version that has this problem, but
should be switching shortly. My solution in the meantime has been to
delete the temporary
Hi list.
I've recently moved the location of the temp-files for my crawls to a
20GB partition specifically for this purpose. I'm once again getting
this warning. Is there something that can be done to make Nutch more
efficient with its temp files?
-- stuff:
Nutch version: Release 0.9 -
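For reference, the temp location is controlled by hadoop.tmp.dir; a
sketch of the override in conf/nutch-site.xml (the path is just an example):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/nutch-tmp</value>
</property>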
Is there a way to apply regex normalization on the urls currently in
the database?
e.g. I would like to make www.asdf.com equivalent to asdf.com
Ah, yes, of course. I was a bit hasty with my question.
I was really referring to the results returned from the Nutch web-application.
I'm also getting a lot of requests to change some of the configuration
options relating to addresses Nutch considers equivalent. Is it
possible to alter the
For example, on the sites that I'm crawling, all addresses starting
with www.x are simply redirects to x.
I'm attempting to run some new regex-normalize and regex-urlfilter
rules on my existing crawl directory.
for example:
<regex>
  <pattern>(https?://)www\.(.*)</pattern>
  <substitution>$1$2</substitution>
</regex>
I tried the updatedb command, and the mergedb command, but neither of
these seem
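One thing worth trying (a sketch; check the usage string for your
version) is mergedb with the -normalize option, which runs the
registered normalizers over every entry while writing a new crawldb:

bin/nutch mergedb crawl/crawldb_normalized crawl/crawldb -normalize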
I'm getting search results for many pages which are now 404s.
I have set
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between re-fetches of a
  page.</description>
</property>
in my nutch-site.xml, but when I look at the fetch part of the logs
Hi list.
I'm having trouble figuring out why certain pages are being ranked
much higher than others on my Nutch installation.
For example, not long ago, the department of computing's homepage was
ranked #1 for the query "computing department".
However, recently it has dropped in the rankings
Sorry to bump this, but I just noticed that the scores for my recent
crawler are very high.
-- old crawler (sensible results) --
min score: 0.0
avg score: 0.505
max score: 7736.152
-- new crawler (poor results) --
min score: 0.0
avg score: 9.4379096E7
max score:
Thanks for your help Dennis.
I'm not sure that the problem is coming from the link.internal boost.
Some pages with very high scores have relatively few inbound links,
yet pages that seem to match more of the boost criteria, and have far
more inbound links, receive much lower scores.
Hi list.
I've reached a dead end with my page rankings.
I dumped my crawldb and extracted the urls which I used to recrawl
from scratch. The score problem now seems to have resolved itself,
with the stats:
min score: 0.01
avg score: 1.456
max score: 1769.588
However, my rankings
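For anyone repeating this, a sketch of the dump-and-extract step (the
exact dump format varies by version, and the part file name is an example):

bin/nutch readdb crawl/crawldb -dump crawldb_dump
# keep only the URL field of each record line
cut -f1 crawldb_dump/part-00000 | grep '^http' | sort -u > urls/seeds.txt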
I'll give it a shot with a very low internal boost.
Thanks a lot for your assistance.
Thanks guys. Problem solved.
It was the ignore property that was really throwing me, as dumping the
urls from the linkdb wasn't showing them to me.
Setting the internal link boost to 0.01 seems to have solved my
problem completely.
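For anyone else hitting this, the setting lives in conf/nutch-site.xml
(property name as in nutch-default.xml for 0.9; verify against your version):

<property>
  <name>db.score.link.internal</name>
  <value>0.01</value>
</property>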
It would be great if there were a spell checker plugin that played
nicely with Nutch's RSS output, as I've just written a PL/SQL spell
checker to use in conjunction with Nutch, and while it was fun, it
might not be as good as it could be.
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
http://www.giantfood.com/corporate/company_press_display.htm?press_id=380
has '=' in it.
Nutch bases the inclusion / exclusion upon the first regex that matches the url.
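So to let that press-release page through, an accept rule has to appear
before the generic query-skip rule, e.g. (a sketch):

# accept this site's press pages even though they contain '='
+^http://www\.giantfood\.com/corporate/company_press_display\.htm
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]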
Hi list,
I'm trying to get stemming working on nutch-1.0-dev using the
instructions found on the wiki for version 0.8 (
http://wiki.apache.org/nutch/Stemming ). I've set up everything pretty
much how it was outlined in the walkthrough, but I'm getting errors
when I try to use the plugin.
regex-normalize.xml
This allows you to transform urls based on regular expressions.
So you could make one appear to be the other, or vice versa, or both
appear to be a third.
Rules are written like so:
<regex-normalize>
  <regex>
    <pattern>(https?://)www\.(.*)</pattern>
    <substitution>$1$2</substitution>
  </regex>
</regex-normalize>
Hi Nutch-user.
I've been crawling our internal sites for a while now and the index is
growing rapidly. I have filled the 100G partition I have been allotted
and am looking at finding a sustainable way to maintain the indexes at
around this size rather than continuously expanding them. I read in
I've had no problems using it as a pure lucene index.
I'd be happy to trawl through the code for you :) I've been looking for
stemming code that will run on 1.0 for ages now!
Lucene has support for OR queries, so it should be possible to do it,
but support for this in nutch isn't available as far as I know. I'd
also be interested if anyone has managed to implement this.
On Tue, Jan 20, 2009 at 1:50 AM, M S Ram ms...@cse.iitk.ac.in wrote:
Oh! That's sad! :( What is the
What versions of Lucene are Nutch and Luke using? When you play with
the index you should ensure that the version of Lucene being used is
the same as what Nutch is using.
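A quick way to check the jar Nutch ships with (assuming a standard
install layout; $NUTCH_HOME is wherever you unpacked it):

ls $NUTCH_HOME/lib/lucene-core-*.jar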
On Sat, Mar 14, 2009 at 8:41 AM, alx...@aim.com wrote:
Hello,
I used lukeall-0.9.1.jar to manually add a new record.
You're probably safer replacing it in the index editing utility.
On Sat, Mar 14, 2009 at 10:33 AM, alx...@aim.com wrote:
btw, which version of lucene is in nutch-0.9?
Thanks.
Alex.
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To: nutch-user@lucene.apache.org
click lucene-core-2.1.0.jar
-Original Message-
From: Lyndon Maydwell maydw...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Fri, 13 Mar 2009 8:20 pm
Subject: Re: error after adding indexes manually
I just checked. (I usually just have the trunk source).
Nutch 0.9
I've noticed that you need to optimize the index for nutch to pick up changes.
Have you tried this?
On Wed, Apr 1, 2009 at 12:42 PM, alx...@aim.com wrote:
Thanks for your response. In
Luke there is also an option to commit. I opened the new index again, and
there is the document I created. But the
Have you set up your regex-urlfilter.txt correctly? I've been caught
out by this before.
On Sat, Jan 23, 2010 at 4:31 PM, zud praveenmotur...@gmail.com wrote:
hi
I am running nutch crawl and I have specified the depth as 200, but the
console is showing Stopping at depth=1 - no more
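A common cause of stopping at depth=1 is the URL filter rejecting every
outlink. Check that regex-urlfilter.txt (or crawl-urlfilter.txt if you
use the crawl tool) ends with an accept rule, e.g.:

# accept anything not rejected by the rules above
+.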