[Fwd: Recrawling... methodology?]

2006-07-31 Thread Matthew Holt
Can anyone offer any insight into this? If I am correct and the recrawl script is currently not working properly, I will update the script and make it available to the community. Thanks.. Matt --- Begin Message --- I need some help clarifying if recrawling is doing exactly what I think it is

Recrawling... methodology?

2006-07-28 Thread Matthew Holt
I need some help clarifying if recrawling is doing exactly what I think it is. Here's the current scenario of how I think a recrawl should work: I crawl my intranet with a depth of 2. Later, I recrawl using the script found below: http://wiki.apache.org/nutch/IntranetRecrawl
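
For readers trying to reproduce this by hand: a minimal sketch of the generate/fetch/update cycle that the wiki recrawl script automates, assuming 0.8/0.9-style commands and an illustrative "crawl" directory (flags and layout vary by version):

depth=2
for i in `seq 1 $depth`
do
  # generate a fetchlist of due urls into a new segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 100
  segment=`ls -d crawl/segments/* | tail -1`
  # fetch the segment, then fold the results back into the crawldb
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
# rebuild the link database and the index over all segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# note: the indexes directory must not already exist, or Indexer fails
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*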

How to get the crawl database free of links to recrawl only from seed URL?

2007-08-24 Thread Ismael
Hello, I'm using Nutch 0.9's jar to program in Java, making crawls with a predefined depth, and I am having a problem when trying to recrawl; I don't know if I am solving it in the right way: In the first crawl I have no problems, but when I recrawl, in my crawl database

Please Help.. recrawl script.. will send out to the list when finished for 0.8.0

2006-07-20 Thread Matthew Holt
I sent out a few emails regarding a recrawl script I wrote. However, if it'd be easier for anyone to help, can you please check that all of the below steps are the only ones that need to be taken to recrawl? Or if there is a resource online that describes manually recrawling, that'd be

Re: Recrawl urls

2006-08-03 Thread Nahuel ANGELINETTI
But the websites just added haven't been crawled yet... And they're not crawled during the recrawl... Will "bin/nutch purge" restart everything? On Thu, 3 Aug 2006 09:21:04 -0300, "Lourival Júnior" <[EMAIL PROTECTED]> wrote: > In the nutch conf/nutch

Recrawl error pages optimization

2007-05-05 Thread karthik085
Hi, I crawled a website. Around 500 out of 5000 pages generated errors/exceptions. I would like to recrawl only these 500 pages. The errors appear to be something similar to this:
Segment#1: 0 errors
Segment#2: 120 errors
Segment#3: 10 errors
Segment#4: 370 errors
Segment#5: 0 errors
Q1: If I

Re: intranet recrawl 0.9

2007-08-09 Thread Brian Demers
Yeah, that blog is part of the reason why I sent this email. There seems to be lots of confusion around maintaining a crawl (crawl/recrawl). Can anyone fill in the gaps? On 8/9/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote: > I have written about the phases of crawl and re-crawl.

Re: recrawl in 1.0

2008-06-06 Thread ogjunk-nutch
riginal Message > From: scottyd <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Thursday, June 5, 2008 2:44:21 PM > Subject: recrawl in 1.0 > > > I was wondering how to accomplish a recrawl in the trunk release of nutch. > > I've read through some other post

Problem with recrawl

2008-01-10 Thread [EMAIL PROTECTED]
Hi there, I'm actually having weird problems with my recrawl procedure (nutch 0.9). The situation is the following: First, I crawl a couple of domains. Then, I start a separate crawl with pages resulting from the first crawl, and finally merge these two crawls. What I basically wa
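
A rough sketch of merging two crawl directories as the poster describes, assuming Nutch 0.8/0.9 mergedb/mergesegs commands; the directory names crawl1, crawl2 and merged are placeholders:

# merge the two crawl databases into a third
bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb
# merge every segment from both crawls into one directory
bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*
# rebuild linkdb and index from the merged data
bin/nutch invertlinks merged/linkdb -dir merged/segments
bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*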

intranet recrawl 0.9

2007-08-09 Thread Brian Demers
All, Does anyone have an updated recrawl script for 0.9? Also, does anyone have a link that describes each phase of a crawl / recrawl (for 0.9)? It looks like it changes each version. I searched the wiki, but I am still unclear. thanks

AW: recrawl index

2006-12-29 Thread Otto, Frank
Thanks for your answer and for the hint. Has someone done this as a java main class? > -Original Message- > From: Damian Florczyk [mailto:[EMAIL PROTECTED] > Sent: Friday, 29 December 2006 14:22 > To: nutch-user@lucene.apache.org > Subject: Re: recrawl index >

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior
Which version are you using? On 8/3/06, Nahuel ANGELINETTI <[EMAIL PROTECTED]> wrote: But the websites just added haven't been crawled yet... And they're not crawled during the recrawl... Will "bin/nutch purge" restart everything? On Thu, 3 Aug 2006 09:21:04 -0300,

Re: Recrawl urls

2006-08-03 Thread Nahuel ANGELINETTI
0.7.2 of nutch On Thu, 3 Aug 2006 09:37:24 -0300, "Lourival Júnior" <[EMAIL PROTECTED]> wrote: > Which version are you using? > > On 8/3/06, Nahuel ANGELINETTI <[EMAIL PROTECTED]> wrote: > > > > But the websites just added haven't been yet crawl

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior
This command "bin/nutch purge" doesn't exist. Well I can't say you what is happening. Give me the output when you run the recrawl. On 8/3/06, Nahuel ANGELINETTI <[EMAIL PROTECTED]> wrote: 0.7.2 of nutch Le Thu, 3 Aug 2006 09:37:24 -0300, "Lourival Júni

recrawl question

2006-12-11 Thread Nancy Snyder
Hi I am using nutch-0.8.1 and copied the recrawl script from the web. I did a simple crawl on url http://www.saic.com at depth 2 with -topN 100 and got 18 records. But when I do a recrawl with -topN 100 and -adddays 31 (forcing all pages to be refetched), I get 132 documents. The initial

Re: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread xiao yang
It depends on your crawldb size and the number of urls you fetch. The crawldb stores the urls fetched and to be fetched. When you recrawl with separate commands, you first read data from the crawldb and generate the urls that will be fetched this round. An initial crawl first injects seed urls into

Re: Recrawling question

2006-06-06 Thread Matthew Holt
Just FYI.. After I do the recrawl, I do stop and start tomcat, and still the newly created page can not be found. Matthew Holt wrote: The recrawl worked this time, and I recrawled the entire db using the -adddays argument (in my case ./recrawl crawl 10 31). However, it didn't find a

Re: intranet recrawl 0.9

2007-08-09 Thread Kai_testing Middleton
nutch-user@lucene.apache.org Sent: Thursday, August 9, 2007 8:04:20 AM Subject: intranet recrawl 0.9 All, Does anyone have an updated recrawl script for 0.9? Also, does anyone have a link that describes each phase of a crawl / recrawl (for 0.9) it looks like it changes each version. I searched the

RE: RE : Problem with crawl and recrawl

2008-12-08 Thread José Mestre
Hi again, I have no answer. Why are my documents unfetched when I do a recrawl, please? Thanks. José Mestre -Original Message- From: José Mestre [mailto:[EMAIL PROTECTED] Sent: Tuesday, 2 December 2008 14:07 To: nutch-user@lucene.apache.org Subject: RE: RE: Problem with craw

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior
In the nutch conf/nutch-default.xml configuration file there is a property called db.default.fetch.interval. When you crawl a site, nutch schedules the next fetch for "today + db.default.fetch.interval" days. If you execute the recrawl command and the pages that you fetch don't reach this d
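
A sketch of overriding that property in conf/nutch-site.xml; db.default.fetch.interval is the 0.7/0.8-era name (Nutch 1.0 renamed it db.fetch.interval.default and measures it in seconds):

<property>
  <name>db.default.fetch.interval</name>
  <!-- days before a fetched page is due to be fetched again -->
  <value>30</value>
</property>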

Re: Problem with recrawl

2008-01-10 Thread Susam Pal
'conf/crawl-urlfilter.txt' file would be used instead. Regards, Susam Pal On Jan 10, 2008 6:34 PM, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > Hi there, > > I'm actually having weird problems with my recrawl procedure (nutch0.9). > > The situation is the follo

Re: recrawl index

2006-12-29 Thread Damian Florczyk
Otto, Frank wrote: hi, I'm new to nutch. I have crawled my website. But how can I recrawl/refresh the index without deleting the crawl folder? kind regards frank Well, google is your friend, but if you can't use it try this link: http://today.java.net/pub/a/today/2006/

Re: Recrawl urls

2006-08-03 Thread Nahuel ANGELINETTI
I have another question: I did what you gave me... It injects the new urls and "recrawls" them, but unlike the first crawl it doesn't download the web pages and really crawl them... perhaps I'm mistaken somewhere... Any idea? Regards, -- Nahuel ANGELINETTI On Thu, 3 Aug

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-17 Thread BELLINI ADAM
but I configured nutch to fetch every 6 hours, and I'm crawling every day at 3 am, and even though pages didn't change I see them being fetched every day!! > Date: Fri, 18 Dec 2009 00:04:12 +0100 > Subject: Re: difference in time between an initial crawl and recrawl with a >

Re: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-18 Thread MilleBii
y 6 hours, and I'm crawling every day at 3 am, and even though pages didn't change I see them being fetched every day!! >> Date: Fri, 18 Dec 2009 00:04:12 +0100 >> Subject: Re: difference in time between an initial crawl and recrawl with >> a

Re: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-17 Thread MilleBii
Well, it is somewhat more subtle... nutch will only recrawl a page every 30 days by default, and if it finds that the page did not change in the meantime it will delay the next recrawl even further, to more than 30 days. After 90 days everything is recrawled no matter what. So actually it does make a

Re: how to force nutch to do a recrawl

2009-12-09 Thread MilleBii
Nutch only recrawls every 30 days by default. So set the numberDays adequately and it will recrawl; read nutch-default.xml to get the details 2009/12/9, xiao yang : > What do you mean by "recrawl"? > Does the following command meet what you need? > bin/nutch crawl urls -dir c
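
A sketch of what "setting numberDays adequately" looks like when generating by hand, assuming the 0.8/0.9 Generator's -adddays option and illustrative paths; -adddays shifts the due-date check forward so recently fetched pages become eligible again:

# treat every page as 31 days older than it is, forcing refetch
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 31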

Re: Recrawl urls

2009-05-14 Thread aidahaj
Thanks for this information about recrawling. I am running a recrawl operation, but every time I do it I don't get the same results as the first crawl (different documents, not the same web pages). So how can I manage to recrawl the same pages? Maybe fix the property db.default.fetch.int

difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread BELLINI ADAM
hi, I just want to know the difference between an initial crawl and a recrawl using the fetch, generate, update commands. Is there a difference in time between using an initial crawl every time (by deleting the crawl_folder) and using a recrawl without deleting the initial crawl_folder

Re: recrawl question

2006-12-12 Thread Mathijs Homminga
spect your crawldb. At this moment, you'll have two segments (one for each depth). With your recrawl command you are telling Nutch to fetch the 100 best-scoring unfetched urls from the crawldb. This might include the 18 urls which were fetched in the initial crawl since you used -adddays 31, bu

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-17 Thread BELLINI ADAM
having a full crawldb > Date: Thu, 17 Dec 2009 16:08:38 +0800 > Subject: Re: difference in time between an initial crawl and recrawl with a > full crawldb > From: yangxiao9...@gmail.com > To: nutch-user@lucene.apache.org > > If you crawl with "bin/nutch crawl ...

Re: RE : Problem with crawl and recrawl

2008-12-08 Thread Julien Nioche
http://www.digitalpebble.com 2008/12/8 José Mestre <[EMAIL PROTECTED]> > Hi again, I have no answer. > Why are my documents unfetched when I do a recrawl, please? > > Thanks. > > José Mestre > > -Original Message- > From: José Mestre [mailto:[EMAIL PROTECTED]

Re: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-17 Thread xiao yang
If you crawl with "bin/nutch crawl ..." command without deleting the crawldb. The result will be the same with recrawl. It only wastes the initial injection phase and crawldb update phase, but that won't affect the final result. On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM wro

Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
egards, Stefan Matthew Holt wrote: > Just FYI.. After I do the recrawl, I do stop and start tomcat, and still > the newly created page can not be found. > > Matthew Holt wrote: > >> The recrawl worked this time, and I recrawled the entire db using the >> -adddays

RE: Is it necessary to restart Servlet/JSP container after recrawl?

2010-03-29 Thread Arkadi.Kosmynin
> Subject: Is it necessary to restart Servlet/JSP container after > recrawl? > > I have a question about nutch recrawl: every time after a recrawl, if I > don't restart tomcat, I get 0 search results. Is it necessary to restart > the container? > >

Intranet Recrawl Script for 0.8.0

2006-07-14 Thread Matthew Holt
Does anyone have a good Intranet recrawl script for nutch-0.8.0? Thanks.. Matt

Updating index without restarting the app server

2008-11-07 Thread shree lakshmi
Hi, When a recrawl is done, the app server requires a restart for the new indexes to be reflected. If the folder where the recrawl is done is the one the web app points to, a folder named merge-output is created inside the index folder once the recrawl is completed. Is there any way to

Re: How to get the crawl database free of links to recrawl only from seed URL?

2007-08-24 Thread John Mendenhall
> In the first crawl I have no problems, but when I recrawl, in my crawl > database there are pages and links from the previous operation, so if > I first crawl with depth 1 and later recrawl with depth 1 again, it is > like a depth-2 crawl. For example: > > I make a d

Re: Recrawling question

2006-06-06 Thread Matthew Holt
The recrawl worked this time, and I recrawled the entire db using the -adddays argument (in my case ./recrawl crawl 10 31). However, it didn't find a newly created page. If I delete the database and do the initial crawl over again, the new page is found. Any idea what I'm doing wr

Re: Recrawl not following crawl-urlfilter.txt

2007-02-08 Thread chee wu
The crawl command uses "crawl-tool.xml" as its default nutch config, but the recrawl script uses "nutch-site.xml". So just copy all the configuration in "crawl-tool.xml" to "nutch-site.xml". Concerning the selection of "crawl-urlfilter.txt"
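
The override that usually matters here is the urlfilter file; a sketch of what to carry into conf/nutch-site.xml, assuming the 0.8/0.9 property name (check your version's crawl-tool.xml for the complete list of overrides):

<property>
  <name>urlfilter.regex.file</name>
  <!-- make the recrawl apply the same filters as the one-shot crawl command -->
  <value>crawl-urlfilter.txt</value>
</property>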

RE : RE : Problem with crawl and recrawl

2008-12-02 Thread José Mestre
Hi, 62 docs are in the index. José From: Alexander Aristov [EMAIL PROTECTED] Sent: Tuesday, 2 December 2008 06:58 To: nutch-user@lucene.apache.org Subject: Re: RE: Problem with crawl and recrawl Maybe a silly question, but how do I know how many

Adddays confusion - easy question for the experts

2006-06-24 Thread Honda-Search Administrator
Reader's Digest version: How can I ensure that nutch only crawls the urls I inject into the fetchlist and does not recrawl the entire webdb? Can anyone explain to me (in simple terms) exactly what adddays does? Long version: My setup is simple. I crawl a number of internet forums. This req

Re: Recrawling

2006-09-07 Thread Tomi NA
On 9/6/06, Andrei Hajdukewycz <[EMAIL PROTECTED]> wrote: Another problem I've noticed is that it seems the db grows *rapidly* with each successive recrawl. Mine started at 379MB, and it seems to increase by roughly 350MB every time I run a recrawl, despite there not being anywher

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread BELLINI ADAM
thx for the explanation. so if I understood well, using the separate commands I don't have to run as many times as I did in the initial crawl (with depth 10). in my recrawl I'm also doing it in a loop of 10!! am I wrong looping 10 times (generating, fetching, parsing, updating)?? m

Re: Recrawl not following crawl-urlfilter.txt

2007-02-08 Thread Steve Kallestad
Thanks! You're the man!!! Now I can automate this thing :). Steve http://www.stevekallestad.com/ On 2/8/07, chee wu <[EMAIL PROTECTED]> wrote: The crawl command uses "crawl-tool.xml" as its default nutch config, but the recrawl script uses "nutch-site.xml". So just copy

RE : Problem with crawl and recrawl

2008-12-01 Thread José Mestre
Here is the result with a recrawl:
CrawlDb statistics start: crawl_fetcher/crawldb
Statistics for CrawlDb: crawl_fetcher/crawldb
TOTAL urls: 3266
retry 0: 3266
min score: 0.19
avg score: 1.0285031
max score: 10.229
status 1 (DB_unfetched): 3204
status 2

Recrawl Script segment merging

2006-09-15 Thread Jacob Brunson
I'm looking over the Intranet Recrawl script here: http://wiki.apache.org/nutch/IntranetRecrawl and I'm a little confused about segment merging and deleting. Start code snip
# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch
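
For context, a sketch of how that merge-and-cleanup step continues in the wiki script's approach (variable names as in the snippet above; the exact rm/mv ordering is illustrative, not authoritative):

# merge all current segments into a single new one
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $crawl_dir/segments
# swap the merged segment in, then drop the originals and the temp dir
rm -rf $crawl_dir/segments/*
mv $mergesegs_dir/* $crawl_dir/segments/
rm -rf $mergesegs_dir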

file recrawl

2006-12-13 Thread Aïcha
Hi, I also work on recrawling, but for a file system; everything works fine. I modified a file and I want to run a recrawl to index this new version of my file. I am using nutch-0.8.1 and I tested the recrawl script and the merge script; they work fine, but to build the new index

recrawl index

2006-12-29 Thread Otto, Frank
hi, I'm new to nutch. I have crawled my website. But how can I recrawl/refresh the index without deleting the crawl folder? kind regards frank

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior
Hi Nahuel! You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script.<http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html> Take a look at the adddays argument and at the configuration pr
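
The syntax quoted above is 0.7-style; a sketch of both forms, since the inject command changed between versions (all paths illustrative):

# Nutch 0.7.x: inject urls from a flat file into the WebDB
bin/nutch inject crawl/db -urlfile urlfile.txt
# Nutch 0.8 and later: inject a directory of seed-list files into the crawldb
bin/nutch inject crawl/crawldb urls/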

Re: MergeSegments - map reduce thread death

2009-11-05 Thread fadzi
t of ideas on this. Any suggestions will be quite welcome.
>>> Here is my set up:
>>> RAM: 4G
>>> JVM HEAP: 2G
>>> mapred.child.java.opts = 1024M
>>> hadoop-0.19.1-core.jar
>>> nutch-1.0
>>> Xen VPS.

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread Peters, Vijaya
-Original Message- From: xiao yang [mailto:yangxiao9...@gmail.com] Sent: Wednesday, December 16, 2009 2:21 PM To: nutch-user@lucene.apache.org Subject: Re: difference in time between an initial crawl and recrawl with a full crawldb It

Recrawl urls

2006-08-03 Thread Nahuel ANGELINETTI
Hello, I was searching for how to add new urls to the crawl url list and how to recrawl all urls... Can you help me? thanks, -- Nahuel ANGELINETTI

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread BELLINI ADAM
in my case I didn't notice that... but maybe recrawling with a full crawldb is quicker than the initial crawl... but I need someone to tell me whether I'm right or not, maybe with some metrics > Subject: RE: difference in time between an initial crawl and recrawl with a > f

Recrawl and crawl-urlfilter.txt

2010-03-12 Thread Joshua J Pavel
I'm having multiple problems recrawling with nutch 0.9. Here are 2 questions. :-) Right now, using the script I find here ( http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html ), I think I'm close to a workable solution, but the recrawl doesn't re

Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Thanks for putting up with all the messages to the list... Here is the recrawl script for 0.8.0 if anyone is interested. Matt
---
#!/bin/bash
# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

Re: Intranet Recrawl Script for 0.8.0

2006-07-14 Thread kevin
Where can I download nutch version 0.8? I can't find it on the nutch website. Matthew Holt wrote: Does anyone have a good Intranet recrawl script for nutch-0.8.0? Thanks.. Matt

Re: RE : Problem with crawl and recrawl

2008-12-01 Thread Alexander Aristov
Maybe a silly question, but how do I know how many docs are in the index? thanks Alex 2008/12/2 José Mestre <[EMAIL PROTECTED]> > Here is the result with a recrawl: > > CrawlDb statistics start: crawl_fetcher/crawldb > Statistics for CrawlDb: crawl_fetcher/crawldb > TOTAL url
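
A later reply in this thread answers with readdb; a sketch (this counts urls per status in the crawldb, which is close to, but not the same as, the number of documents in the Lucene index):

# prints TOTAL urls plus per-status counts (DB_fetched, DB_unfetched, ...)
bin/nutch readdb crawl_fetcher/crawldb -stats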

Re: Recrawl urls

2006-08-03 Thread Nahuel ANGELINETTI
n/nutch inject $nutch-dir/db -urlfile > urlfile.txt. To recrawl your WebDB you can use this > script.<http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html> > > Take a look at the adddays argument and at the configuration property > db.default.fetch.inter

Is it necessary to restart Servlet/JSP container after recrawl?

2010-03-29 Thread 段军义
I have a question about nutch recrawl: every time after a recrawl, if I don't restart tomcat, I get 0 search results. Is it necessary to restart the container?

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-18 Thread BELLINI ADAM
ct: Re: difference in time between an initial crawl and recrawl with a > full crawldb > From: mille...@gmail.com > To: nutch-user@lucene.apache.org > > Wait 30 days and you should see the difference ... Since settings are > time based, if you crawl every day or hour it doesn't

Re: Recrawling

2006-09-06 Thread Andrei Hajdukewycz
Another problem I've noticed is that it seems the db grows *rapidly* with each successive recrawl. Mine started at 379MB, and it seems to increase by roughly 350MB every time I run a recrawl, despite there not being anywhere near that many additional pages. This seems like a pretty s

Re: Intranet Recrawl Script for 0.8.0

2006-07-15 Thread Matthew Holt
kevin wrote: Where can I download nutch version 0.8? I can't find it on the nutch website. Matthew Holt wrote: Does anyone have a good Intranet recrawl script for nutch-0.8.0? Thanks.. Matt From trunk in the SVN repository.

Recrawl a specific web Page

2006-07-13 Thread Lourival Júnior
How can I recrawl a specific web page? For example, I have an html page that is constantly updated. Is there a command for that? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-18 Thread MilleBii
Nutch will initially recrawl the urls every "interval.default" seconds. If it finds that a page has not changed, it will increase the interval up to a limit, "interval.max". That is, if you don't delete the whole crawldb every day, like you seem to do. So in your case a
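
A sketch of the corresponding conf/nutch-site.xml overrides, assuming the Nutch 1.0-era property names (older releases used db.default.fetch.interval, in days):

<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds before a page is due again: one day instead of 30 -->
  <value>86400</value>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <!-- upper bound the adaptive schedule may stretch to (here 7 days) -->
  <value>604800</value>
</property>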

0.8 Recrawl script updated

2006-08-02 Thread Matthew Holt
Just letting everyone know that I updated the recrawl script on the Wiki. It now merges the created segments, then deletes the old segs to prevent a lot of unneeded data remaining/growing on the hard drive. Matt http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head

adddays / recrawl

2009-10-30 Thread Fadzi Ushewokunze
hi, I am sure this has been asked before, but I can't find a satisfactory answer in the forums. For recrawling and limiting fetching to pages 30+ days old, I am using adddays=30, but it seems to still recrawl everything! What's the best way to configure this?

Re: error nutch recrawl

2009-07-07 Thread xiao yang
You can use bin/hadoop fs -rmr crawl to delete the whole directory and recrawl. On Tue, Jul 7, 2009 at 1:47 AM, Maurizio Croci wrote: > Hi, I'm trying to recrawl (with a shell script; I already have a webDB...) a > website (with some links to other webpages, .html, .doc, .pdf, ...) but this

ERROR when recrawling... can ANYONE help?

2006-06-23 Thread Honda-Search Administrator
I'm hoping that my emails actually reach other people, as they've been ignored so far. I just ran a recrawl today to crawl a few injected URLs that I have. At the end of the recrawl I received the following error: 060623 122916 merging segment indexes to: /home/honda/nutch-0

Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
er you fetched everything use:
nutch invertlinks ...
nutch index ...
Hope that helps. Otherwise let me know and I'll dig out the complete commandlines for you. Regards, Stefan

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl Nutch only recrawls every 30 days by default. So set the numberDays adequately and it will recrawl; read nutch-default.xml to get the details 2009/12/9, xiao yang : > What do you mean by "recrawl"? > Does the follow

Re: Recrawling question

2006-06-06 Thread Matthew Holt
to run a Nutch re-crawl
if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi
if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi
if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi
webdb_dir=$crawl_dir/db
segm

Re: Adddays confusion - easy question for the experts

2006-07-11 Thread Matthew Holt
Honda-Search Administrator wrote: Reader's Digest version: How can I ensure that nutch only crawls the urls I inject into the fetchlist and does not recrawl the entire webdb? Can anyone explain to me (in simple terms) exactly what adddays does? Long version: My setup is simple. I crawl a n

RE: RE : Problem with crawl and recrawl

2008-12-08 Thread José Mestre
nutch readdb crawl_fetcher/crawldb -stats José Mestre -Original Message- From: Julien Nioche [mailto:[EMAIL PROTECTED] Sent: Monday, 8 December 2008 18:22 To: nutch-user@lucene.apache.org Subject: Re: RE: Problem with crawl and recrawl Hello Jose, Sorry if I am suggesting something

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-18 Thread BELLINI ADAM
ok :) so I also have to set interval.max, because I haven't done it yet! Now it is 90 days, so I will set it to 24 hours and give it a try. thx very much > Date: Fri, 18 Dec 2009 17:45:42 +0100 > Subject: Re: difference in time between an initial crawl and recrawl with a > full crawl

How to reIndex after reCrawl?

2007-04-26 Thread Ilya Vishnevsky
Another question on a similar subject: For example, I did a recrawl and found some new pages. I want to add them to my index. I use Indexer to create indexes. How can I now add these indexes to my already existing index? If I just merge indexes, I'll lose documents placed in my old index.
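
One commonly posted recipe, sketched under the assumption of 0.8/0.9 commands and illustrative paths; point searcher.dir at the new index (or swap directories) afterwards:

# remove duplicate documents across the part-indexes, then merge them
# into a fresh directory (the output must not already exist)
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index-new crawl/indexes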

Recrawl using org.apache.nutch.crawl.Crawl

2008-01-31 Thread Susam Pal
his crawl directory? If yes, I have written a small patch that can be used to recrawl over the same crawl directory by adding a "-force" option to the "bin/nutch crawl" command line. With this patch, one can crawl and recrawl in the following manner:- bin/nutch crawl url
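
The quoted usage presumably continues along these lines; note the -force flag exists only with the author's patch applied and is not in stock Nutch of that era:

# initial crawl
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
# recrawl over the same directory, which the patch permits
bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -force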

Re: MergeSegments - map reduce thread death

2009-11-05 Thread fadzi
sequence of execution;
step 1 setup.
* first crawl was done using "bin/nutch crawl.."
  - urls = 1500
  - depth = 10
  - topN = 500
(so it should do all by round 3 right? what happens at rounds 4 to 10?)
step 2 to 5 setup.
* recrawl (repeat)
  - topN = 1
  - depth = 10
  - db.default.fetch.int

Re: 0.8 Recrawl script updated

2006-08-03 Thread Matthew Holt
I'm currently pretty busy at work. If I have time I'll do it later. The version 0.8 recrawl script has a working version online now. I temporarily modified it on the website yesterday when I ran into some problems, but I tested it further and the actual working code is up now. So

To avoid recrawl to index unchanged content.

2007-12-23 Thread pavankumar
changed during a certain period, so I need to fetch all of them. But I want to index only those files/urls whose content changed after the last crawl, so that the recrawl time gets reduced. Actually my re-crawl is taking more time than a fresh crawl. How can I improve the time spent

How to recrawl urls

2005-12-18 Thread Kumar Limbu
Hi everyone, I have browsed through the nutch documentation but I have not found enough information on how to recrawl the urls that I have already crawled. Do we have to do the recrawling ourselves or will the nutch application do it? More information in this regard would be highly appreciated

Re: how to force nutch to do a recrawl

2009-12-09 Thread xiao yang
> -Original Message- > From: MilleBii [mailto:mille...@gmail.com] > Sent: Wednesday, Decem

Re: Adddays confusion - easy question for the experts

2006-07-11 Thread Honda-Search Administrator
Reader's Digest version: How can I ensure that nutch only crawls the urls I inject into the fetchlist and does not recrawl the entire webdb? Can anyone explain to me (in simple terms) exactly what adddays does? Long version: My setup is simple. I crawl a number of internet forums. This requires me to sca

Re: Recrawling question

2006-06-06 Thread Matthew Holt
ds, Stefan Matthew Holt wrote: Just FYI.. After I do the recrawl, I do stop and start tomcat, and still the newly created page can not be found. Matthew Holt wrote: The recrawl worked this time, and I recrawled the entire db using the -adddays argument (in my case ./recrawl cra

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Renaud Richardet
Hi Matt and Lourival, Matt, thank you for the recrawl script. Any plans to commit it to trunk? Lourival, here's the part of the script that "reloads Tomcat"; not the cleanest, but it should work:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml
HTH, Renaud Lourival Jú

How can i crawl more site?

2006-10-12 Thread martin
Now I've crawled a site and used the http://wiki.apache.org/nutch/IntranetRecrawl script to recrawl, but I want to add more urls to crawl and merge them. Can I add urls directly in the current urls dir, or should I run a new crawl and then merge? I used nutch inject to add a url and ran the re

hadoop on single machine

2007-08-30 Thread Tomislav Poljak
Would it be recommended to use hadoop for crawling (100 sites with 1000 pages each) on a single machine? What would be the benefit? Something like described on: http://wiki.apache.org/nutch/NutchHadoopTutorial but on a single machine. Or is the simple crawl/recrawl (without hadoop, like

Built-in Recrawl

2006-07-15 Thread Matthew Holt
I'm sure there is a good answer for this, whether it be lack of time or not enough demand, but I was just wondering why there is not a 'recrawl' option that goes with the intranet crawl. I'm looking into making one for myself, and was just wondering if one is in development

Re[2]: Adddays confusion - easy question for the experts

2006-07-12 Thread Dima Mazmanov
Hi, Matthew. Could you please show your reindex script once again? You wrote on 12 July 2006, 1:51:21: > Honda-Search Administrator wrote: >> Reader's Digest version: >> How can I ensure that nutch only crawls the urls I inject into the >> fetchlist and not recraw

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Renaud Richardet wrote: Hi Matt and Lourival, Matt, thank you for the recrawl script. Any plans to commit it to trunk? Lourival, here's in the script what "reloads Tomcat", not the cleanest, but it should work # Tell Tomcat to reload index touch $nutch_dir/WEB-INF/web.x

Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-19 Thread Alexis Votta
The recrawl script for 0.9 I found in http://wiki.apache.org/nutch/IntranetRecrawl is not working. It works the first time; the second time, it fails with this error:
merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory crawl

nutch fetches already fetched urls again and again

2009-02-26 Thread NutchDeveloper
I use this script to crawl and recrawl the web: http://wiki.apache.org/nutch/Crawl I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30) because it fetches the same urls several times in different recrawl loops. What should I do to force Nutch to fetch ONLY unfetched urls from

Re: nutch fetches already fetched urls again and again

2009-02-26 Thread Bartosz Gadzimski
NutchDeveloper wrote: I use this script to crawl and recrawl the web: http://wiki.apache.org/nutch/Crawl I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30) because it fetches the same urls several times in different recrawl loops. What should I do to force Nutch to fetch ONLY

Recrawl not following crawl-urlfilter.txt

2007-02-08 Thread Steve Kallestad
Please oh please, don't shoot me for being a newbie. I have set up a site-search using nutch, and I have the crawl-urlfilter.txt file configured so that everything works properly when I call something similar to: bin/nutch crawl urls -dir crawl -depth 3 -topN 100 I grabbed the Intranet Re

adding domain to recrawl

2007-12-18 Thread [EMAIL PROTECTED]
Hi there, I have the following problem to solve: I already crawled a couple of domains and can also recrawl them frequently. But what if I want to add additional domains to my crawl later on? I can imagine two solutions: 1. Add the new domain somehow to the "crawldb" so it is considered
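
A sketch of the first option as usually done, assuming an 0.8/0.9 layout; newdomain.com and all paths are placeholders, and the urlfilter must also admit the new domain or the injected urls are filtered out on the next generate:

# 1. allow the new domain in conf/crawl-urlfilter.txt, e.g.
#    +^http://([a-z0-9]*\.)*newdomain.com/
# 2. inject its seed url(s) into the existing crawldb
mkdir -p newseeds
echo "http://www.newdomain.com/" > newseeds/seed.txt
bin/nutch inject crawl/crawldb newseeds/
# the next generate/fetch/update cycle will then pick the domain up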

Re: how to force nutch to do a recrawl

2009-12-09 Thread xiao yang
What do you mean by "recrawl"? Does the following command meets what you need? bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Change the destination directory to a different one with the last crawl. On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya wrote: > I'm running Nu

Re: ERROR when recrawling... can ANYONE help?

2006-06-23 Thread TDLN
AIL PROTECTED]> wrote: I'm hoping that my emails actually reach other people, as they've been ignored so far. I just ran a recrawl today to crawl a few injected URLs that I have. At the end of the recrawl I received the following error: 060623 122916 merging segment indexes to: /ho

Re: How to get the crawl database free of links to recrawl only from seed URL?

2007-08-25 Thread Ismael
expanding to the desired depth from these seed URLs. Again, thank you for answering. Ismael 2007/8/25, John Mendenhall <[EMAIL PROTECTED]>: > > In the first crawl I have no problems, but when I recrawl in my crawl > > database there are pages and links from the previous oper

Re: hadoop on single machine

2007-08-31 Thread Tomislav Poljak
s: Hadoop implements MapReduce, using HDFS. If there is no distributed file system over computer nodes (single machine configuration), what does Hadoop do? When running the crawl/recrawl cycle -> generate/fetch/update, what processes is Hadoop running? How can I monitor them to see what is going

[Fwd: Reworked recrawl script for 0.8.0]

2006-07-20 Thread Matthew Holt
Stefan, The nutch-user mailing list seems to be down, or at least unavailable to my personal account. I have spent several hours looking into creating/modifying an Intranet recrawl script for 0.8.0. I have it to where it does not error out; however, when I search for something using the
