Can anyone offer any insight into this? If I am correct and the recrawl
script is currently not working properly, I will update the script and
make it available to the community. Thanks..
Matt
I need some help clarifying if recrawling is doing exactly what I think
it is. Here's the current scenario of how I think a recrawl should work:
I crawl my intranet with a depth of 2. Later, I recrawl using the script
found below:
http://wiki.apache.org/nutch/IntranetRecrawl
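For anyone following along, the wiki script above is essentially a
generate/fetch/updatedb loop followed by index maintenance. A minimal
sketch of that shape, assuming a 0.8/0.9-style layout with crawl/crawldb,
crawl/segments and crawl/linkdb under a local "crawl" directory:

#!/bin/bash
# One round per depth step: pick due urls, fetch them, fold the
# results back into the crawldb.
depth=2
for ((i = 0; i < depth; i++))
do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 31
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
# Rebuild the link structure and index over the new segments.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes_new crawl/crawldb crawl/linkdb crawl/segments/*

The real wiki script also dedups, merges segments and swaps the new index
in; indexing here goes to a fresh indexes_new directory because the
indexer will not write into an existing one.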
Hello,
I'm using the Nutch 0.9 jar to program in Java, making crawls with
a predefined depth, and I am having a problem when trying to recrawl;
I don't know if I am solving it in the right way:
In the first crawl I have no problems, but when I recrawl in my crawl
database
I sent out a few emails regarding a recrawl script I wrote. However, if
it'd be easier for anyone to help, can you please check that all of the
below steps are the only ones that need to be taken to recrawl? Or if
there is a resource online that describes manually recrawling, that'd be
But the websites I just added haven't been crawled yet... And they're not
crawled during the recrawl...
Will "bin/nutch purge" restart everything?
On Thu, 3 Aug 2006 09:21:04 -0300,
"Lourival Júnior" <[EMAIL PROTECTED]> wrote:
> In the nutch conf/nutch
Hi,
I crawled a website. Around 500 out of 5000 pages generated
errors/exceptions. I would like to recrawl only these 500 pages. The errors
appear to be something similar to this:
Segment#1: 0 errors
Segment#2: 120 errors
Segment#3: 10 errors
Segment#4: 370 errors
Segment#5: 0 errors
Q1: If I
Yeah, that blog is part of the reason why I sent this email. There
seems to be lots of confusion around maintaining a crawl
(crawl/recrawl)
Can anyone fill in the gaps?
On 8/9/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I have written about the phases of crawl and re-crawl.
-Original Message-
> From: scottyd <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Thursday, June 5, 2008 2:44:21 PM
> Subject: recrawl in 1.0
>
>
> I was wondering how to accomplish a recrawl in the trunk release of nutch.
>
> I've read through some other post
Hi there,
I'm actually having weird problems with my recrawl procedure (nutch 0.9).
The situation is the following:
First, I crawl a couple of domains. Then, I start a separate crawl with pages
resulting from the first crawl, and finally merge these two crawls.
What I basically wa
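A sketch of that merge step with the stock 0.8/0.9 tools, assuming the
two crawls live in crawl1 and crawl2 (directory names made up here):

# Combine the two crawldbs, segment sets and linkdbs into one crawl dir
bin/nutch mergedb crawl_merged/crawldb crawl1/crawldb crawl2/crawldb
bin/nutch mergesegs crawl_merged/segments crawl1/segments/* crawl2/segments/*
bin/nutch mergelinkdb crawl_merged/linkdb crawl1/linkdb crawl2/linkdb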
All,
Does anyone have an updated recrawl script for 0.9?
Also, does anyone have a link that describes each phase of a crawl /
recrawl (for 0.9)? It looks like it changes with each version.
I searched the wiki, but I am still unclear.
thanks
Thanks for your answer and for the hint.
Has someone done this as a Java main class?
> -Original Message-
> From: Damian Florczyk [mailto:[EMAIL PROTECTED]
> Sent: Friday, 29 December 2006 14:22
> To: nutch-user@lucene.apache.org
> Subject: Re: recrawl index
>
>
Which version are you using?
On 8/3/06, Nahuel ANGELINETTI <[EMAIL PROTECTED]> wrote:
But the websites I just added haven't been crawled yet... And they're not
crawled during the recrawl...
Will "bin/nutch purge" restart everything?
On Thu, 3 Aug 2006 09:21:04 -0300,
0.7.2 of nutch
On Thu, 3 Aug 2006 09:37:24 -0300,
"Lourival Júnior" <[EMAIL PROTECTED]> wrote:
> Which version are you using?
>
> On 8/3/06, Nahuel ANGELINETTI <[EMAIL PROTECTED]> wrote:
> >
> > But the websites just added hasn't been yet crawl
This command "bin/nutch purge" doesn't exist. Well I can't say you what is
happening. Give me the output when you run the recrawl.
On 8/3/06, Nahuel ANGELINETTI <[EMAIL PROTECTED]> wrote:
0.7.2 of nutch
On Thu, 3 Aug 2006 09:37:24 -0300,
"Lourival Júni
Hi
I am using nutch-0.8.1 and copied the recrawl script from the web.
I did a simple crawl on url http://www.saic.com at depth 2 with -topN
100 and got 18 records.
But when I do a recrawl with -topN 100 and -adddays 31 (forcing all
pages to be refetched), I
get 132 documents. The initial
It depends on your crawldb size and the number of urls you fetch.
The crawldb stores the urls fetched and to be fetched. When you recrawl
with the separate commands, you first read data from the crawldb and
generate the urls that will be fetched this round.
An initial crawl first injects seed urls into
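To make those phases concrete, a sketch of the inject/generate side
(paths assumed):

# Seed urls go into the crawldb once; stats show fetched vs. unfetched
bin/nutch inject crawl/crawldb urls
bin/nutch readdb crawl/crawldb -stats
# Each recrawl round then selects the due urls from the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 1000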
Just FYI... after I do the recrawl, I stop and start Tomcat, and still
the newly created page cannot be found.
Matthew Holt wrote:
The recrawl worked this time, and I recrawled the entire db using the
-adddays argument (in my case ./recrawl crawl 10 31). However, it
didn't find a
nutch-user@lucene.apache.org
Sent: Thursday, August 9, 2007 8:04:20 AM
Subject: intranet recrawl 0.9
All,
Does anyone have an updated recrawl script for 0.9?
Also, does anyone have a link that describes each phase of a crawl /
recrawl (for 0.9)? It looks like it changes with each version. I searched the
Hi again, I have no answer yet.
Why are my documents unfetched when I do a recrawl, please?
Thanks.
José Mestre
-Original Message-
From: José Mestre [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 2 December 2008 14:07
To: nutch-user@lucene.apache.org
Subject: RE: RE: Problem with craw
In the Nutch conf/nutch-default.xml configuration file there is a property called
db.default.fetch.interval. When you crawl a site, Nutch schedules the next
fetch for "today + db.default.fetch.interval" days. If you execute the recrawl
command and the pages that you fetch haven't reached this d
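For reference, the override would go inside the <configuration> element
of conf/nutch-site.xml; a sketch with the 0.8/0.9 property name (value
in days), here shortened so pages come due daily:

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
</property>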
'conf/crawl-urlfilter.txt' file would be used instead.
Regards,
Susam Pal
On Jan 10, 2008 6:34 PM,
[EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> Hi there,
>
> I'm actually having weird problems with my recrawl procedure (nutch 0.9).
>
> The situation is the follo
Otto, Frank wrote:
hi,
I'm new to nutch. I have crawled my website. But how can I recrawl/refresh the index without deleting the crawl folder?
kind regards
frank
Well, Google is your friend, but if you can't use it, try this link:
http://today.java.net/pub/a/today/2006/
I have another question. I did what you told me... It injects the
new urls and "recrawls", but unlike the first crawl it doesn't
download the web pages and really crawl them... perhaps I'm making a
mistake somewhere...
Any idea?
Regards,
--
Nahuel ANGELINETTI
On Thu, 3 Aug
but I configured Nutch to fetch every 6 hours, and I'm crawling every day at 3
am, and even though pages didn't change I see them being fetched every day!!
> Date: Fri, 18 Dec 2009 00:04:12 +0100
> Subject: Re: difference in time between an initial crawl and recrawl with a
>
Well, it is somewhat more subtle... Nutch will only recrawl a page every 30
days by default, and if it finds that the page did not change in the meantime
it will delay the next recrawl even further, to more than 30 days. After 90
days everything is recrawled no matter what.
So actually it does make a
Nutch only recrawls every 30 days by default. So set the numberDays
adequately and it will recrawl; read nutch-default.xml to get the
details.
2009/12/9, xiao yang :
> What do you mean by "recrawl"?
> Does the following command meets what you need?
> bin/nutch crawl urls -dir c
Thanks for this information about recrawling.
I am running a recrawl operation, but every time I do it I don't get the
same results as the first crawl (different documents, not the same web
pages). So how can I manage to recrawl the same pages?
Maybe fix the property db.default.fetch.int
hi,
I just want to know the difference between a first initial crawl and a recrawl
using the fetch, generate, update commands.
Is there a difference in time between using an initial crawl every time (by
deleting the crawl_folder) and using a recrawl without deleting the initial
crawl_folder?
spect your crawldb.
At this moment, you'll have two segments (one for each depth).
With your recrawl command you are telling Nutch to fetch the 100 best
scoring unfetched urls from the crawldb. This might include the 18 urls
which were fetched in the initial crawl since you used -adddays 31, bu
having
a full crawldb
> Date: Thu, 17 Dec 2009 16:08:38 +0800
> Subject: Re: difference in time between an initial crawl and recrawl with a
> full crawldb
> From: yangxiao9...@gmail.com
> To: nutch-user@lucene.apache.org
>
> If you crawl with "bin/nutch crawl ...
http://www.digitalpebble.com
2008/12/8 José Mestre <[EMAIL PROTECTED]>
> Hi again, I have no answer yet.
> Why are my documents unfetched when I do a recrawl, please?
>
> Thanks.
>
> José Mestre
>
> -Original Message-
> From: José Mestre [mailto:[EMAIL PROTECTED]
If you crawl with "bin/nutch crawl ..." command without deleting the
crawldb. The result will be the same with recrawl. It only wastes the
initial injection phase and crawldb update phase, but that won't
affect the final result.
On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM wro
Regards,
Stefan
Matthew Holt wrote:
> Just FYI... after I do the recrawl, I stop and start Tomcat, and still
> the newly created page cannot be found.
>
> Matthew Holt wrote:
>
>> The recrawl worked this time, and I recrawled the entire db using the
>> -adddays
> Subject: Is it necessary to restart the Servlet/JSP container after a
> recrawl?
>
> I have a question about nutch recrawl: every time after a recrawl, if I
> don't restart Tomcat, I get 0 search results. Is it necessary to restart
> the container?
>
>
Does anyone have a good Intranet recrawl script for nutch-0.8.0? Thanks..
Matt
Hi,
When the recrawl is done, the app server requires a restart for the
new indexes to be reflected.
If the folder where the recrawl is done is pointed to by the web app, a
folder named merge-output is created inside the index folder once the
recrawl is completed.
Is there any way to
> In the first crawl I have no problems, but when I recrawl in my crawl
> database there are pages and links from the previous operation, so if
> I first crawl with depth 1 and later I recrawl with depth 1 again it is
> like a depth-2 crawl. For example:
> I make a d
The recrawl worked this time, and I recrawled the entire db using the
-adddays argument (in my case ./recrawl crawl 10 31). However, it didn't
find a newly created page.
If I delete the database and do the initial crawl over again, the new
page is found. Any idea what I'm doing wr
The crawl command uses "crawl-tool.xml" as its default Nutch config, but the recrawl
script uses "nutch-site.xml". So just copy all the configuration in "crawl-tool.xml"
to "nutch-site.xml". Concerning the selection of "crawl-urlfilter.txt
Hi,
62 docs are in the index.
José
From: Alexander Aristov [EMAIL PROTECTED]
Sent: Tuesday, 2 December 2008 06:58
To: nutch-user@lucene.apache.org
Subject: Re: RE: Problem with crawl and recrawl
Maybe silly question but
How to know how many
Reader's Digest version:
How can I ensure that nutch only crawls the urls I inject into the fetchlist
and not recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?
Long version:
My setup is simple. I crawl a number of internet forums. This req
On 9/6/06, Andrei Hajdukewycz <[EMAIL PROTECTED]> wrote:
Another problem I've noticed is that it seems the db grows *rapidly* with each
successive recrawl. Mine started at 379MB, and it seems to increase by roughly
350MB every time I run a recrawl, despite there not being anywher
Thanks for the explanation.
So if I understood correctly, using the separate commands I don't have to run as many
times as I did in the initial crawl (with depth 10).
In my recrawl I'm also doing it in a loop of 10!! Am I wrong looping 10 times
(generating, fetching, parsing, updating)?? m
Thanks! You're the man!!!
Now I can automate this thing :).
Steve
http://www.stevekallestad.com/
On 2/8/07, chee wu <[EMAIL PROTECTED]> wrote:
The crawl command use "crawl-tool.xml" as default nutch config,but the recrawl script use "nutch-site". So just copy
Here is the result with a recrawl:
CrawlDb statistics start: crawl_fetcher/crawldb
Statistics for CrawlDb: crawl_fetcher/crawldb
TOTAL urls: 3266
retry 0: 3266
min score: 0.19
avg score: 1.0285031
max score: 10.229
status 1 (DB_unfetched): 3204
status 2
I'm looking over the Intranet Recrawl script here:
http://wiki.apache.org/nutch/IntranetRecrawl
and I'm a little confused about segment merging and deleting.
Start code snip
# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $crawl_dir/segments
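In the wiki version, that section continues by swapping the merged output
in for the old segments; a sketch (verify against the wiki page):

# Drop the now-merged segments and move the merged one into place
for segment in `ls -d $crawl_dir/segments/* | sort -n`
do
  echo "Removing segment: $segment"
  rm -rf $segment
done
cp -R $mergesegs_dir/* $crawl_dir/segments
rm -rf $mergesegs_dir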
Hi,
I am also working on the recrawl, but for a file system;
everything works fine. I modified a file and I want to do a recrawl to
index this new version of my file.
I am using nutch-0.8.1 and I tested the recrawl script and the merge script;
they work fine, but to build the new index
hi,
I'm new to nutch. I have crawled my website. But how can I recrawl/refresh the
index without deleting the crawl folder?
kind regards
frank
Hi Nahuel!
You could use the command bin/nutch inject $nutch-dir/db -urlfile
urlfile.txt. To recrawl your WebDB you can use this
script.<http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html>
Take a look at the adddays argument and at the configuration pr
t of ideas on this. Any suggestions will be quite welcome.
>>>
>>> Here is my set up:
>>>
>>> RAM: 4G
>>> JVM HEAP: 2G
>>> mapred.child.java.opts = 1024M
>>> hadoop-0.19.1-core.jar
>>> nutch-1.0
>>> Xen VPS.
>>>
-Original Message-
From: xiao yang [mailto:yangxiao9...@gmail.com]
Sent: Wednesday, December 16, 2009 2:21 PM
To: nutch-user@lucene.apache.org
Subject: Re: difference in time between an initial crawl and recrawl
with a full crawldb
It
Hello,
I was searching for a method to add new urls to the crawling url list,
and for how to recrawl all urls...
Can you help me?
thanks,
--
Nahuel ANGELINETTI
In my case I didn't notice that, but maybe recrawling with a full crawldb
is quicker than the initial crawl... but I need someone to tell me whether
I'm right or not, maybe with some metrics.
> Subject: RE: difference in time between an initial crawl and recrawl with a
> f
I'm having multiple problems recrawling with nutch 0.9. Here are 2
questions. :-)
Right now, using the script I find here (
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
), I think I'm close to a workable solution, but the recrawl doesn't
re
Thanks for putting up with all the messages to the list... Here is the
recrawl script for 0.8.0 if anyone is interested.
Matt
---
#!/bin/bash
# Nutch recrawl script.
# Based on 0.7.2 script at
http://today.java.net/pub/a/today/2006/02/16/introduction-to
Where can I download Nutch version 0.8? I can't find it on the Nutch website.
Matthew Holt wrote:
Does anyone have a good Intranet recrawl script for nutch-0.8.0? Thanks..
Matt
Maybe a silly question, but
how do I know how many docs are in the index?
thanks
Alex
2008/12/2 José Mestre <[EMAIL PROTECTED]>
> Here is the result with a recrawl:
>
> CrawlDb statistics start: crawl_fetcher/crawldb
> Statistics for CrawlDb: crawl_fetcher/crawldb
> TOTAL url
bin/nutch inject $nutch-dir/db -urlfile
> urlfile.txt. To recrawl your WebDB you can use this
> script.<http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html>
>
> Take a look at the adddays argument and at the configuration property
> db.default.fetch.inter
I have a question about nutch recrawl: every time after a recrawl, if I
don't restart Tomcat, I get 0 search results. Is it necessary to restart
the container?
Subject: Re: difference in time between an initial crawl and recrawl with a
> full crawldb
> From: mille...@gmail.com
> To: nutch-user@lucene.apache.org
>
> Wait 30 days and you should see the difference... Since settings are
> time-based, if you crawl every day or hour it doesn
Another problem I've noticed is that it seems the db grows *rapidly* with each
successive recrawl. Mine started at 379MB, and it seems to increase by roughly
350MB every time I run a recrawl, despite there not being anywhere near that
many additional pages.
This seems like a pretty s
kevin wrote:
Where can I download Nutch version 0.8? I can't find it on the Nutch website.
Matthew Holt wrote:
Does anyone have a good Intranet recrawl script for nutch-0.8.0?
Thanks..
Matt
From trunk in the SVN repository.
How can I recrawl a specific web page? For example, I have an html page that
is constantly updated. Is there a command for that?
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Nutch will initially recrawl the urls every "interval.default" seconds.
If it finds that a page has not changed, it will increase the
interval up to a limit, "interval.max".
That is, if you don't delete the whole crawldb every day, like you seem to do.
So in your case a
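For the 1.x-style property names used here, the overrides would go in
conf/nutch-site.xml; a sketch matching the 6-hour/24-hour setup discussed
in this thread (names and units assumed from nutch-default.xml, values in
seconds):

<property>
  <name>db.fetch.interval.default</name>
  <value>21600</value>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>86400</value>
</property>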
Just letting everyone know that I updated the recrawl script on the
wiki. It now merges the created segments, then deletes the old segs to
prevent a lot of unneeded data remaining/growing on the hard drive.
Matt
http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head
hi,
I am sure this has been asked before; I can't find a satisfactory answer
in the forums.
For recrawling and limiting fetching to pages 30+ days old, I am using
adddays=30, but it seems to still recrawl everything!
What's the best way to configure this?
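One hedged answer: -adddays shifts the clock forward at generate time, so
adddays=30 against a 30-day interval makes every page look due, which
would explain the behaviour above. To fetch only pages whose refetch time
has actually passed, leave adddays at its default of 0 (paths assumed):

# Selects only urls whose scheduled refetch time has already arrived
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 0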
You can use bin/hadoop fs -rmr crawl to delete the whole directory and
recrawl.
On Tue, Jul 7, 2009 at 1:47 AM, Maurizio Croci wrote:
> Hi, I try to REcrawl (with a shell-script. I have already a webDB...) a
> website (with some links to other webpage, .html, .doc, .pdf, ...) but this
I'm hoping that my emails actually reach other people, as they've been
ignored so far.
I just ran a recrawl today to crawl a few injected URLs that I have. At the
end of the recrawl I received the following error:
060623 122916 merging segment indexes to:
/home/honda/nutch-0
er you fetched
>> everything use:
>>
>> nutch invertlinks ...
>> nutch index ...
>>
>> Hope that helps. Otherwise let me know and I'll dig out the complete
>> commandlines for you.
>>
>>
>> Regards,
>> Stefan
>>
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl
Nutch only recrawls every 30 days by default. So set the numberDays
adequately and it will recrawl; read nutch-default.xml to get the
details.
2009/12/9, xiao yang :
> What do you mean by "recrawl"?
> Does the follow
to run a Nutch re-crawl
if [ -n "$1" ]
then
crawl_dir=$1
else
echo "Usage: recrawl crawl_dir [depth] [adddays]"
exit 1
fi
if [ -n "$2" ]
then
depth=$2
else
depth=5
fi
if [ -n "$3" ]
then
adddays=$3
else
adddays=0
fi
webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
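Usage for the script above would then be, for example (hypothetical
crawl directory):

./recrawl crawl 5 30   # crawl_dir=crawl, depth=5, adddays=30
./recrawl crawl        # defaults: depth=5, adddays=0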
Honda-Search Administrator wrote:
Reader's Digest version:
How can I ensure that nutch only crawls the urls I inject into the
fetchlist and not recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?
Long version:
My setup is simple. I crawl a n
nutch readdb crawl_fetcher/crawldb -stats
José Mestre
-Original Message-
From: Julien Nioche [mailto:[EMAIL PROTECTED]
Sent: Monday, 8 December 2008 18:22
To: nutch-user@lucene.apache.org
Subject: Re: RE: Problem with crawl and recrawl
Hello Jose,
Sorry if I am suggesting something
OK :)
So I also have to set interval.max, because I haven't yet! Now it is 90 days,
so I will set it to 24 hours and give it a try.
Thanks very much.
> Date: Fri, 18 Dec 2009 17:45:42 +0100
> Subject: Re: difference in time between an initial crawl and recrawl with a
> full crawl
Another question on a similar subject:
For example, I did a recrawl and found some new pages. I want to add
them to my index. I use the Indexer to create indexes. How can I add these
indexes to my already existing index now?
If I just merge indexes, I'll lose documents placed in my old index.
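A sketch of one way to do that with the stock tools: bin/nutch merge
(IndexMerger) writes a brand-new combined index and leaves both inputs
untouched, so the old documents should be kept (argument order assumed;
check by running bin/nutch merge with no arguments):

# Dedup first, then merge old and new indexes into a fresh directory
bin/nutch dedup crawl/indexes crawl/indexes_new
bin/nutch merge crawl/index_merged crawl/indexes crawl/indexes_new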
his crawl directory? If
yes, I have written a small patch that can be used to recrawl over the
same crawl directory by adding a "-force" option to the "bin/nutch
crawl" command line. With this patch, one can crawl and recrawl in the
following manner:
bin/nutch crawl url
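With that patch applied, the recrawl presumably becomes a second run of
the same command plus the new flag; a sketch (note that -force only
exists with this patch, not in stock Nutch):

bin/nutch crawl urls -dir crawl -depth 3 -topN 50
# later, reusing the same crawl directory:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -force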
sequence of execution;
step 1 setup.
* first crawl was done using "bin/nutch crawl.."
- urls = 1500
- depth = 10
- topN = 500
(so it should do all by round 3 right? what happens at rounds 4 to 10?)
step 2 to 5 setup.
* recrawl (repeat)
- topN = 1
- depth = 10
- db.default.fetch.int
I'm currently pretty busy at work. If I have time, I'll do it later.
The 0.8 recrawl script now has a working version online. I
temporarily modified it on the website yesterday when I ran into some
problems, but I tested it further and the actual working code is
in place now. So
changed during a certain period, so I need to fetch all of them. But I want
to index only those files/urls whose content has changed since the
last crawl, so that the recrawl time gets reduced. Actually my re-crawl is
taking more time than a fresh crawl. How can I improve the time spent
Hi everyone,
I have browsed through the nutch documentation but I have not found enough
information on how to recrawl the urls that I have already crawled. Do we
have to do the recrawling ourselves, or will the nutch application do it?
More information on this regard will be highly appreciated
> -Original Message-
> From: MilleBii [mailto:mille...@gmail.com]
> Sent: Wednesday, Decem
Reader's Digest version:
How can I ensure that nutch only crawls the urls I inject into the
fetchlist and not recrawl the entire webdb?
Can anyone explain to me (in simple terms) exactly what adddays does?
Long version:
My setup is simple. I crawl a number of internet forums. This requires
me to sca
Regards,
Stefan
Matthew Holt wrote:
Just FYI... after I do the recrawl, I stop and start Tomcat, and still
the newly created page cannot be found.
Matthew Holt wrote:
The recrawl worked this time, and I recrawled the entire db using the
-adddays argument (in my case ./recrawl cra
Hi Matt and Lourival,
Matt, thank you for the recrawl script. Any plans to commit it to trunk?
Lourival, here's the part of the script that "reloads Tomcat"; not the cleanest,
but it should work:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml
HTH,
Renaud
Lourival Jú
Now I've crawled a site and used the
http://wiki.apache.org/nutch/IntranetRecrawl script to recrawl, but I want to
add more urls to crawl and merge them. Can I add urls directly to the current
urls dir, or should I run a new crawl and then merge?
I used nutch inject to add a url and ran the re
Would it be recommended to use hadoop for crawling (100 sites with 1000
pages each) on a single machine? What would be the benefit?
Something like described on:
http://wiki.apache.org/nutch/NutchHadoopTutorial but on a single
machine.
Or is the simple crawl/recrawl (without hadoop, like
I'm sure there is a good answer for this, whether it be lack of time or
not enough demand, but I was just wondering why there is not a 'recrawl'
option that goes with the intranet crawl. I'm looking into making one
for myself, and was just wondering if one is in development
Hi, Matthew.
Could you please show your reindex script once again?
You wrote, 12 July 2006, 1:51:21:
> Honda-Search Administrator wrote:
>> Reader's Digest version:
>> How can I ensure that nutch only crawls the urls I inject into the
>> fetchlist and not recraw
Renaud Richardet wrote:
Hi Matt and Lourival,
Matt, thank you for the recrawl script. Any plans to commit it to trunk?
Lourival, here's the part of the script that "reloads Tomcat"; not the
cleanest, but it should work:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.x
The recrawl script for 0.9 that I found at
http://wiki.apache.org/nutch/IntranetRecrawl is not working. It works
the first time successfully. The second time, it fails with this error:
merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory crawl
I use this script to crawl and recrawl the web:
http://wiki.apache.org/nutch/Crawl
I noticed that the database grows very slowly (depth=2, topN=1000, adddays=30)
because it fetches the same urls several times in different recrawl loops.
What should I do to force Nutch to fetch ONLY unfetched urls from
NutchDeveloper wrote:
I use this script to crawl and recrawl the web:
http://wiki.apache.org/nutch/Crawl
I noticed that the database grows very slowly (depth=2, topN=1000, adddays=30)
because it fetches the same urls several times in different recrawl loops.
What should I do to force Nutch to fetch ONLY
Please oh please, don't shoot me for being a newbie.
I have set up a site-search using nutch, and I have the
crawl-urlfilter.txt file configured so that everything works properly
when I call something similar to:
bin/nutch crawl urls -dir crawl -depth 3 -topN 100
I grabbed the Intranet Re
Hi there,
I have the following problem to solve:
I have already crawled a couple of domains and can also recrawl them frequently.
But what if I want to add additional domains to my crawl later on?
I can imagine two solutions:
1. Add the new domain somehow to the "crawldb" so it is considered
What do you mean by "recrawl"?
Does the following command meet what you need?
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Change the destination directory to a different one with the last crawl.
On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya wrote:
> I'm running Nu
[EMAIL PROTECTED]> wrote:
I'm hoping that my emails actually reach other people, as they've been
ignored so far.
I just ran a recrawl today to crawl a few injected URLs that I have. At the
end of the recrawl I received the following error:
060623 122916 merging segment indexes to:
/ho
expanding to
the desired depth from these seed URLs.
Again, thank you for answering.
Ismael
2007/8/25, John Mendenhall <[EMAIL PROTECTED]>:
> > In the first crawl I have no problems, but when I recrawl in my crawl
> > database there are pages and links from the previous oper
s: Hadoop implements MapReduce, using the HDFS
If there is no distributed file system over computer nodes (single
machine configuration), what does Hadoop do?
When running the crawl/recrawl cycle -> generate/fetch/update,
what processes is Hadoop running? How can I monitor them to see what is
going
Stefan,
The nutch-user mailing list seems to be down, or at least unavailable
to my personal account. I have spent several hours looking into
creating/modifying an Intranet recrawl script for 0.8.0. I have it to where
it does not error out; however, when I search for something using the