Re: AW: AW: AW: How to index while fetcher works

Bartosz Gadzimski Thu, 19 Feb 2009 14:23:32 -0800

Dear Nadine,

Your case is very interesting. Can you tell us more about how to deal with such a situation? As you said, it looks like you have to rank news according to dates; how are you achieving that? Keeping sites up to date looks like a really cool feature.

Anyway, I am surprised that you are using the Nutch crawler for such a specific field. I would use something like content scraping (very popular in SEO and spam when you know your source, just PHP + regexp :) Of course, you can use this only when you know your source website.

Thank you for the advice about large segments. I must remember this; it causes a lot of problems (starting with waiting for the fetch job to finish and, as you said, later problems with merging and indexing).


In my quick tests on a server with an Intel dual core 2GHz, 2GB RAM and a 250GB SATA hdd,
invertlinks on a 1.5GB segment took 22 minutes, which is a little bit long.

Regards,
Bartosz


Höchstötter Nadine wrote:

Hi, we do news crawling, which is why we have different ranking issues, such as up-to-dateness and article recognition.

I have two scripts, one for the generate, fetch, parse cycle, where I also 
update crawldb and linkdb. And another script to merge segments and build 
indexes. For me, it is most important to have the newest pages of websites. For 
you it will be better to have all, but not every page will be updated that 
frequently, so if you fetch them regularly, you will have them all after a 
while. But long crawl cycles produce huge segments, and we had performance 
problems merging and indexing them quickly.
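
A minimal sketch of what the first of those two scripts (one short generate/fetch/update cycle) might look like is given below. It is not Nadine's actual script: the `NUTCH`, `CRAWL_DIR`, `TOPN` and `SEPARATE_PARSE` variables and the `run_cycle` name are assumptions made for illustration. It also assumes the fetcher parses while fetching unless `SEPARATE_PARSE=yes` is set. Only the standard `bin/nutch` subcommands mentioned in this thread are used.

```shell
#!/bin/sh
# Sketch of one short crawl cycle. All paths and variable names are assumptions.
NUTCH=${NUTCH:-bin/nutch}        # path to the nutch launcher (assumed)
CRAWL_DIR=${CRAWL_DIR:-crawls}   # working crawl directory (assumed)
TOPN=${TOPN:-1000}               # a small topN keeps each segment small

run_cycle() {
    # 1. Generate a fetch list; nutch creates a new timestamped segment.
    $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOPN" || return 1

    # 2. Segment names are timestamps, so the last one in sort order is the newest.
    segment=$(ls -d "$CRAWL_DIR"/segments/* | sort | tail -n 1)

    # 3. Fetch it. A separate parse step is run only when explicitly requested
    #    (assumption: by default the fetcher already parses what it fetches).
    $NUTCH fetch "$segment" || return 1
    if [ "${SEPARATE_PARSE:-no}" = yes ]; then
        $NUTCH parse "$segment" || return 1
    fi

    # 4. Fold the results of this one segment back into crawldb and linkdb.
    $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
    $NUTCH invertlinks "$CRAWL_DIR/linkdb" "$segment" || return 1
}
```

Calling `run_cycle` repeatedly (for example from cron) yields many small segments instead of one huge one, which is the point made above about merge and index performance.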

-----Original Message-----

From: Bartosz Gadzimski [mailto:[email protected]]
Sent: Thursday, 19 February 2009 15:38
To: [email protected]
Subject: Re: AW: AW: How to index while fetcher works

Dear Nadine,

So when you are doing depth 1 or depth 2 crawls, can you crawl a whole website? I can imagine that with depth 2 you will crawl the whole website only as links from other pages appear, but it will take a lot of time to get it all. Any modern website has a lot of "levels" to go deep into (guessing 4-5 minimum).


About dmoz - it's only for testing. A good place with a lot of links :)

About the script - I didn't realize that you are not doing invertlinks - is this necessary for proper indexing and searching?


Thanks,
Bartosz

Höchstötter Nadine wrote:

We also do depth 1 or 2 crawls, so the crawldb is also up to date.
Be careful with Dmoz, there is a lot of spam out there.

The loop is also useful for inverting links etc., whenever it is important to work on single segments and not the whole directory.

-----Original Message-----

From: Bartosz Gadzimski [mailto:[email protected]]
Sent: Thursday, 19 February 2009 14:56
To: [email protected]
Subject: Re: AW: How to index while fetcher works

Thanks Nadine, I am a few days ahead thanks to your script :)

Nutch is a really nice piece of software; it just takes time to get to know it better.

Regards,
Bartosz

Höchstötter Nadine wrote:

Hi. This is my version of an incremental index: I have one working dir for all 
the new segments flying in and a routine every four hours to build a new index 
for a special webindex folder which is nearly up to date.
I merge segments into another folder, named with a YYYYMMDDHH pattern, in my 
working segment dir. With this I can always recognize which segments have 
already been indexed. Move or copy the merged segment under the YYYYMMDDHH 
folder to your fresh webindex segment folder, and also everything under 
$merge_dir (the new index) to your index folder in the webindex dir. This dir 
has the same structure as your working crawl dir.
It is also good for backup reasons. Call the script below from cron and add 
cp, mv, rm, or tar commands wherever you like. I zip my crawldb and linkdb with 
this cron job, too, as a backup.


index_dir=/nutchcrawl/indexes/$CRAWLNAME/index
TIMEH=`date +%Y%m%d%H`
merge_dir=/nutchcrawl/indexes/$CRAWLNAME/indexmerged$TIMEH
# Update segments

# Collect all finished 14-digit (timestamp-named) segments; skip any that
# still contain a _temporary directory.
for segment in `ls -d /nutchcrawl/indexes/$CRAWLNAME/segments/* | grep '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'`
do
  if [ -d "$segment/_temporary" ]; then
    echo "$segment is temporary"
  else
    echo "$segment"
    segments="$segments $segment"
  fi
done
mergesegs_dir=/nutchcrawl/$CRAWLNAME/segments/$TIMEH
bin/nutch mergesegs $mergesegs_dir $segments

indexes=/nutchcrawl/indexes/$CRAWLNAME/indexes$TIMEH

NEW=`ls -d  /nutchcrawl/indexes/$CRAWLNAME/segments/$TIMEH/*`
echo "$NEW"
bin/nutch index $indexes $webdb_dir $linkdb_dir $NEW/

for allindex in `ls -d /nutchcrawl/indexes/$CRAWLNAME/indexes*`
do
allindexes="$allindexes $allindex"
done


bin/nutch merge $merge_dir $allindexes
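
The "move or copy" and backup steps described before the script are left to the reader. One possible sketch is shown here. The function names, the argument layout and the backup file name are hypothetical, and clearing the old index before copying the new one in is an assumption that the original message does not state.

```shell
#!/bin/sh
# Sketch of the publish and backup steps. Names and layout are assumptions.

publish_index() {
    # $1 = merged segment dir (.../segments/YYYYMMDDHH)
    # $2 = merged index dir   ($merge_dir in the script above)
    # $3 = webindex dir, laid out like the working crawl dir
    merged_segment=$1; merged_index=$2; webindex=$3
    [ -n "$webindex" ] || return 1            # never operate on an empty path
    mkdir -p "$webindex/segments" "$webindex/index" || return 1
    # Copy the merged segment next to any previously published ones.
    cp -R "$merged_segment" "$webindex/segments/" || return 1
    # Assumption: the old index contents are dropped before the new ones go in.
    rm -rf "$webindex/index"/* || return 1
    cp -R "$merged_index"/. "$webindex/index/" || return 1
}

backup_dbs() {
    # $1 = crawl dir containing crawldb and linkdb, $2 = backup dir
    stamp=$(date +%Y%m%d%H)
    mkdir -p "$2" || return 1
    # One compressed archive per run, named with the same hourly pattern.
    tar -czf "$2/dbs-$stamp.tar.gz" -C "$1" crawldb linkdb
}
```

A cron job could call `publish_index` and `backup_dbs` right after the `bin/nutch merge` line, using whatever directories match the local layout.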

cheers, Nadine.

-----Original Message-----

From: Doğacan Güney [mailto:[email protected]]
Sent: Thursday, 19 February 2009 12:35
To: [email protected]
Subject: Re: How to index while fetcher works

Hi,


On Thu, Feb 19, 2009 at 13:28, Bartek <[email protected]> wrote:

Hello,

I started to crawl a huge number of websites (dmoz with no limits in
crawl-urlfilter.txt) with -depth 10 and -topN 1 million.

My /tmp/hadoop-root/ is more than 18GB for now (map-reduce jobs)


This fetching will not stop soon :) so I would like to process the segments
already made (updatedb, invertlinks, index), but there are parts missing in them:

[r...@server nutch]# bin/nutch invertlinks crawls/linkdb -dir crawls/segments/20090216142840/

If you use the -dir option, you pass the segments directory, not an individual
segment, e.g.:

bin/nutch invertlinks crawls/linkdb -dir crawls/segments

which will read every directory under segments.

To pass individual directories, skip the -dir option:

bin/nutch invertlinks crawls/linkdb crawls/segments/20090216142840
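
Building on that distinction, here is a hedged sketch of how one might invert links only for segments that are already parsed while a fetch is still running. The `parsed_segments` and `invert_parsed` names and the `NUTCH` variable are made up for illustration. Treating the presence of a `parse_data` subdirectory as "ready" is an assumption, not something Nutch guarantees.

```shell
#!/bin/sh
# Sketch: run invertlinks only on segments that already contain parse_data.
NUTCH=${NUTCH:-bin/nutch}   # path to the nutch launcher (assumed)

parsed_segments() {
    # $1 = parent segments directory, e.g. crawls/segments
    for seg in "$1"/*; do
        # Assumption: a segment that is still being fetched has no parse_data yet.
        if [ -d "$seg/parse_data" ]; then
            printf '%s\n' "$seg"
        fi
    done
}

invert_parsed() {
    # $1 = linkdb path, $2 = parent segments directory
    segs=$(parsed_segments "$2")
    [ -n "$segs" ] || { echo "no parsed segments under $2" >&2; return 1; }
    # No -dir here: every argument after the linkdb is one individual segment.
    # $segs is left unquoted on purpose so it splits into separate arguments
    # (this assumes segment paths contain no whitespace).
    $NUTCH invertlinks "$1" $segs
}
```

For example, `invert_parsed crawls/linkdb crawls/segments` would skip a segment that so far holds only `crawl_generate`.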

LinkDb: adding segment:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate

...

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist:
file:/usr/local/nutch/crawls/segments/20090216142840/crawl_generate/parse_data

etc.

When manually trying to bin/parse segments, it says that they are parsed.


So my question is how to design the whole process of crawling a large number of
websites without limiting it to specific domains (like a regular search
engine, e.g. Google)?

Should I make loops over a small number of links? Like -topN 1000 and then
updatedb, invertlinks, index?


As it stands, I can start crawling but data will only appear after weeks.

I found that in 1.0 (so it is already done) you are introducing live indexing in
Nutch. Are there any docs that I can make use of?

Regards,
Bartosz Gadzimski
