Sorry to be asking so many questions. Below is the current script I'm using. It's indexing the segments, so do I use invertlinks directly after the fetch? I'm kind of confused. Thanks.
matt
-------------------------------------------------------
#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
---------------------------------------------------------------

Stefan Neufeind wrote:

>You miss actually indexing the pages :-) This is done inside the
>"crawl"-command which does everything in one. After you fetched
>everything use:
>
>nutch invertlinks ...
>nutch index ...
>
>Hope that helps. Otherwise let me know and I'll dig out the complete
>commandlines for you.
>
>
>Regards,
> Stefan
>
>Matthew Holt wrote:
>
>>Just FYI.. After I do the recrawl, I do stop and start tomcat, and still
>>the newly created page can not be found.
>>
>>Matthew Holt wrote:
>>
>>>The recrawl worked this time, and I recrawled the entire db using the
>>>-adddays argument (in my case ./recrawl crawl 10 31). However, it
>>>didn't find a newly created page.
>>>
>>>If I delete the database and do the initial crawl over again, the new
>>>page is found. Any idea what I'm doing wrong or why it isn't finding it?
>>>
>>>Thanks!
>>>Matt
>>>
>>>Matthew Holt wrote:
>>>
>>>>Stefan,
>>>> Thanks a bunch! I see what you mean..
>>>>matt
>>>>
>>>>Stefan Neufeind wrote:
>>>>
>>>>>Matthew Holt wrote:
>>>>>
>>>>>>Hi all,
>>>>>> I have already successfully indexed all the files on my domain only
>>>>>>(as specified in the conf/crawl-urlfilter.txt file).
>>>>>>
>>>>>>Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>>>domain, it begins indexing pages off of my domain (such as wikipedia,
>>>>>>etc). How do I prevent this? Thanks!
>>>>>>
>>>>>
>>>>>Hi Matt,
>>>>>
>>>>>have a look at regex-urlfilter. "crawl" is special in some ways.
>>>>>Actually it's "shortcut" for several steps. And it has a special
>>>>>urlfilter-file. But if you do it in several steps that
>>>>>urlfilter-file is no longer used.
>>>>>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
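
On the ordering question: one reading of Stefan's reply ("After you fetched everything use: nutch invertlinks ... nutch index ...") is that invertlinks and index run once, after the whole generate/fetch/updatedb loop has finished, and not after each individual fetch. A minimal dry-run sketch of that ordering is below. plan_recrawl is a made-up helper name, it only prints step names and never calls bin/nutch, and the "..." arguments stay elided because the thread does not give them.

```shell
#!/bin/bash
# Dry run only: prints the order of the steps, executes nothing.
# plan_recrawl is a hypothetical helper, not part of Nutch.
# The "..." arguments are left elided, exactly as in Stefan's reply.
plan_recrawl() {
  local depth=${1:-5}   # same default depth as the recrawl script
  local i
  for ((i = 1; i <= depth; i++)); do
    echo "generate (round $i)"
    echo "fetch (round $i)"
    echo "updatedb (round $i)"
  done
  # Assumption drawn from the reply: these run once, after all rounds.
  echo "invertlinks ..."
  echo "index ..."
}

plan_recrawl 2
```

If that reading is right, the per-segment "bin/nutch index $segment" loop in the script would be the part to revisit, with the exact command lines still to be confirmed for the Nutch version in use.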
