Ok. However, a few minutes ago I ran the script exactly as you said and I
still get this error:
Exception in thread "main" java.io.IOException: Cannot delete _0.f0
    at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
    at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
    at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
    at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
    at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
I don't know for sure, but I think it occurs because Nutch tries to delete a
file that Tomcat still has open in memory, giving a permission/access error.
Any idea?
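One quick way to check that hypothesis (just a sketch, and it assumes
Sysinternals' handle.exe is installed on the Windows box -- it is not part of
Nutch or Cygwin) is to list which process is still holding the file that the
merge step fails to delete:

# Sketch: show which process has the index file open (assumes handle.exe
# from Sysinternals is on the PATH).
handle.exe _0.f0
# If this points at Tomcat's java.exe, the merge cannot delete the old
# index files until Tomcat releases them.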
On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
Lourival Júnior wrote:
> I think it won't work for me because I'm using Nutch version 0.7.2.
> Actually I use this script (some comments were originally in Portuguese):
>
> #!/bin/bash
>
> # A simple script to run a Nutch re-crawl
> # Source of the script:
> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>
> #{
>
> if [ -n "$1" ]
> then
> crawl_dir=$1
> else
> echo "Usage: recrawl crawl_dir [depth] [adddays]"
> exit 1
> fi
>
> if [ -n "$2" ]
> then
> depth=$2
> else
> depth=5
> fi
>
> if [ -n "$3" ]
> then
> adddays=$3
> else
> adddays=0
> fi
>
> webdb_dir=$crawl_dir/db
> segments_dir=$crawl_dir/segments
> index_dir=$crawl_dir/index
>
> # Stop the Tomcat service
> #net stop "Apache Tomcat"
>
> # The generate/fetch/update cycle
> for ((i=1; i <= depth ; i++))
> do
> bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> segment=`ls -d $segments_dir/* | tail -1`
> bin/nutch fetch $segment
> bin/nutch updatedb $webdb_dir $segment
> echo
> echo "Fim do ciclo $i."
> echo
> done
>
> # Update segments
> echo
> echo "Atualizando os Segmentos..."
> echo
> mkdir tmp
> bin/nutch updatesegs $webdb_dir $segments_dir tmp
> rm -R tmp
>
> # Index segments
> echo "Indexando os segmentos..."
> echo
> for segment in `ls -d $segments_dir/* | tail -$depth`
> do
> bin/nutch index $segment
> done
>
> # De-duplicate indexes
> # "bogus" argument is ignored but needed due to
> # a bug in the number of args expected
> bin/nutch dedup $segments_dir bogus
>
> # Merge indexes
> #echo "Unindo os segmentos..."
> #echo
> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>
> chmod -R 777 $index_dir
>
> # Start the Tomcat service
> #net start "Apache Tomcat"
>
> echo "Fim."
>
> #} > recrawl.log 2>&1
>
> As you suggested, I used the touch command instead of stopping Tomcat.
> However, I still get the error posted in the previous message. I'm running
> Nutch on the Windows platform with Cygwin. I only get no errors when I stop
> Tomcat. I use this command to call the script:
>
> ./recrawl crawl-legislacao 1
>
> Could you give me some more clarification?
>
> Thanks a lot!
>
> On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>
>> Lourival Júnior wrote:
>> > Hi Renaud!
>> >
>> > I'm a newbie with shell scripts and I know stopping the Tomcat service
>> > is not the best way to do this. The problem is, when I run the re-crawl
>> > script with Tomcat started I get this error:
>> >
>> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
>> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>> >     at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>> >     at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>> >     at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>> >     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>> >     at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>> >     at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>> >
>> > So, I want another way to re-crawl my pages without this error and
>> > without restarting Tomcat. Could you suggest one?
>> >
>> > Thanks a lot!
>> >
>> >
>> Try this updated script and tell me exactly what command you run to call
>> the script. Let me know the error message then.
>>
>> Matt
>>
>>
>> #!/bin/bash
>>
>> # Nutch recrawl script.
>> # Based on 0.7.2 script at
>> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> # Modified by Matthew Holt
>>
>> if [ -n "$1" ]
>> then
>> nutch_dir=$1
>> else
>> echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>> echo "servlet_path - Path of the nutch servlet (i.e.
>> /usr/local/tomcat/webapps/ROOT)"
>> echo "crawl_dir - Name of the directory the crawl is located in."
>> echo "[depth] - The link depth from the root page that should be
>> crawled."
>> echo "[adddays] - Advance the clock # of days for fetchlist
>> generation."
>> exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>> crawl_dir=$2
>> else
>> echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>> echo "servlet_path - Path of the nutch servlet (i.e.
>> /usr/local/tomcat/webapps/ROOT)"
>> echo "crawl_dir - Name of the directory the crawl is located in."
>> echo "[depth] - The link depth from the root page that should be
>> crawled."
>> echo "[adddays] - Advance the clock # of days for fetchlist
>> generation."
>> exit 1
>> fi
>>
>> if [ -n "$3" ]
>> then
>> depth=$3
>> else
>> depth=5
>> fi
>>
>> if [ -n "$4" ]
>> then
>> adddays=$4
>> else
>> adddays=0
>> fi
>>
>> # Only change if your crawl subdirectories are named something different
>> webdb_dir=$crawl_dir/crawldb
>> segments_dir=$crawl_dir/segments
>> linkdb_dir=$crawl_dir/linkdb
>> index_dir=$crawl_dir/index
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>> bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>> segment=`ls -d $segments_dir/* | tail -1`
>> bin/nutch fetch $segment
>> bin/nutch updatedb $webdb_dir $segment
>> done
>>
>> # Update segments
>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>
>> # Index segments
>> new_indexes=$crawl_dir/newindexes
>> #ls -d $segments_dir/* | tail -$depth | xargs
>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>
>> # De-duplicate indexes
>> bin/nutch dedup $new_indexes
>>
>> # Merge indexes
>> bin/nutch merge $index_dir $new_indexes
>>
>> # Tell Tomcat to reload index
>> touch $nutch_dir/WEB-INF/web.xml
>>
>> # Clean up
>> rm -rf $new_indexes
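>>
>> For example, assuming the Nutch webapp is deployed as Tomcat's ROOT under
>> /usr/local/tomcat (both paths are just assumptions -- adjust them to your
>> setup), the call would look like:
>>
>> ./recrawl /usr/local/tomcat/webapps/ROOT crawl-legislacao 1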
>>
>>
>
>
Oh yeah, you're right, the one I sent out was for 0.8... you should just be
able to put this at the end of your script:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml
and fill in the appropriate path, of course.
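For example (the Tomcat path below is only an assumption -- point it at
wherever the Nutch webapp is actually deployed), the end of the 0.7.2 script
would look something like this:

# Tell Tomcat to reload the index.
# The path is an example/assumption; set nutch_dir to the deployed webapp.
nutch_dir=/usr/local/tomcat/webapps/ROOT
touch $nutch_dir/WEB-INF/web.xml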
gluck
matt
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]