Do you mean that this error occurs only on Windows? I haven't tested on Linux yet. Does anyone have a solution for this problem on Windows/Tomcat?
On 7/25/06, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
Lourival. I have typically seen the same issues on a cygwin/windows setup. The
only thing that worked for me was shutting down and restarting Tomcat, instead
of just reloading the context. On Linux I don't have these issues anymore.

Rgrds, Thomas

On 7/21/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> Ok. However, a few minutes ago I ran the script exactly as you said and I
> still get this error:
>
> Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>         at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>         at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>         at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>         at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>         at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>
> I don't know, but I think it occurs because Nutch tries to delete some file
> that Tomcat has loaded into memory, giving a permission/access error. Any
> idea?
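Lourival's hunch is the likely root cause: on Windows a file cannot be deleted while another process still has it open, so when Tomcat's searcher holds the old index files, IndexMerger's delete fails; on Linux the unlink succeeds and the old files vanish once the last handle closes, which is why Thomas no longer sees the problem there. One way to shrink the Tomcat downtime, instead of stopping it for the whole recrawl, is to merge into a scratch directory Tomcat has never opened and stop the service only for the final swap. A minimal sketch, untested, reusing the 0.7.2 merge invocation, the crawl directory, and the "Apache Tomcat" service name that appear later in this thread:

#!/bin/bash
# Sketch only: merge into a fresh directory so IndexMerger never has
# to delete files that Tomcat still holds open.
crawl_dir=crawl-legislacao         # assumption: crawl dir from this thread
new_index=$crawl_dir/index.new     # scratch dir Tomcat has never opened

# IndexMerger writes into the fresh directory; no locked files involved.
ls -d $crawl_dir/segments/* | xargs bin/nutch merge $new_index

# The swap still replaces files Tomcat has open, so on Windows the
# service must be down, but only for this brief window.
net stop "Apache Tomcat"
rm -rf $crawl_dir/index
mv $new_index $crawl_dir/index
net start "Apache Tomcat"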
> On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > Lourival Júnior wrote:
> > > I think it won't work for me because I'm using Nutch version 0.7.2.
> > > Actually I use this script (comments originally in Portuguese):
> > >
> > > #!/bin/bash
> > >
> > > # A simple script to run a Nutch re-crawl
> > > # Script source:
> > > # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> > >
> > > #{
> > >
> > > if [ -n "$1" ]
> > > then
> > >   crawl_dir=$1
> > > else
> > >   echo "Usage: recrawl crawl_dir [depth] [adddays]"
> > >   exit 1
> > > fi
> > >
> > > if [ -n "$2" ]
> > > then
> > >   depth=$2
> > > else
> > >   depth=5
> > > fi
> > >
> > > if [ -n "$3" ]
> > > then
> > >   adddays=$3
> > > else
> > >   adddays=0
> > > fi
> > >
> > > webdb_dir=$crawl_dir/db
> > > segments_dir=$crawl_dir/segments
> > > index_dir=$crawl_dir/index
> > >
> > > # Stop the Tomcat service
> > > #net stop "Apache Tomcat"
> > >
> > > # The generate/fetch/update cycle
> > > for ((i=1; i <= depth ; i++))
> > > do
> > >   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> > >   segment=`ls -d $segments_dir/* | tail -1`
> > >   bin/nutch fetch $segment
> > >   bin/nutch updatedb $webdb_dir $segment
> > >   echo
> > >   echo "End of cycle $i."
> > >   echo
> > > done
> > >
> > > # Update segments
> > > echo
> > > echo "Updating segments..."
> > > echo
> > > mkdir tmp
> > > bin/nutch updatesegs $webdb_dir $segments_dir tmp
> > > rm -R tmp
> > >
> > > # Index segments
> > > echo "Indexing segments..."
> > > echo
> > > for segment in `ls -d $segments_dir/* | tail -$depth`
> > > do
> > >   bin/nutch index $segment
> > > done
> > >
> > > # De-duplicate indexes
> > > # "bogus" argument is ignored but needed due to
> > > # a bug in the number of args expected
> > > bin/nutch dedup $segments_dir bogus
> > >
> > > # Merge indexes
> > > #echo "Merging the segments..."
> > > #echo
> > > ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
> > >
> > > chmod -R 777 $index_dir
> > >
> > > # Start the Tomcat service
> > > #net start "Apache Tomcat"
> > >
> > > echo "Done."
> > >
> > > #} > recrawl.log 2>&1
> > >
> > > As you suggested, I used the touch command instead of stopping Tomcat.
> > > However, I get the error posted in the previous message. I'm running
> > > Nutch on the Windows platform with cygwin. I only get no errors when I
> > > stop Tomcat. I use this command to call the script:
> > >
> > > ./recrawl crawl-legislacao 1
> > >
> > > Could you give me more clarifications?
> > >
> > > Thanks a lot!
> > >
> > > On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > >> Lourival Júnior wrote:
> > >> > Hi Renaud!
> > >> >
> > >> > I'm a newbie with shell scripts and I know stopping the Tomcat
> > >> > service is not the best way to do this. The problem is, when I run
> > >> > the re-crawl script with Tomcat started I get this error:
> > >> >
> > >> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
> > >> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
> > >> >         at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
> > >> >         at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
> > >> >         at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
> > >> >         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
> > >> >         at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
> > >> >         at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
> > >> >
> > >> > So, I want another way to re-crawl my pages without this error and
> > >> > without restarting Tomcat. Could you suggest one?
> > >> >
> > >> > Thanks a lot!
> > >>
> > >> Try this updated script and tell me what command exactly you run to
> > >> call the script. Let me know the error message then.
> > >>
> > >> Matt
> > >>
> > >> #!/bin/bash
> > >>
> > >> # Nutch recrawl script.
> > >> # Based on the 0.7.2 script at
> > >> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
> > >> # Modified by Matthew Holt
> > >>
> > >> if [ -n "$1" ]
> > >> then
> > >>   nutch_dir=$1
> > >> else
> > >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> > >>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
> > >>   echo "crawl_dir - Name of the directory the crawl is located in."
> > >>   echo "[depth] - The link depth from the root page that should be crawled."
> > >>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
> > >>   exit 1
> > >> fi
> > >>
> > >> if [ -n "$2" ]
> > >> then
> > >>   crawl_dir=$2
> > >> else
> > >>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
> > >>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
> > >>   echo "crawl_dir - Name of the directory the crawl is located in."
> > >>   echo "[depth] - The link depth from the root page that should be crawled."
> > >>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
> > >>   exit 1
> > >> fi
> > >>
> > >> if [ -n "$3" ]
> > >> then
> > >>   depth=$3
> > >> else
> > >>   depth=5
> > >> fi
> > >>
> > >> if [ -n "$4" ]
> > >> then
> > >>   adddays=$4
> > >> else
> > >>   adddays=0
> > >> fi
> > >>
> > >> # Only change if your crawl subdirectories are named something different
> > >> webdb_dir=$crawl_dir/crawldb
> > >> segments_dir=$crawl_dir/segments
> > >> linkdb_dir=$crawl_dir/linkdb
> > >> index_dir=$crawl_dir/index
> > >>
> > >> # The generate/fetch/update cycle
> > >> for ((i=1; i <= depth ; i++))
> > >> do
> > >>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
> > >>   segment=`ls -d $segments_dir/* | tail -1`
> > >>   bin/nutch fetch $segment
> > >>   bin/nutch updatedb $webdb_dir $segment
> > >> done
> > >>
> > >> # Update segments
> > >> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
> > >>
> > >> # Index segments
> > >> new_indexes=$crawl_dir/newindexes
> > >> #ls -d $segments_dir/* | tail -$depth | xargs
> > >> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
> > >>
> > >> # De-duplicate indexes
> > >> bin/nutch dedup $new_indexes
> > >>
> > >> # Merge indexes
> > >> bin/nutch merge $index_dir $new_indexes
> > >>
> > >> # Tell Tomcat to reload index
> > >> touch $nutch_dir/WEB-INF/web.xml
> > >>
> > >> # Clean up
> > >> rm -rf $new_indexes
> >
> > Oh yeah, you're right, the one I sent out was for 0.8. You should just
> > be able to put this at the end of your script:
> >
> > # Tell Tomcat to reload index
> > touch $nutch_dir/WEB-INF/web.xml
> >
> > and fill in the appropriate path, of course.
> >
> > gluck
> > matt
>
> --
> Lourival Junior
> Universidade Federal do Pará
> Curso de Bacharelado em Sistemas de Informação
> http://www.ufpa.br/cbsi
> Msn: [EMAIL PROTECTED]
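Matt's touch trick works because Tomcat notices the changed timestamp on web.xml and reloads the context, which makes the Nutch webapp reopen the index. As Thomas reports above, a context reload is often not enough on Windows, because the old searcher keeps the index files locked, so a helper along these lines may be needed; a sketch, assuming a standard Tomcat layout and the "Apache Tomcat" service name used earlier in this thread (neither verified here):

# Reload the Nutch webapp after a recrawl (sketch, untested).
# $1 - deployed webapp path, e.g. /usr/local/tomcat/webapps/ROOT
reload_index() {
    nutch_dir=$1
    case "$(uname -s)" in
        CYGWIN*)
            # On Windows the running searcher keeps index files locked,
            # so a context reload is not enough; restart the service.
            net stop "Apache Tomcat"
            net start "Apache Tomcat"
            ;;
        *)
            # Elsewhere, touching web.xml makes Tomcat reload the
            # context and reopen the already-replaced index.
            touch "$nutch_dir/WEB-INF/web.xml"
            ;;
    esac
}

reload_index /usr/local/tomcat/webapps/ROOT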
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]