Hi list,

I am trying to use this script with Hadoop, but it doesn't work. I tried replacing ls with bin/hadoop dfs -ls, but the script still fails because it relies on ls -d rather than plain ls. Can someone help me?

Best regards,
Roberto Navoni
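One possible way around the missing ls -d is to list the segments directory itself and pick the newest entry out of that listing. The function below is only a sketch under stated assumptions: latest_segment is a hypothetical helper name, the exact column layout printed by bin/hadoop dfs -ls is not confirmed here, and the code only assumes that each entry's full path shows up as one whitespace-separated field. It also relies on segment names being timestamps, so that a plain sort puts the newest one last.

```shell
#!/bin/bash

# Hypothetical helper: read a DFS directory listing on stdin and print the
# newest segment path found in it.
# Assumptions (not confirmed by the thread):
#   - each entry's full path appears as a single whitespace-separated field
#   - that field contains "<segments_dir>/"
#   - segment names are timestamps, so lexical order equals age order
latest_segment() {
  local segments_dir=$1
  awk -v dir="$segments_dir/" '
    {
      # Print the first field on the line that contains the segments path.
      for (i = 1; i <= NF; i++) {
        if (index($i, dir) > 0) { print $i; break }
      }
    }' | sort | tail -n 1
}

# Possible use in place of the `ls -d ... | tail -1` line of the script:
#   segment=$(bin/hadoop dfs -ls "$segments_dir" | latest_segment "$segments_dir")
```

Header lines such as an item count are skipped because none of their fields contain the segments path. If the listing prints absolute DFS paths, the value stored in segment will be absolute as well.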
-----Original Message-----
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Friday, 21 July 2006 18:58
To: [email protected]
Subject: Re: Recrawl script for 0.8.0 completed...

Lourival Júnior wrote:
> I think it won't work for me because I'm using Nutch version 0.7.2.
> Actually, I use this script (some comments were originally in Portuguese):
>
> #!/bin/bash
>
> # A simple script to run a Nutch re-crawl
> # Script source:
> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>
> #{
>
> if [ -n "$1" ]
> then
>   crawl_dir=$1
> else
>   echo "Usage: recrawl crawl_dir [depth] [adddays]"
>   exit 1
> fi
>
> if [ -n "$2" ]
> then
>   depth=$2
> else
>   depth=5
> fi
>
> if [ -n "$3" ]
> then
>   adddays=$3
> else
>   adddays=0
> fi
>
> webdb_dir=$crawl_dir/db
> segments_dir=$crawl_dir/segments
> index_dir=$crawl_dir/index
>
> # Stop the Tomcat service
> #net stop "Apache Tomcat"
>
> # The generate/fetch/update cycle
> for ((i=1; i <= depth ; i++))
> do
>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>   segment=`ls -d $segments_dir/* | tail -1`
>   bin/nutch fetch $segment
>   bin/nutch updatedb $webdb_dir $segment
>   echo
>   echo "End of cycle $i."
>   echo
> done
>
> # Update segments
> echo
> echo "Updating the segments..."
> echo
> mkdir tmp
> bin/nutch updatesegs $webdb_dir $segments_dir tmp
> rm -R tmp
>
> # Index segments
> echo "Indexing the segments..."
> echo
> for segment in `ls -d $segments_dir/* | tail -$depth`
> do
>   bin/nutch index $segment
> done
>
> # De-duplicate indexes
> # "bogus" argument is ignored but needed due to
> # a bug in the number of args expected
> bin/nutch dedup $segments_dir bogus
>
> # Merge indexes
> #echo "Merging the segments..."
> #echo
> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>
> chmod 777 -R $index_dir
>
> # Start the Tomcat service
> #net start "Apache Tomcat"
>
> echo "Done."
>
> #} > recrawl.log 2>&1
>
> As you suggested, I used the touch command instead of stopping Tomcat.
> However, I get the error posted in my previous message. I'm running Nutch
> on Windows with Cygwin. I only get no errors when I stop Tomcat. I use
> this command to call the script:
>
> ./recrawl crawl-legislacao 1
>
> Could you give me more clarification?
>
> Thanks a lot!
>
> On 7/21/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>
>> Lourival Júnior wrote:
>> > Hi Renaud!
>> >
>> > I'm a newbie with shell scripts, and I know stopping the Tomcat service
>> > is not the best way to do this. The problem is that when I run the
>> > re-crawl script with Tomcat started, I get this error:
>> >
>> > 060721 132224 merging segment indexes to: crawl-legislacao2\index
>> > Exception in thread "main" java.io.IOException: Cannot delete _0.f0
>> >     at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
>> >     at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
>> >     at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
>> >     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
>> >     at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
>> >     at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
>> >
>> > So I want another way to re-crawl my pages without this error and
>> > without restarting Tomcat. Could you suggest one?
>> >
>> > Thanks a lot!
>> >
>>
>> Try this updated script and tell me exactly what command you run to call
>> the script. Let me know the error message then.
>>
>> Matt
>>
>>
>> #!/bin/bash
>>
>> # Nutch recrawl script.
>> # Based on 0.7.2 script at
>> # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
>>
>> # Modified by Matthew Holt
>>
>> if [ -n "$1" ]
>> then
>>   nutch_dir=$1
>> else
>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>   echo "[depth] - The link depth from the root page that should be crawled."
>>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
>>   exit 1
>> fi
>>
>> if [ -n "$2" ]
>> then
>>   crawl_dir=$2
>> else
>>   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
>>   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
>>   echo "crawl_dir - Name of the directory the crawl is located in."
>>   echo "[depth] - The link depth from the root page that should be crawled."
>>   echo "[adddays] - Advance the clock # of days for fetchlist generation."
>>   exit 1
>> fi
>>
>> if [ -n "$3" ]
>> then
>>   depth=$3
>> else
>>   depth=5
>> fi
>>
>> if [ -n "$4" ]
>> then
>>   adddays=$4
>> else
>>   adddays=0
>> fi
>>
>> # Only change if your crawl subdirectories are named something different
>> webdb_dir=$crawl_dir/crawldb
>> segments_dir=$crawl_dir/segments
>> linkdb_dir=$crawl_dir/linkdb
>> index_dir=$crawl_dir/index
>>
>> # The generate/fetch/update cycle
>> for ((i=1; i <= depth ; i++))
>> do
>>   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
>>   segment=`ls -d $segments_dir/* | tail -1`
>>   bin/nutch fetch $segment
>>   bin/nutch updatedb $webdb_dir $segment
>> done
>>
>> # Update segments
>> bin/nutch invertlinks $linkdb_dir -dir $segments_dir
>>
>> # Index segments
>> new_indexes=$crawl_dir/newindexes
>> #ls -d $segments_dir/* | tail -$depth | xargs
>> bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
>>
>> # De-duplicate indexes
>> bin/nutch dedup $new_indexes
>>
>> # Merge indexes
>> bin/nutch merge $index_dir $new_indexes
>>
>> # Tell Tomcat to reload index
>> touch $nutch_dir/WEB-INF/web.xml
>>
>> # Clean up
>> rm -rf $new_indexes
>>
>>
>
>

Oh yeah, you're right, the one I sent out was for 0.8. You should just be able to put this at the end of your script:

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

and fill in the appropriate path, of course. Good luck,
Matt

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
