[Nutch Wiki] Update of Whole-Web Crawling incremental script by Gabriele Kahlout
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The Whole-Web Crawling incremental script page has been changed by Gabriele Kahlout.
The comment on this change is: fixed a bug with depth, simplified, and initial support (untested) for resuming crawls.
http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script?action=diff&rev1=16&rev2=17

--
  while [[ $i -lt $depth ]]
  do
+ echo
+ echo generate-fetch-updatedb-invertlinks-index-merge iteration $i:
  cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
  output=`$cmd`
  if [[ $output == *'0 records selected for fetching'* ]]
@@ -289, +291 @@
  depth=$3
  fi
+ j=0
  while [[ $indexedPlus1 -le $allUrls ]] #repeat generate-fetch-updatedb-invertlinks-index-merge loop until all urls are fetched
  do
@@ -296, +299 @@
  tail -n+$indexedPlus1 $seedsDir/urls-local-only | head -n$it_size > $it_seedsDir/urls-local-only
  bin/hadoop dfs -moveFromLocal $it_seedsDir/urls-local-only $it_seedsDir/urls
-
+
- it_crawldb=crawl/crawldb/$j/0
+ it_crawldb=crawl/crawldb/$j
+ if [[ -d $it_crawldb ]] # resuming crawl
+ then
+ bin/hadoop dfs -rmr crawl/segments #should be empty if we indexed them - atomicity at $j level.
+ else
- bin/hadoop dfs -mkdir $it_crawldb
+ bin/hadoop dfs -mkdir $it_crawldb
-
- echo
+ echo
- echoThenRun bin/nutch inject $it_crawldb $it_seedsDir
+ echoThenRun bin/nutch inject $it_crawldb $it_seedsDir
- i=0
-
+ fi
+
+ i=0
  while [[ $i -lt $depth ]] # depth-first
- do
+ do
- echo generate-fetch-updatedb-invertlinks-index-merge iteration $i:
-
- it_crawldb=crawl/crawldb/$j/$i
- echo
  cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
  echo $cmd
@@ -327, +330 @@
  echoThenRun bin/nutch invertlinks crawl/linkdb -dir crawl/segments
- echoThenRun bin/nutch solrindex $solrIndex $it_crawldb crawl/linkdb crawl/segments/*
- # you can now search the index with http://localhost:8080/solr/admin/ (if setup) or http://code.google.com/p/luke/ . The index is stored in $NUTCH_HOME/solr/data/index.
+ # you can now search the index with http://localhost:8080/solr/admin/ (if setup) or http://code.google.com/p/luke/ . The index is stored in crawl/indexes, while if Solr is used then in $NUTCH_HOME/solr/data/index.
+
+ bin/hadoop dfs -rmr crawl/segments/2*
  ((i++))
  ((indexedPlus1+=$it_size)) # maybe should readdb crawl/crawldb -stats number of actually fetched, but (! going to fetch a page) -- infinite loop
  echo
  done
  echoThenRun bin/nutch readdb $it_crawldb -stats
-
- allcrawldb=crawl/allcrawldb
- temp_crawldb=crawl/temp_crawldb
- merge_dbs="$it_crawldb $allcrawldb"
-
- # work-around for https://issues.apache.org/jira/browse/NUTCH-972 (Patch available)
- if [[ ! -d $allcrawldb ]]
- then
- merge_dbs=$it_crawldb
- fi
-
- #echoThenRun bin/nutch mergedb $temp_crawldb $merge_dbs
-
- #rm -r $allcrawldb $it_crawldb crawl/segments crawl/linkdb
- #mv $temp_crawldb $allcrawldb
  ((j++))
  done
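The two ideas this revision relies on, slicing the seed list into batches of $it_size and treating an existing crawl/crawldb/$j directory as a sign that the crawl is being resumed, can be tried in isolation. The following is only a minimal sketch: the helper names seed_batch and batch_state are hypothetical and do not appear in the wiki script, and the directory test assumes a local filesystem path (Hadoop in local mode), as the script's own [[ -d ... ]] check does.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of the wiki script.

# Print up to $3 seed URLs from file $1, starting at 1-based line $2.
# Mirrors the tail/head slicing that feeds each outer ($j) iteration.
seed_batch() {
  local seeds_file=$1 start=$2 size=$3
  tail -n +"$start" "$seeds_file" | head -n "$size"
}

# Print "resume" if the per-batch crawldb directory already exists,
# otherwise "fresh". Assumes a local filesystem path (local-mode Hadoop).
batch_state() {
  local it_crawldb=$1
  if [[ -d $it_crawldb ]]; then
    echo resume
  else
    echo fresh
  fi
}
```

For example, seed_batch seeds/MR6/urls-local-only 16 15 would print lines 16 to 30 of the seed file, and batch_state crawl/crawldb/0 would print resume once that batch has been injected.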
[Nutch Wiki] Update of FabioGiavazzi by FabioGiavazzi
The FabioGiavazzi page has been changed by FabioGiavazzi.
http://wiki.apache.org/nutch/FabioGiavazzi

--
New page:
##language:en
Fabio Giavazzi
Email: - ...
CategoryHomepage
[Nutch Wiki] Update of Incremental Crawling Scripts Test by Gabriele Kahlout
The Incremental Crawling Scripts Test page has been changed by Gabriele Kahlout.
The comment on this change is: updated output.
http://wiki.apache.org/nutch/Incremental%20Crawling%20Scripts%20Test?action=diff&rev1=4&rev2=5

--
  == 3. ==
  {{{
- $ ./whole-web-crawling-incremental urls-input/MR6
+ $ ./whole-web-crawling-incremental -i 15 -d 2 seeds/MR6
- bin/hadoop dfs -rmr crawl
- Deleted file:/Users/simpatico/nutch-1.2/crawl
-
- curl --fail http://localhost:8080/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
- <?xml version="1.0" encoding="UTF-8"?>
- <response>
- <lst name="responseHeader"><int name="status">0</int><int name="QTime">8</int></lst>
- </response>
-
- rmr: cannot remove urls-input/MR6/it_seeds: No such file or directory.
+ rmr: cannot remove seeds/MR6/it_seeds: No such file or directory.
- bin/hadoop dfs -get urls-input/MR6/2urls urls-input/MR6/urls-local-only
+ bin/hadoop dfs -get seeds/MR6/20simple-urls seeds/MR6/urls-local-only
- 2 urls to crawl
+ 20 urls to crawl
- rm: cannot remove urls-input/MR6/it_seeds/urls: No such file or directory.
+ rm: cannot remove seeds/MR6/it_seeds/urls: No such file or directory.
- bin/nutch inject crawl/crawldb/0/0 urls-input/MR6/it_seeds
+ bin/nutch inject crawl/crawldb/0 seeds/MR6/it_seeds
- Injector: starting at 2011-03-28 23:37:13
+ Injector: starting at 2011-03-29 11:46:14
- Injector: crawlDb: crawl/crawldb/0/0
+ Injector: crawlDb: crawl/crawldb/0
- Injector: urlDir: urls-input/MR6/it_seeds
+ Injector: urlDir: seeds/MR6/it_seeds
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
- Injector: finished at 2011-03-28 23:37:20, elapsed: 00:00:07
+ Injector: finished at 2011-03-29 11:46:27, elapsed: 00:00:13
+ generate-fetch-updatedb-invertlinks-index-merge iteration 0:
-
- bin/nutch generate crawl/crawldb/0/0 crawl/segments -topN 10
+ bin/nutch generate crawl/crawldb/0 crawl/segments -topN 15
- Generator: starting at 2011-03-28 23:37:22 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 10 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: crawl/segments/20110328233727 Generator: finished at 2011-03-28 23:37:30, elapsed: 00:00:07
+ Generator: starting at 2011-03-29 11:46:31 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: topN: 15 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: crawl/segments/20110329114641 Generator: finished at 2011-03-29 11:46:45, elapsed: 00:00:13
- bin/nutch fetch crawl/segments/20110328233727
+ bin/nutch fetch crawl/segments/20110329114641
- Fetcher: starting at 2011-03-28 23:37:31
+ Fetcher: starting at 2011-03-29 11:46:49
- Fetcher: segment: crawl/segments/20110328233727
+ Fetcher: segment: crawl/segments/20110329114641
  Fetcher: threads: 10
- QueueFeeder finished: total 2 records + hit by time limit :0
+ QueueFeeder finished: total 15 records + hit by time limit :0
- fetching http://localhost:8080/qui/2.html
+ fetching http://simple.wikipedia.org/wiki/%C2%A3sd
+ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=14
+ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=14
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
- * queue: http://localhost
- maxThreads= 1
- inProgress= 0
- crawlDelay= 5000
- minCrawlDelay = 0
- nextFetchTime = 1301348260190
- now = 1301348255771
- 0. http://localhost:8080/qui/1.html
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
- * queue: http://localhost
- maxThreads= 1
- inProgress= 0
- crawlDelay= 5000
- minCrawlDelay = 0
- nextFetchTime = 1301348260190
- now = 1301348256777
- 0. http://localhost:8080/qui/1.html
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
- * queue: http://localhost
- maxThreads= 1
- inProgress= 0
- crawlDelay= 5000
- minCrawlDelay = 0
- nextFetchTime = 1301348260190
- now = 1301348257779
- 0. http://localhost:8080/qui/1.html
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
+ fetching http://simple.wikipedia.org/wiki/%2B44
- * queue: http://localhost
- maxThreads= 1
- inProgress= 0
- crawlDelay= 5000
- minCrawlDelay = 0
- nextFetchTime = 1301348260190
- now = 1301348258780
- 0.
Build failed in Jenkins: Nutch-trunk #1441
See https://hudson.apache.org/hudson/job/Nutch-trunk/1441/

--
[...truncated 1009 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A