[Nutch Wiki] Update of Whole-Web Crawling incremental script by Gabriele Kahlout

2011-03-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Whole-Web Crawling incremental script page has been changed by Gabriele 
Kahlout.
The comment on this change is: fixed a bug with depth, simplified, and initial 
support (untested) for resuming crawls.
http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script?action=diff&rev1=16&rev2=17

--

  
  while [[ $i -lt $depth ]]
  do
+ echo
+ echo generate-fetch-updatedb-invertlinks-index-merge iteration $i:
  cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
  output=`$cmd`
  if [[ $output == *'0 records selected for fetching'* ]]
@@ -289, +291 @@

depth=$3
  fi
  
+ 
  j=0
  while [[ $indexedPlus1 -le $allUrls ]] #repeat generate-fetch-updatedb-invertlinks-index-merge loop until all urls are fetched
  do
@@ -296, +299 @@


tail -n+$indexedPlus1 $seedsDir/urls-local-only | head -n$it_size > $it_seedsDir/urls-local-only
bin/hadoop dfs -moveFromLocal $it_seedsDir/urls-local-only $it_seedsDir/urls
-   
+   
-   it_crawldb=crawl/crawldb/$j/0
+   it_crawldb=crawl/crawldb/$j
+   if [[ -d $it_crawldb ]] # resuming crawl
+   then
+   bin/hadoop dfs -rmr crawl/segments #should be empty if we indexed them - atomicity at $j level.
+   else
-   bin/hadoop dfs -mkdir $it_crawldb
+   bin/hadoop dfs -mkdir $it_crawldb
-   
-   echo
+   echo
-   echoThenRun bin/nutch inject $it_crawldb $it_seedsDir
+   echoThenRun bin/nutch inject $it_crawldb $it_seedsDir
-   i=0
-   
+   fi
+ 
+ i=0
while [[ $i -lt $depth ]] # depth-first
-   do
+   do  
-   echo generate-fetch-updatedb-invertlinks-index-merge iteration $i:
-   
-   it_crawldb=crawl/crawldb/$j/$i
-   
echo
cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
echo $cmd
@@ -327, +330 @@

  
echoThenRun bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  
- 
echoThenRun bin/nutch solrindex $solrIndex $it_crawldb crawl/linkdb crawl/segments/*
  
-   # you can now search the index with http://localhost:8080/solr/admin/ (if setup) or http://code.google.com/p/luke/ . The index is stored in $NUTCH_HOME/solr/data/index.
+   # you can now search the index with http://localhost:8080/solr/admin/ (if setup) or http://code.google.com/p/luke/ . The index is stored in crawl/indexes, while if Solr is used then in $NUTCH_HOME/solr/data/index.
+   
+   bin/hadoop dfs -rmr crawl/segments/2*
((i++))
((indexedPlus1+=$it_size)) # maybe should readdb crawl/crawldb -stats number of actually fetched, but (! going to fetch a page) -- infinite loop
echo
done
  
echoThenRun bin/nutch readdb $it_crawldb -stats
- 
-   allcrawldb=crawl/allcrawldb
-   temp_crawldb=crawl/temp_crawldb
-   merge_dbs="$it_crawldb $allcrawldb"
- 
-   # work-around for https://issues.apache.org/jira/browse/NUTCH-972 (Patch available)
-   if [[ ! -d $allcrawldb ]]
-   then
-   merge_dbs=$it_crawldb
-   fi
- 
-   #echoThenRun bin/nutch mergedb $temp_crawldb $merge_dbs
- 
-   #rm -r $allcrawldb $it_crawldb crawl/segments crawl/linkdb
-   #mv $temp_crawldb $allcrawldb
((j++))
  done
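
The resume check introduced in the diff above can be tried in isolation. What follows is a minimal sketch, assuming a local filesystem instead of HDFS. `prepare_batch` is a hypothetical helper name, and the `echo` lines are stand-ins for the `bin/hadoop dfs` and `bin/nutch inject` commands of the real script, which are not run here.

```shell
#!/usr/bin/env bash
# Sketch only: decide, per seed batch, whether to resume or start fresh.
# prepare_batch is a hypothetical helper, not part of the wiki script.
# Local rm/mkdir stand in for "bin/hadoop dfs -rmr" and "bin/hadoop dfs -mkdir".
prepare_batch() {
  crawl_root=$1   # e.g. "crawl"
  j=$2            # index of the current seed batch
  it_crawldb="$crawl_root/crawldb/$j"

  if [ -d "$it_crawldb" ]; then
    # The crawldb for batch $j already exists, so an earlier run was
    # interrupted: discard leftover segments and reuse the crawldb.
    rm -rf "$crawl_root/segments"
    echo "resume $it_crawldb"
  else
    # First visit of batch $j: create the crawldb; the real script would
    # then run "bin/nutch inject" against it.
    mkdir -p "$it_crawldb"
    echo "inject $it_crawldb"
  fi
}
```

Run in an empty working directory, `prepare_batch crawl 0` prints `inject crawl/crawldb/0` on the first call and `resume crawl/crawldb/0` on the second, which mirrors the "atomicity at $j level" remark in the diff: a batch is either fully indexed or redone from its crawldb.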
  


[Nutch Wiki] Update of FabioGiavazzi by FabioGiavazzi

2011-03-29 Thread Apache Wiki
The FabioGiavazzi page has been changed by FabioGiavazzi.
http://wiki.apache.org/nutch/FabioGiavazzi

--

New page:
##language:en
Fabio Giavazzi

Email: -

...


CategoryHomepage


[Nutch Wiki] Update of Incremental Crawling Scripts Test by Gabriele Kahlout

2011-03-29 Thread Apache Wiki
The Incremental Crawling Scripts Test page has been changed by Gabriele 
Kahlout.
The comment on this change is: updated output.
http://wiki.apache.org/nutch/Incremental%20Crawling%20Scripts%20Test?action=diff&rev1=4&rev2=5

--

  
  == 3. ==
  {{{
- $ ./whole-web-crawling-incremental urls-input/MR6
+ $ ./whole-web-crawling-incremental -i 15 -d 2 seeds/MR6
- bin/hadoop dfs -rmr crawl
- Deleted file:/Users/simpatico/nutch-1.2/crawl
- 
- curl --fail http://localhost:8080/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
- <?xml version="1.0" encoding="UTF-8"?>
- <response>
- <lst name="responseHeader"><int name="status">0</int><int name="QTime">8</int></lst>
- </response>
- 
- rmr: cannot remove urls-input/MR6/it_seeds: No such file or directory.
+ rmr: cannot remove seeds/MR6/it_seeds: No such file or directory.
- bin/hadoop dfs -get urls-input/MR6/2urls urls-input/MR6/urls-local-only
+ bin/hadoop dfs -get seeds/MR6/20simple-urls seeds/MR6/urls-local-only
  
- 2 urls to crawl
+ 20 urls to crawl
- rm: cannot remove urls-input/MR6/it_seeds/urls: No such file or directory.
+ rm: cannot remove seeds/MR6/it_seeds/urls: No such file or directory.
  
- bin/nutch inject crawl/crawldb/0/0 urls-input/MR6/it_seeds
+ bin/nutch inject crawl/crawldb/0 seeds/MR6/it_seeds
- Injector: starting at 2011-03-28 23:37:13
+ Injector: starting at 2011-03-29 11:46:14
- Injector: crawlDb: crawl/crawldb/0/0
+ Injector: crawlDb: crawl/crawldb/0
- Injector: urlDir: urls-input/MR6/it_seeds
+ Injector: urlDir: seeds/MR6/it_seeds
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
- Injector: finished at 2011-03-28 23:37:20, elapsed: 00:00:07
+ Injector: finished at 2011-03-29 11:46:27, elapsed: 00:00:13
+ 
  
  generate-fetch-updatedb-invertlinks-index-merge iteration 0:
- 
- bin/nutch generate crawl/crawldb/0/0 crawl/segments -topN 10
+ bin/nutch generate crawl/crawldb/0 crawl/segments -topN 15
- Generator: starting at 2011-03-28 23:37:22 Generator: Selecting best-scoring 
urls due for fetch. Generator: filtering: true Generator: normalizing: true 
Generator: topN: 10 Generator: jobtracker is 'local', generating exactly one 
partition. Generator: Partitioning selected urls for politeness. Generator: 
segment: crawl/segments/20110328233727 Generator: finished at 2011-03-28 
23:37:30, elapsed: 00:00:07
+ Generator: starting at 2011-03-29 11:46:31 Generator: Selecting best-scoring 
urls due for fetch. Generator: filtering: true Generator: normalizing: true 
Generator: topN: 15 Generator: jobtracker is 'local', generating exactly one 
partition. Generator: Partitioning selected urls for politeness. Generator: 
segment: crawl/segments/20110329114641 Generator: finished at 2011-03-29 
11:46:45, elapsed: 00:00:13
  
- bin/nutch fetch crawl/segments/20110328233727
+ bin/nutch fetch crawl/segments/20110329114641
- Fetcher: starting at 2011-03-28 23:37:31
+ Fetcher: starting at 2011-03-29 11:46:49
- Fetcher: segment: crawl/segments/20110328233727
+ Fetcher: segment: crawl/segments/20110329114641
  Fetcher: threads: 10
- QueueFeeder finished: total 2 records + hit by time limit :0
+ QueueFeeder finished: total 15 records + hit by time limit :0
- fetching http://localhost:8080/qui/2.html
+ fetching http://simple.wikipedia.org/wiki/%C2%A3sd
+ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=14
+ -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=14
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
- * queue: http://localhost
-   maxThreads= 1
-   inProgress= 0
-   crawlDelay= 5000
-   minCrawlDelay = 0
-   nextFetchTime = 1301348260190
-   now   = 1301348255771
-   0. http://localhost:8080/qui/1.html
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
- * queue: http://localhost
-   maxThreads= 1
-   inProgress= 0
-   crawlDelay= 5000
-   minCrawlDelay = 0
-   nextFetchTime = 1301348260190
-   now   = 1301348256777
-   0. http://localhost:8080/qui/1.html
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
- * queue: http://localhost
-   maxThreads= 1
-   inProgress= 0
-   crawlDelay= 5000
-   minCrawlDelay = 0
-   nextFetchTime = 1301348260190
-   now   = 1301348257779
-   0. http://localhost:8080/qui/1.html
- -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
+ fetching http://simple.wikipedia.org/wiki/%2B44
- * queue: http://localhost
-   maxThreads= 1
-   inProgress= 0
-   crawlDelay= 5000
-   minCrawlDelay = 0
-   nextFetchTime = 1301348260190
-   now   = 1301348258780
-   0. 
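
The updated invocation in this test log passes `-i 15 -d 2` before the seeds directory. A minimal sketch of how such flags could be parsed with `getopts` follows. It assumes that `-i` is the per-iteration batch size (the log shows `-topN 15` for `-i 15`) and that `-d` is the crawl depth; `parse_args` is a hypothetical name, and the defaults are assumptions (10 matches the `-topN 10` of the older, flag-less run; the depth default is a guess).

```shell
#!/usr/bin/env bash
# Sketch only: parse "[-i size] [-d depth] seedsDir".
# The meaning of -i and -d and both defaults are assumptions inferred
# from the test log, not taken from the wiki script itself.
parse_args() {
  OPTIND=1      # reset so the function can be called more than once
  it_size=10    # assumed default batch size
  depth=1       # assumed default depth

  while getopts "i:d:" opt; do
    case $opt in
      i) it_size=$OPTARG ;;
      d) depth=$OPTARG ;;
      *) echo "usage: [-i size] [-d depth] seedsDir" >&2; return 1 ;;
    esac
  done
  shift $((OPTIND - 1))

  # Exactly the seeds directory should remain after the options.
  if [ $# -lt 1 ]; then
    echo "usage: [-i size] [-d depth] seedsDir" >&2
    return 1
  fi
  seedsDir=$1
}
```

With this sketch, `parse_args -i 15 -d 2 seeds/MR6` leaves `it_size=15`, `depth=2` and `seedsDir=seeds/MR6`.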

Build failed in Jenkins: Nutch-trunk #1441

2011-03-29 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1441/

--
[...truncated 1009 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A