Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling
On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi Gabriele,
>
> You don't need to have 2 *and* 3. The hadoop commands will work on the
> local fs in a completely transparent way; it all depends on the way hadoop
> is configured. It isolates the way data are stored (local or distributed)
> from the client code, i.e. Nutch. By adding a separate script using fs,
> you'd add more confusion and lead beginners to think that they HAVE to
> use fs.
>
>> I apologize for not having yet looked into hadoop in detail, but I had
>> understood that it would abstract over the single-machine fs.
>
> No problem. It would be worth spending a bit of time reading about Hadoop
> if you want to get a better understanding of Nutch. Tom White's book is an
> excellent reference, but the wikis and tutorials would be a good start.
>
>> However, to get up and running after downloading nutch, will the script
>> just work or will I have to configure hadoop? I assume the latter.
>
> Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
> for getting its inputs, so when you run it as you did, what actually
> happens is that you are getting the data from the local FS via Hadoop.

I'll look into it and update the script accordingly.

>> From a beginner's perspective I like to reduce the magic (at first), see
>> through the commands, and get up and running asap. Hence 2. I'll be using 3.
>
> Hadoop already reduces the magic for you :-)

Okay; if so, I'll put the equivalent unix commands (mv/rm) in the comments of the hadoop cmds and get rid of 2.

> As for legacy-lucene vs. SOLR, what about having a parameter to determine
> which one should be used, and having a single script?

Excellent idea. The default is solr for 1 and 3, but if one passes the parameter 'll' it will use the legacy lucene. The default for 2 is ll, since we want to get up and running fast (before knowing what solr is and setting it up).

> It would be nice to have a third possible value (i.e. none) for the
> -indexer parameter (besides solr and lucene). A lot of people use Nutch
> as a crawling platform but do not do any indexing.

Agreed. Will add that too.

> Why do you want to get the info about ALL the urls? There is a readdb
> -stats command which gives a summary of the content of the crawldb. If you
> need to check a particular URL or domain, just use readdb -url and
> readdb -regex (or whatever the name of the param is).

At least when debugging/troubleshooting I found it useful to see which urls were fetched and the responses (robot_blocked, etc.). I can do that by examining each $it_crawldb in turn, since I don't know when a given url was fetched (although, since the fetching is pretty linear, I could also find out: sth. like its index in seeds/urls divided by $it_size).

> Better to do that by looking at the content of the segments using 'nutch
> readseg -dump' or using 'hadoop fs -libjars nutch.job
> segment/SEGMENTNUM/crawl_data', for instance. That's probably not
> something that most people will want to do, so maybe comment it out in
> your script? Running hadoop in pseudo-distributed mode and looking at the
> hadoop web guis (http://localhost:50030) gives you a lot of information
> about your crawl.
>
> It would definitely be better to have a single crawldb in your script.

Agreed; maybe again an option, and the default is none. But keep every $it_crawldb instead of deleting and merging them. I should be looking into the necessary Hadoop today and start updating the script accordingly.

> Julien
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours, then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).
If an email is sent by a sender that is not a trusted contact, or the email does not contain a valid code, then the email is not received. A valid code starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
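The -indexer parameter discussed in the thread (solr, legacy lucene, or none) could be handled with a simple case dispatch. A minimal sketch, not the actual wiki script: the argument handling, variable names, and the exact nutch sub-commands shown are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the -indexer parameter discussed above:
# 'solr' (default), 'll' (legacy lucene), or 'none' (crawl only, no indexing).
indexer=${1:-solr}

case $indexer in
  solr) index_cmd="bin/nutch solrindex http://localhost:8080/solr crawldb linkdb segments/*" ;;
  ll)   index_cmd="bin/nutch index indexes crawldb linkdb segments/*" ;;
  none) index_cmd="" ;;   # crawling platform only, skip indexing
  *)    echo "unknown indexer: $indexer" >&2; exit 1 ;;
esac

echo "indexer=$indexer"
[ -n "$index_cmd" ] && echo "would run: $index_cmd"
```

With no argument the script falls through to solr; passing `none` leaves `$index_cmd` empty so the indexing step is skipped entirely, which matches the "crawling platform only" use case Julien mentions.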
[Nutch Wiki] Update of "Whole-Web Crawling incremental script" by Gabriele Kahlout
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The "Whole-Web Crawling incremental script" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script?action=diff&rev1=15&rev2=16

--------------------------------------------------

  === Script Editions: ===
   1. Abridged using Solr (tersest)
-  1. Unabridged with explanations and using nutch index (beginner)
+  1. Unabridged with explanations and using nutch index and local fs cmds (beginner)
-  1. TODO: Unabridged with explanations, using solr and Hadoop fs (most advanced)
+  1. Unabridged with explanations, using solr and Hadoop fs cmds (advanced)

  Please report any bug you find on the mailing list and to [[Gabriele Kahlout|me]].

- == 1. Abridged script using Solr ==
+ == 1. Abridged using Solr (tersest) ==
  {{{
  #!/bin/sh
@@ -83, +83 @@
  rm -r $it_seedsDir
  }}}
- == 2. Unabridged script with explanations and using nutch index ==
+ == 2. Unabridged with explanations and using nutch index and local fs cmds (beginner) ==
  {{{
@@ -223, +223 @@
  bin/nutch readdb $allcrawldb -stats
  }}}
+ == 3. Unabridged with explanations, using solr and Hadoop fs cmds (advanced) ==
+ {{{
+ #!/bin/sh
+
+ #
+ # Created by Gabriele Kahlout on 27.03.11.
+ # The following script crawls the whole web incrementally: given a list of urls to crawl, nutch will repeatedly fetch $it_size urls from the list, index and merge them with our whole-web index, so that they can be immediately searched, until all urls have been fetched.
+ #
+ # TO USE:
+ # 1. $ mv whole-web-crawling-incremental $NUTCH_HOME/whole-web-crawling-incremental
+ # 2. $ cd $NUTCH_HOME
+ # 3. $ chmod +x whole-web-crawling-incremental
+ # 4. $ ./whole-web-crawling-incremental
+
+ # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
+ # Start
+
+ function echoThenRun () { # echo and then run the command
+   echo $1
+   $1
+   echo
+ }
+
+ echoThenRun "bin/hadoop dfs -rmr crawl" # fresh crawl
+
+ solrIndex=http://localhost:8080/solr
+ echoThenRun "curl --fail $solrIndex/update?commit=true -d '<delete><query>*:*</query></delete>'" # empty the index
+
+ if [[ ! -d build ]]
+ then
+   echoThenRun ant
+ fi
+
+ seedsDir=seeds
+ if [[ $1 != "" ]]
+ then
+   seedsDir=$1
+ fi
+
+ it_size=10
+ if [[ $2 != "" ]]
+ then
+   it_size=$2
+ fi
+
+ indexedPlus1=1 # indexed urls + 1, because of tail. Never printed out
+ it_seedsDir=$seedsDir/it_seeds
+
+ bin/hadoop dfs -rmr $it_seedsDir
+ bin/hadoop dfs -mkdir $it_seedsDir
+ bin/hadoop dfs -mkdir crawl/crawldb
+ rm $seedsDir/urls-local-only
+
+ echoThenRun "bin/hadoop dfs -get $seedsDir/*url* $seedsDir/urls-local-only"
+
+ allUrls=`cat $seedsDir/urls-local-only | wc -l | sed -e 's/^ *//'`
+ echo "$allUrls urls to crawl"
+
+ depth=1
+ if [[ $3 != "" ]]
+ then
+   depth=$3
+ fi
+
+ j=0
+ while [[ $indexedPlus1 -le $allUrls ]] # repeat generate-fetch-updatedb-invertlinks-index-merge loop until all urls are fetched
+ do
+   bin/hadoop dfs -rm $it_seedsDir/urls
+
+   tail -n+$indexedPlus1 $seedsDir/urls-local-only | head -n$it_size > $it_seedsDir/urls-local-only
+   bin/hadoop dfs -moveFromLocal $it_seedsDir/urls-local-only $it_seedsDir/urls
+
+   it_crawldb=crawl/crawldb/$j/0
+   bin/hadoop dfs -mkdir $it_crawldb
+
+   echo
+   echoThenRun "bin/nutch inject $it_crawldb $it_seedsDir"
+   i=0
+
+   while [[ $i -lt $depth ]] # depth-first
+   do
+     echo "generate-fetch-updatedb-invertlinks-index-merge iteration $i:"
+
+     it_crawldb=crawl/crawldb/$j/$i
+
+     echo
+     cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
+     echo $cmd
+     output=`$cmd`
+     echo $output
+     echo
+     if [[ $output == *'0 records selected for fetching'* ]] # all the urls of this iteration have been fetched
+     then
+       break;
+     fi
+
+     echoThenRun "bin/nutch fetch crawl/segments/2*"
+
+     echoThenRun "bin/nutch updatedb $it_crawldb crawl/segments/2*"
+
+     echoThenRun "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
+
+     echoThenRun "bin/nutch solrindex $solrIndex $it_crawldb crawl/linkdb crawl/segments/*"
+
+     # you can now search the index with http://localhost:8080/solr/admin/ (if set up) or http://code.google.com/p/luke/ . The index is stored in $NUTCH_HOME/solr/data/index.
+     ((i++))
+     ((indexedPlus1+=$it_size)) # maybe should readdb crawl/crawldb -stats for the number actually fetched, but (not going to fetch a page) -> infinite loop
+     echo
+   done
+
+   echoThenRun bin/nutch readdb
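The windowing of the seed list in the script above (`tail -n+$indexedPlus1 | head -n$it_size`) can be tried in isolation. A self-contained demo against a throwaway seeds file (the file contents here are made up for illustration):

```shell
# Demo of the script's tail/head windowing over a seeds file.
# indexedPlus1 is 1-based: tail -n+$indexedPlus1 drops the already-fetched
# urls, and head -n$it_size keeps only the next batch.
seeds=$(mktemp)
printf 'url1\nurl2\nurl3\nurl4\nurl5\n' > "$seeds"

it_size=2
indexedPlus1=3   # urls 1 and 2 were fetched in earlier iterations

batch=$(tail -n+$indexedPlus1 "$seeds" | head -n$it_size)
echo "$batch"    # the next batch: url3 and url4
rm -f "$seeds"
```

This is also why the variable is "indexed urls + 1, because of tail": `tail -n+N` starts *at* line N, so the counter must point one past the last fetched url.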
[Nutch Wiki] Update of "Incremental Crawling Scripts Test" by Gabriele Kahlout
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The "Incremental Crawling Scripts Test" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Incremental%20Crawling%20Scripts%20Test?action=diff&rev1=3&rev2=4

--------------------------------------------------

- 1. Abridged script using Solr
+ == 1. ==
  {{{
  ./whole-web-crawling-incremental seeds 10 1
  rm: seeds/it_seeds/urls: No such file or directory
@@ -628, +628 @@
  }}}
- 2. Unabridged script with explanations and using nutch index:
+ == 2. ==
  {{{
  $ ./whole-web-crawling-incremental urls-input/MR6 5 2
@@ -797, +797 @@
  CrawlDb statistics: done
  }}}
+ == 3. ==
+ {{{
+ $ ./whole-web-crawling-incremental urls-input/MR6
+ bin/hadoop dfs -rmr crawl
+ Deleted file:/Users/simpatico/nutch-1.2/crawl
+
+ curl --fail http://localhost:8080/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
+ <?xml version="1.0" encoding="UTF-8"?>
+ <response>
+ <lst name="responseHeader"><int name="status">0</int><int name="QTime">8</int></lst>
+ </response>
+
+ rmr: cannot remove urls-input/MR6/it_seeds: No such file or directory.
+ bin/hadoop dfs -get urls-input/MR6/2urls urls-input/MR6/urls-local-only
+
+ 2 urls to crawl
+ rm: cannot remove urls-input/MR6/it_seeds/urls: No such file or directory.
+
+ bin/nutch inject crawl/crawldb/0/0 urls-input/MR6/it_seeds
+ Injector: starting at 2011-03-28 23:37:13
+ Injector: crawlDb: crawl/crawldb/0/0
+ Injector: urlDir: urls-input/MR6/it_seeds
+ Injector: Converting injected urls to crawl db entries.
+ Injector: Merging injected urls into crawl db.
+ Injector: finished at 2011-03-28 23:37:20, elapsed: 00:00:07
+
+ generate-fetch-updatedb-invertlinks-index-merge iteration 0:
+
+ bin/nutch generate crawl/crawldb/0/0 crawl/segments -topN 10
+ Generator: starting at 2011-03-28 23:37:22
+ Generator: Selecting best-scoring urls due for fetch.
+ Generator: filtering: true
+ Generator: normalizing: true
+ Generator: topN: 10
+ Generator: jobtracker is 'local', generating exactly one partition.
+ Generator: Partitioning selected urls for politeness.
+ Generator: segment: crawl/segments/20110328233727
+ Generator: finished at 2011-03-28 23:37:30, elapsed: 00:00:07
+
+ bin/nutch fetch crawl/segments/20110328233727
+ Fetcher: starting at 2011-03-28 23:37:31
+ Fetcher: segment: crawl/segments/20110328233727
+ Fetcher: threads: 10
+ QueueFeeder finished: total 2 records
+ hit by time limit :0
+ fetching http://localhost:8080/qui/2.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads    = 1
+   inProgress    = 0
+   crawlDelay    = 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now           = 1301348255771
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads    = 1
+   inProgress    = 0
+   crawlDelay    = 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now           = 1301348256777
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads    = 1
+   inProgress    = 0
+   crawlDelay    = 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now           = 1301348257779
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads    = 1
+   inProgress    = 0
+   crawlDelay    = 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now           = 1301348258780
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads    = 1
+   inProgress    = 0
+   crawlDelay    = 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now           = 1301348259783
+   0. http://localhost:8080/qui/1.html
+ fetching http://localhost:8080/qui/1.html
+ -finishing thread FetcherThread, activeThreads=9
+ -finishing thread FetcherThread, activeThreads=8
+ -finishing thread FetcherThread, activeThreads=7
+ -finishing thread FetcherThread, activeThreads=6
+ -finishing thread FetcherThread, activeThreads=5
+ -finishing thread FetcherThread, activeThreads=3
+ -finishing thread FetcherThread, activeThreads=3
+ -finishing thread FetcherThread, activeThreads=2
+ -finishing thread FetcherThread, activeThreads=1
+ -finishing thread FetcherThread, activeThreads=0
+ -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
+ -activeThreads=0
+ Fetcher: finished at 2011-03-28 23:37:41, elapsed: 00:00:10
+
+ bin/nutch updatedb crawl/crawldb/0/0 crawl/segments/20110328233727
+ CrawlDb update: starting at 2011-03-28 23:37:43
+ CrawlDb update: db: crawl/crawldb/0/0
+ CrawlDb update: segments: [crawl/segments/20110328233727]
+ CrawlDb update: additions allowed: true
+ CrawlDb update: URL normalizing: false
+ CrawlDb update: URL filtering: false
+ CrawlDb
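The thread above notes that, since fetching proceeds linearly through the seed list, one can map a url's position in seeds/urls to the iteration (and thus the $it_crawldb) that fetched it, roughly index divided by $it_size. A minimal sketch of that arithmetic; the concrete numbers are hypothetical:

```shell
# Map a seed url's 1-based position to the 0-based iteration that fetched it,
# assuming the strictly linear fetch order described in the thread.
url_index=7      # hypothetical: the 7th url in seeds/urls
it_size=5        # urls fetched per iteration

iteration=$(( (url_index - 1) / it_size ))
echo "url $url_index was fetched in iteration $iteration"   # e.g. crawl/crawldb/$iteration/...
```

The `- 1` converts the 1-based line number to a 0-based offset before the integer division, so urls 1-5 map to iteration 0, urls 6-10 to iteration 1, and so on.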
Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling
K, hadoopized the script, though I've tried it only locally. I rethought (laziness convinced me) not to include the indexer parameter.

On Mon, Mar 28, 2011 at 10:50 AM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:
[quoted thread as above]

--
Regards,
K. Gabriele
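One detail worth noting about the hadoopized script: its echoThenRun helper passes the whole command as a single word-split string (`echo $1; $1`), which breaks as soon as an argument contains spaces or shell quoting. A sketch of a more robust variant using `"$@"` (an alternative, not what the wiki script does):

```shell
# Variant of the script's echoThenRun that takes the command as separate
# arguments ("$@"), so argument boundaries are preserved when it is re-run.
echoThenRun () {
  echo "$@"    # show the command about to run
  "$@"         # run it with each argument intact
  echo
}

echoThenRun echo "two words"   # echoes the command line, then runs it
```

Callers then write `echoThenRun bin/nutch inject $it_crawldb $it_seedsDir` without wrapping the command in quotes, and arguments such as quoted curl payloads survive intact.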
Build failed in Jenkins: Nutch-trunk #1440
See https://hudson.apache.org/hudson/job/Nutch-trunk/1440/

[...truncated 1009 lines...]