Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

2011-03-28 Thread Gabriele Kahlout
On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi Gabriele


 you don't need to have 2 *and* 3. The hadoop commands will work on the
 local fs in a completely transparent way, it all depends on the way hadoop
 is configured. It isolates the way data are stored (local or distrib) from
 the client code, i.e. Nutch. By adding a separate script using fs, you'd add
 more confusion and lead beginners to think that they HAVE to use fs.


 I apologize for not having yet looked into hadoop in detail but I had
 understood that it would abstract over the single machine fs.


 No problems. It would be worth spending a bit of time reading about Hadoop
 if you want to get a better understanding of Nutch. Tom White's book is an
 excellent reference but the wikis and tutorials would be a good start



 However, to get up and running after downloading nutch will the script
 just work or will I have to configure hadoop? I assume the latter.


 Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
 for getting its inputs, so when you run it as you did what actually happens
 is that you are getting the data from the local FS via Hadoop.
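
 A minimal sketch of that transparency, assuming the default (local)
 filesystem and an existing crawl/crawldb directory: the hadoop wrapper and
 the plain command touch the same data,

     bin/hadoop dfs -ls crawl/crawldb    # goes through the Hadoop FileSystem API (local FS by default)
     ls crawl/crawldb                    # plain local command, same directory

 and only a conf/core-site.xml that points fs.default.name at an hdfs:// URI
 makes the same hadoop commands operate on HDFS instead.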


I'll look into it and update the script accordingly.



 From a beginner's perspective I like to reduce the magic (at first) and see
 through the commands, and get up and running asap.
 Hence 2. I'll be using 3.


 Hadoop already reduces the magic for you :-)


Okay, if so I'll put the equivalent unix commands (mv/rm) in the comments of
the hadoop cmds and get rid of 2.
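
For example, a sketch of that commenting convention (paths as used in the script):

    bin/hadoop dfs -rmr $it_seedsDir                       # local equivalent: rm -r $it_seedsDir
    bin/hadoop dfs -mkdir $it_crawldb                      # local equivalent: mkdir -p $it_crawldb
    bin/hadoop dfs -moveFromLocal urls $it_seedsDir/urls   # local equivalent: mv urls $it_seedsDir/urls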





 As for the legacy-lucene vs SOLR what about having a parameter to
 determine which one should be used and have a single script?


 Excellent idea. The default is solr for 1 and 3, but if one passes the parameter
 'll' it will use the legacy Lucene indexer. The default for 2 is ll since we want to
 get up and running fast (before knowing what Solr is and setting it up).


 It would be nice to have a third possible value (i.e. none) for the
 parameter -indexer (besides solr and lucene). A lot of people use Nutch as a
 crawling platform but do not do any indexing

 Agreed. Will add that too.
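
 A sketch of how such an -indexer parameter could be wired in (the variable
 names and defaults here are illustrative, not the script's final interface):

     indexer=solr                        # default for editions 1 and 3; edition 2 would default to ll
     case "$indexerArg" in
         ll)   indexer=lucene ;;
         none) indexer=none ;;
     esac

     if [[ "$indexer" == "solr" ]]; then
         bin/nutch solrindex $solrIndex $it_crawldb crawl/linkdb crawl/segments/*
     elif [[ "$indexer" == "lucene" ]]; then
         bin/nutch index crawl/indexes $it_crawldb crawl/linkdb crawl/segments/*
     fi                                  # indexer=none: crawl only, skip indexing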



 Why do you want to get the info about ALL the urls? There is a readdb
 -stats command which gives a summary of the content of the crawldb. If you
 need to check a particular URL or domain, just use readdb -url and readdb
 -regex (or whatever the name of the param is)
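
 For reference, the Nutch 1.x forms of those commands (the url is illustrative):

     bin/nutch readdb crawl/crawldb -stats                      # per-status counts for the whole crawldb
     bin/nutch readdb crawl/crawldb -url http://example.com/    # CrawlDatum (status, fetch time, ...) for one url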



 At least when debugging/troubleshooting I found it useful to see which
 urls were fetched and the responses (robot_blocked, etc.).
 I can do that by examining each $it_crawldb in turn, since I don't know when
 that url was fetched (although since the fetching is pretty linear I could
 also find out, something like index in seeds/urls / $it_size).


 better to do that by looking at the content of the segments using 'nutch
 readseg -dump' or using 'hadoop fs -libjars nutch.job
 segment/SEGMENTNUM/crawl_data' for instance. That's probably not something
 that most people will want to do so maybe comment it out in your script?
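
 A hedged example of the readseg route (segment name and flags are
 illustrative; -dump writes a plain-text report into the output directory):

     bin/nutch readseg -dump crawl/segments/20110328233727 seg_dump -nocontent -noparsetext -noparsedata
     less seg_dump/dump     # lists each url with its fetch status / protocol response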


 running hadoop in pseudo-distributed mode and looking at the hadoop web GUIs
 (http://localhost:50030) gives you a lot of information about your
 crawl
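
 (Assuming a separate Hadoop installation configured for pseudo-distributed
 mode; the ports below are that Hadoop generation's defaults:)

     bin/start-all.sh                # starts NameNode/DataNode/JobTracker/TaskTracker on one machine
     # http://localhost:50030/      JobTracker UI: job progress, map/reduce counters, task logs
     # http://localhost:50070/      NameNode UI: browse crawldb/segments stored on HDFS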

 It would definitely be better to have a single crawldb in your script.


agreed, maybe again an option and the default is none. But keep every
$it_crawldb instead of deleting and merging them.
I should be looking into the necessary Hadoop today and start updating the
script accordingly.
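
If the per-iteration crawldbs are kept, they could later be folded into a
single one with Nutch's CrawlDb merger, e.g. (a sketch matching the script's
crawl/crawldb/$j/$i layout; the output path is illustrative):

    bin/nutch mergedb crawl/merged_crawldb crawl/crawldb/*/*
    bin/nutch readdb crawl/merged_crawldb -stats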

Julien

 --
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com




-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


[Nutch Wiki] Update of Whole-Web Crawling incremental script by Gabriele Kahlout

2011-03-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Whole-Web Crawling incremental script page has been changed by Gabriele 
Kahlout.
http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script?action=diff&rev1=15&rev2=16

--

  
  === Script Editions: ===
   1. Abridged using Solr (tersest)
-  1. Unabridged with explanations and using nutch index (beginner)
+  1. Unabridged with explanations and using nutch index and local fs cmds (beginner)
-  1. TODO: Unabridged with explanations, using solr and Hadoop fs (most advanced)
+  1. Unabridged with explanations, using solr and Hadoop fs cmds (advanced)
  
  Please report any bug you find on the mailing list and to [[Gabriele 
Kahlout|me]].
  
- == 1. Abridged script using Solr ==
+ == 1. Abridged using Solr (tersest) ==
  {{{
  #!/bin/sh
  
@@ -83, +83 @@

  rm -r $it_seedsDir
  
  }}}
- == 2. Unabridged script with explanations and using nutch index ==
+ == 2. Unabridged with explanations and using nutch index and local fs cmds (beginner) ==
  
  {{{
  
@@ -223, +223 @@

  bin/nutch readdb $allcrawldb -stats
  }}}
  
+ == 3. Unabridged with explanations, using solr and Hadoop fs cmds (advanced) ==
+ {{{
+ #!/bin/sh
+ 
+ #
+ # Created by Gabriele Kahlout on 27.03.11.
+ # The following script crawls the whole-web incrementally: specifying a list of urls to crawl, nutch will continuously fetch $it_size urls from the specified list, index and merge them with our whole-web index, so that they can be immediately searched, until all urls have been fetched.
+ #
+ # TO USE:
+ # 1. $ mv whole-web-crawling-incremental $NUTCH_HOME/whole-web-crawling-incremental
+ # 2. $ cd $NUTCH_HOME
+ # 3. $ chmod +x whole-web-crawling-incremental
+ # 4. $ ./whole-web-crawling-incremental
+ 
+ # Usage: ./whole-web-crawling-incremental [it_seedsDir-path urls-to-fetch-per-iteration depth]
+ # Start
+ 
+ function echoThenRun () { # echo and then run the command
+   echo $1
+   $1
+   echo
+ }
+ 
+ echoThenRun "bin/hadoop dfs -rmr crawl" # fresh crawl
+ 
+ solrIndex="http://localhost:8080/solr"
+ echoThenRun "curl --fail $solrIndex/update?commit=true -d '<delete><query>*:*</query></delete>'" # empty the index
+ 
+ 
+ if [[ ! -d build ]]
+ then
+   echoThenRun ant
+ fi
+ 
+ seedsDir=seeds
+ if [[ "$1" != "" ]]
+ then
+   seedsDir=$1
+ fi
+ 
+ it_size=10
+ if [[ "$2" != "" ]]
+ then
+   it_size=$2
+ fi
+ 
+ indexedPlus1=1 # number of urls already indexed + 1 (the +1 because of tail -n+). Never printed out
+ it_seedsDir=$seedsDir/it_seeds
+ 
+ bin/hadoop dfs -rmr $it_seedsDir
+ bin/hadoop dfs -mkdir $it_seedsDir
+ bin/hadoop dfs -mkdir crawl/crawldb
+ rm $seedsDir/urls-local-only
+ 
+ echoThenRun "bin/hadoop dfs -get $seedsDir/*url* $seedsDir/urls-local-only"
+ 
+ allUrls=`cat $seedsDir/urls-local-only | wc -l | sed -e "s/^ *//"`
+ echo $allUrls urls to crawl
+ 
+ 
+ depth=1
+ if [[ "$3" != "" ]]
+ then
+   depth=$3
+ fi
+ 
+ j=0
+ while [[ $indexedPlus1 -le $allUrls ]] # repeat generate-fetch-updatedb-invertlinks-index-merge loop until all urls are fetched
+ do
+   bin/hadoop dfs -rm $it_seedsDir/urls
+   
+   tail -n+$indexedPlus1 $seedsDir/urls-local-only | head -n$it_size > $it_seedsDir/urls-local-only
+   bin/hadoop dfs -moveFromLocal $it_seedsDir/urls-local-only $it_seedsDir/urls
+   
+   it_crawldb=crawl/crawldb/$j/0
+   bin/hadoop dfs -mkdir $it_crawldb
+   
+   echo
+   echoThenRun "bin/nutch inject $it_crawldb $it_seedsDir"
+   i=0
+   
+   while [[ $i -lt $depth ]] # depth-first
+   do
+   echo "generate-fetch-updatedb-invertlinks-index-merge iteration $i:"
+   
+   it_crawldb=crawl/crawldb/$j/$i
+   
+   echo
+   cmd="bin/nutch generate $it_crawldb crawl/segments -topN $it_size"
+   echo $cmd
+   output=`$cmd`
+   echo $output
+   echo
+   if [[ $output == *'0 records selected for fetching'* ]] # all the urls of this iteration have been fetched
+   then
+   break;
+   fi
+   
+   echoThenRun "bin/nutch fetch crawl/segments/2*"
+ 
+   echoThenRun "bin/nutch updatedb $it_crawldb crawl/segments/2*"
+ 
+   echoThenRun "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
+ 
+ 
+   echoThenRun "bin/nutch solrindex $solrIndex $it_crawldb crawl/linkdb crawl/segments/*"
+ 
+   # you can now search the index with http://localhost:8080/solr/admin/ (if set up) or http://code.google.com/p/luke/ . The index is stored in $NUTCH_HOME/solr/data/index.
+   ((i++))
+   ((indexedPlus1+=$it_size)) # maybe should use readdb crawl/crawldb -stats to count the actually fetched urls, but if a url is never going to be fetched that would loop forever
+   echo
+   done
+ 
+   echoThenRun bin/nutch readdb 

[Nutch Wiki] Update of Incremental Crawling Scripts Test by Gabriele Kahlout

2011-03-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Incremental Crawling Scripts Test page has been changed by Gabriele 
Kahlout.
http://wiki.apache.org/nutch/Incremental%20Crawling%20Scripts%20Test?action=diff&rev1=3&rev2=4

--

- 1. Abridged script using Solr
+ == 1. ==
  {{{
  ./whole-web-crawling-incremental seeds 10 1
  rm: seeds/it_seeds/urls: No such file or directory
@@ -628, +628 @@

  
  }}}
  
- 2. Unabridged script with explanations and using nutch index:
+ == 2. ==
  
  {{{
  $ ./whole-web-crawling-incremental urls-input/MR6 5 2
@@ -797, +797 @@

  CrawlDb statistics: done
  }}}
  
+ == 3. ==
+ {{{
+ $ ./whole-web-crawling-incremental urls-input/MR6
+ bin/hadoop dfs -rmr crawl
+ Deleted file:/Users/simpatico/nutch-1.2/crawl
+ 
+ curl --fail http://localhost:8080/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
+ <?xml version="1.0" encoding="UTF-8"?>
+ <response>
+ <lst name="responseHeader"><int name="status">0</int><int name="QTime">8</int></lst>
+ </response>
+ 
+ rmr: cannot remove urls-input/MR6/it_seeds: No such file or directory.
+ bin/hadoop dfs -get urls-input/MR6/2urls urls-input/MR6/urls-local-only
+ 
+ 2 urls to crawl
+ rm: cannot remove urls-input/MR6/it_seeds/urls: No such file or directory.
+ 
+ bin/nutch inject crawl/crawldb/0/0 urls-input/MR6/it_seeds
+ Injector: starting at 2011-03-28 23:37:13
+ Injector: crawlDb: crawl/crawldb/0/0
+ Injector: urlDir: urls-input/MR6/it_seeds
+ Injector: Converting injected urls to crawl db entries.
+ Injector: Merging injected urls into crawl db.
+ Injector: finished at 2011-03-28 23:37:20, elapsed: 00:00:07
+ 
+ generate-fetch-updatedb-invertlinks-index-merge iteration 0:
+ 
+ bin/nutch generate crawl/crawldb/0/0 crawl/segments -topN 10
+ Generator: starting at 2011-03-28 23:37:22
+ Generator: Selecting best-scoring urls due for fetch.
+ Generator: filtering: true
+ Generator: normalizing: true
+ Generator: topN: 10
+ Generator: jobtracker is 'local', generating exactly one partition.
+ Generator: Partitioning selected urls for politeness.
+ Generator: segment: crawl/segments/20110328233727
+ Generator: finished at 2011-03-28 23:37:30, elapsed: 00:00:07
+ 
+ bin/nutch fetch crawl/segments/20110328233727
+ Fetcher: starting at 2011-03-28 23:37:31
+ Fetcher: segment: crawl/segments/20110328233727
+ Fetcher: threads: 10
+ QueueFeeder finished: total 2 records + hit by time limit :0
+ fetching http://localhost:8080/qui/2.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads= 1
+   inProgress= 0
+   crawlDelay= 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now   = 1301348255771
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads= 1
+   inProgress= 0
+   crawlDelay= 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now   = 1301348256777
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads= 1
+   inProgress= 0
+   crawlDelay= 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now   = 1301348257779
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads= 1
+   inProgress= 0
+   crawlDelay= 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now   = 1301348258780
+   0. http://localhost:8080/qui/1.html
+ -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
+ * queue: http://localhost
+   maxThreads= 1
+   inProgress= 0
+   crawlDelay= 5000
+   minCrawlDelay = 0
+   nextFetchTime = 1301348260190
+   now   = 1301348259783
+   0. http://localhost:8080/qui/1.html
+ fetching http://localhost:8080/qui/1.html
+ -finishing thread FetcherThread, activeThreads=9
+ -finishing thread FetcherThread, activeThreads=8
+ -finishing thread FetcherThread, activeThreads=7
+ -finishing thread FetcherThread, activeThreads=6
+ -finishing thread FetcherThread, activeThreads=5
+ -finishing thread FetcherThread, activeThreads=3
+ -finishing thread FetcherThread, activeThreads=3
+ -finishing thread FetcherThread, activeThreads=2
+ -finishing thread FetcherThread, activeThreads=1
+ -finishing thread FetcherThread, activeThreads=0
+ -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
+ -activeThreads=0
+ Fetcher: finished at 2011-03-28 23:37:41, elapsed: 00:00:10
+ 
+ bin/nutch updatedb crawl/crawldb/0/0 crawl/segments/20110328233727
+ CrawlDb update: starting at 2011-03-28 23:37:43
+ CrawlDb update: db: crawl/crawldb/0/0
+ CrawlDb update: segments: [crawl/segments/20110328233727]
+ CrawlDb update: additions allowed: true
+ CrawlDb update: URL normalizing: false
+ CrawlDb update: URL filtering: false
+ CrawlDb 

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

2011-03-28 Thread Gabriele Kahlout
K, I've hadoopized the script, though I've tried it only locally.
I rethought it (laziness convinced me) and decided not to include the indexer parameter.



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Build failed in Jenkins: Nutch-trunk #1440

2011-03-28 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1440/

--
[...truncated 1009 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A