[jira] Commented: (NUTCH-872) Change the default fetcher.parse to FALSE

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008397#comment-13008397
 ] 

Markus Jelsma commented on NUTCH-872:
-

To all: Andrzej has committed this to 1.3 as well in r1079746 at 2011-03-09.

 Change the default fetcher.parse to FALSE
 -

 Key: NUTCH-872
 URL: https://issues.apache.org/jira/browse/NUTCH-872
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.2, 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 I propose to change this property to false. The reason is that it's a safer 
 default - parsing issues don't lead to a loss of the downloaded content. For 
 larger crawls this is the recommended way to run Fetcher. Users that run 
 smaller crawls can still override it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008401#comment-13008401
 ] 

Markus Jelsma commented on NUTCH-958:
-

Hi Claudio. Is this the desired behaviour? Shouldn't the default be used as a 
fallback if the negotiated scheme fails, instead of forcing the default as the 
only scheme?

 Httpclient scheme priority order fix
 

 Key: NUTCH-958
 URL: https://issues.apache.org/jira/browse/NUTCH-958
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
 Fix For: 1.3

 Attachments: httpclient.diff


 Httpclient will try to authenticate in this order by default: ntlm, digest, 
 basic.
 If you set as default a scheme that comes in this list after a scheme that is 
 negotiated by the server, and this authentication fails, the default scheme 
 will not be tried.
 I.e. if you set digest as default scheme but the server negotiates ntlm, the 
 client will still try ntlm and fail.
 The fix sets the default scheme as the only possible scheme for 
 authentication for the given realm by setting the authentication priorities 
 of httpclient.
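
 The selection behaviour described above can be modelled in a few lines. This 
 is a minimal, self-contained sketch; it is neither the code from 
 httpclient.diff nor the Commons HttpClient API. The class AuthSchemeChooser 
 and its choose method are hypothetical names used only for illustration, and 
 the sketch assumes the client simply picks the first scheme in its priority 
 list that the server also offers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of priority-based scheme selection; not the httpclient API.
final class AuthSchemeChooser {

  // Default priority order as described in the issue: ntlm, digest, basic.
  static final List<String> DEFAULT_PRIORITY = Arrays.asList("ntlm", "digest", "basic");

  // Returns the first scheme in priority order that the server also offers,
  // or null when the two lists have no scheme in common.
  static String choose(List<String> priority, List<String> offeredByServer) {
    for (String scheme : priority) {
      if (offeredByServer.contains(scheme)) {
        return scheme;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<String> offered = Arrays.asList("ntlm", "digest");
    // With the default order, ntlm is chosen even if digest is the configured default.
    System.out.println(choose(DEFAULT_PRIORITY, offered)); // prints "ntlm"
    // Restricting the priority list to the configured default forces digest.
    System.out.println(choose(Collections.singletonList("digest"), offered)); // prints "digest"
  }
}
```

 In Commons HttpClient 3.x the priority list is, as far as I know, set through 
 the AuthPolicy.AUTH_SCHEME_PRIORITY parameter; whether the attached patch 
 uses exactly that parameter is an assumption.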



[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008402#comment-13008402
 ] 

Markus Jelsma commented on NUTCH-963:
-

Julien, shouldn't the deduplication mechanism be kept separate from purging 
404s? I agree your proposal for finding dupes is better than the current one, 
but I believe it should be kept separate because:
- people may use a Solr update request processor for finding and deleting dupes 
(it offers several hashing algorithms, including fuzzy matching)
- controlled environments where there are no dupes don't need a 404 purger that 
wastes cycles on finding dupes

If so, i believe this issue can be committed for 1.3 after further testing.

 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 
 urls)
 -

 Key: NUTCH-963
 URL: https://issues.apache.org/jira/browse/NUTCH-963
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.3, 2.0

 Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, 
 SolrClean.java


 When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
 that don't exist anymore and return 404).
 This patch creates a new command in the indexer that scans the crawldb 
 looking for these urls and issues delete commands to SOLR.
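
 The scan-and-delete idea can be sketched as follows. This is not the code 
 from the attached SolrClean.java; GoneUrlCleaner, its Status enum and its 
 methods are hypothetical names, and the crawldb is modelled as a plain map. 
 The sketch also assumes that the Solr document id is the page URL. The only 
 part taken from Solr itself is the XML update syntax for deleting by id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collect "gone" URLs and build a Solr XML delete message.
final class GoneUrlCleaner {

  // Stand-in for the crawldb status; only gone / not gone matters here.
  enum Status { FETCHED, GONE }

  // Returns the URLs whose status is GONE, in the map's iteration order.
  static List<String> findGone(Map<String, Status> crawlDb) {
    List<String> gone = new ArrayList<>();
    for (Map.Entry<String, Status> e : crawlDb.entrySet()) {
      if (e.getValue() == Status.GONE) {
        gone.add(e.getKey());
      }
    }
    return gone;
  }

  // Escapes the characters that are special in XML text content.
  static String escapeXml(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  // Builds one <delete> message with an <id> element per URL.
  static String buildDeleteXml(List<String> ids) {
    StringBuilder sb = new StringBuilder("<delete>");
    for (String id : ids) {
      sb.append("<id>").append(escapeXml(id)).append("</id>");
    }
    return sb.append("</delete>").toString();
  }

  public static void main(String[] args) {
    Map<String, Status> crawlDb = new LinkedHashMap<>();
    crawlDb.put("http://example.com/a", Status.FETCHED);
    crawlDb.put("http://example.com/b?x=1&y=2", Status.GONE);
    // Only the GONE entry ends up in the delete message.
    System.out.println(buildDeleteXml(findGone(crawlDb)));
  }
}
```

 A real tool would post such messages to the Solr update handler in batches 
 and commit afterwards; that part is left out here.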



[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Claudio Martella (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008408#comment-13008408
 ] 

Claudio Martella commented on NUTCH-958:


That is the problem. Right now the system does not allow the default scheme to 
be used as a fallback, which is the reason I wrote this patch. This is caused 
by a bug in httpclient.

So, in order to have some control over which kind of authentication is used, 
which is the expected behavior you also describe, the only way is through this 
workaround.



[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008421#comment-13008421
 ] 

Markus Jelsma commented on NUTCH-963:
-

Solr deduplication makes its own (fuzzy) hashes on one or more fields. Separate 
algorithms on different fields can be combined. It does not take into account 
the score of a document if you mean the index-time boost on the document. But 
if there is a separate score (or boost) field then a combined signature on 
body, title and boost will work.
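
A combined signature of that kind can be illustrated with a small sketch. Solr's 
own deduplication is configured through its signature update request processor 
and is not reproduced here; the class CombinedSignature below is a hypothetical 
stand-in that computes an exact (non-fuzzy) MD5 hash over several field values, 
assuming the boost is available as a separate field.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of an exact (non-fuzzy) combined signature over several fields.
final class CombinedSignature {

  // Hashes the given field values in order; a separator byte keeps
  // ("ab", "c") distinct from ("a", "bc").
  static String of(String... fieldValues) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (String value : fieldValues) {
        md5.update(value.getBytes(StandardCharsets.UTF_8));
        md5.update((byte) 0);
      }
      // Left-pad to 32 hex digits so leading zeros are not lost.
      return String.format("%032x", new BigInteger(1, md5.digest()));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 should be available on every JVM", e);
    }
  }

  public static void main(String[] args) {
    String a = of("some body text", "A title", "1.0");
    String b = of("some body text", "A title", "2.0");
    // Same body and title, different boost: the signatures differ.
    System.out.println(a.equals(b)); // prints "false"
  }
}
```

Two documents with identical body, title and boost get the same signature and 
would count as duplicates; a fuzzy algorithm would also match near-identical 
bodies, which this sketch does not attempt.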

All that aside, I agree we should go for a single Nutch command for cleaning an 
index, doing dedup and/or 404 cleaning in one go.

I'll re-review this patch, do further testing and won't forget CHANGES.txt. 
After that I believe we can create a new related issue for the new 
deduplication.



[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008422#comment-13008422
 ] 

Markus Jelsma commented on NUTCH-958:
-

Claudio, I am not sure this workaround should be committed at all. If the 
devs agree, then it should:
- be patched for 2.0 as well
- include a configuration option to enable your workaround, so as to avoid 
breaking other users' HTTP authentication methods



[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Claudio Martella (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008432#comment-13008432
 ] 

Claudio Martella commented on NUTCH-958:


This workaround was necessary for my work and introduced the expected behavior. 
I understand it's not clean, but the current behavior of Nutch isn't correct 
either. Maybe it can be useful for somebody else, and maybe it's enough to keep 
it here so people can find it and apply the patch if they like, so that it 
doesn't have to be committed.

The right way would probably be to simply move to httpclient 4.



[jira] Commented: (NUTCH-967) Upgrade to Tika 0.9

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008453#comment-13008453
 ] 

Markus Jelsma commented on NUTCH-967:
-

That didn't show up in tests or in a crawl, but I'm not using parse-zip anyway. 
How should we proceed with a fix?

 Upgrade to Tika 0.9
 ---

 Key: NUTCH-967
 URL: https://issues.apache.org/jira/browse/NUTCH-967
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Julien Nioche
 Fix For: 1.3, 2.0






[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-03-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008469#comment-13008469
 ] 

Markus Jelsma commented on NUTCH-963:
-

Committed for branch-1.3 in rev 1082944.
- new command bin/nutch solrclean crawldb solrurl
- added solrclean to log4j to allow output to stdout




Differences 1.x and trunk

2011-03-18 Thread Markus Jelsma
Hi all,

I'm trying to port https://issues.apache.org/jira/browse/NUTCH-963 
to trunk after committing it to 1.3. There are of course a lot of differences, 
so I need a little advice on how to proceed:

- instead of using CrawlDB and CrawlDatum we now need WebTableReader?
- trunk uses slf4j instead of commons logging now?
- a page is now represented by storage.WebPage?

Any more good advice on this one? I need it ;)

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


[Nutch Wiki] Update of CommandLineOptions by MarkusJelsma

2011-03-18 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The CommandLineOptions page has been changed by MarkusJelsma.
http://wiki.apache.org/nutch/CommandLineOptions?action=diff&rev1=11&rev2=12

--

  ||[[bin/nutch_segslice]]||Divide data from one segment into several segments||
  ||[[bin/nutch_server]]||Run a search server of IPC connections||
  ||[[bin/nutch solrdedup]]||Deletes duplicate documents from solr||
+ ||[[bin/nutch solrclean]]||Deletes 404 documents from solr||
  ||[[bin/nutch_updatedb]]||Updates the web page and link db from the segment fetcher output||
  ||  ||   ||
  
@@ -37, +38 @@

  
  bin/nutch org.apache.nutch.util.domain.[[DomainStatistics]]
  
- 


Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki

On 3/18/11 4:31 PM, Markus Jelsma wrote:

Hi all,

I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
to trunk after committing to 1.3. There are of course a lot of differences so
i need a little advice on how to proceed:

- instead of using CrawlDB and CrawlDatum we now need WebTableReader?


Actually you need to use StorageUtils to set up Mapper or Reducer 
contexts. See other tools, e.g. Fetcher or Generator.



- trunk uses slf4j instead of commons logging now?


Yes.


- a page is now represented by storage.WebPage?


Yes. When you prepare a Job you also need to specify what fields from 
WebPage you are interested in (and only these fields will be pulled in 
from the storage). This is all handled by StorageUtils methods.
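
The idea of declaring the needed fields up front can be modelled as below. This 
is only an illustration of field projection and not the Nutch StorageUtils or 
WebPage API: FieldProjection, its Field enum and its load method are 
hypothetical names, and the storage row is modelled as a plain map.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of field projection; not the Nutch StorageUtils or WebPage API.
final class FieldProjection {

  // Illustrative field names only.
  enum Field { STATUS, CONTENT, TITLE }

  // Copies only the requested fields from the stored row; other fields are never read.
  static Map<Field, String> load(Map<Field, String> storedRow, Set<Field> wanted) {
    Map<Field, String> page = new EnumMap<>(Field.class);
    for (Field f : wanted) {
      if (storedRow.containsKey(f)) {
        page.put(f, storedRow.get(f));
      }
    }
    return page;
  }

  public static void main(String[] args) {
    Map<Field, String> stored = new EnumMap<>(Field.class);
    stored.put(Field.STATUS, "gone");
    stored.put(Field.CONTENT, "<html>...</html>");
    stored.put(Field.TITLE, "Example");
    // A 404 cleaner would only need the status, so only that field is requested.
    System.out.println(load(stored, EnumSet.of(Field.STATUS))); // prints "{STATUS=gone}"
  }
}
```

For a tool such as a 404 cleaner this means the page content never has to be 
read from the storage backend.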


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Differences 1.x and trunk

2011-03-18 Thread Markus Jelsma
Thanks! I'll try to come up with a working patch in the next few weeks or so.

On Friday 18 March 2011 16:57:20 Andrzej Bialecki wrote:
 On 3/18/11 4:31 PM, Markus Jelsma wrote:
  Hi all,
  
  I'm giving it a try to patch
  https://issues.apache.org/jira/browse/NUTCH-963 to trunk after
  committing to 1.3. There are of course a lot of differences so i need a
  little advice on how to proceed:
  
  - instead of using CrawlDB and CrawlDatum we now need WebTableReader?
 
 Actually you need to use StorageUtils to set up Mapper or Reducer
 contexts. See other tools, e.g. Fetcher or Generator.
 
  - trunk uses slf4j instead of commons logging now?
 
 Yes.
 
  - a page is now represented by storage.WebPage?
 
 Yes. When you prepare a Job you also need to specify what fields from
 WebPage you are interested in (and only these fields will be pulled in
 from the storage). This is all handled by StorageUtils methods.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008508#comment-13008508
 ] 

Julien Nioche commented on NUTCH-958:
-

I had a look at upgrading to a more recent version of httpclient, but it was a 
substantial job as most of the API had changed. We'll definitely do that for 
Nutch 2.0 at some point. 
What about marking this issue as Won't Fix and moving it out of 1.3? As you 
said, people will find your patch here if they have the same problem and can 
easily apply it. 



[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Claudio Martella (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008511#comment-13008511
 ] 

Claudio Martella commented on NUTCH-958:


yes, go on.



[jira] Resolved: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-958.
-

Resolution: Won't Fix

See comments. This patch works around a bug in the underlying httpclient 
library, which will be upgraded later anyway.



Build failed in Jenkins: Nutch-trunk #1430

2011-03-18 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1430/changes

Changes:

[markus] ASF license header was missing

--
[...truncated 1008 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A