date:20110318

[jira] Commented: (NUTCH-872) Change the default fetcher.parse to FALSE

2011-03-18 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008397#comment-13008397
 ] 

Markus Jelsma commented on NUTCH-872:
-

To all: Andrzej has committed this to 1.3 as well, in r1079746 on 2011-03-09.

 Change the default fetcher.parse to FALSE
 -

 Key: NUTCH-872
 URL: https://issues.apache.org/jira/browse/NUTCH-872
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.2, 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 I propose to change this property to false. The reason is that it's a safer 
 default - parsing issues don't lead to a loss of the downloaded content. For 
 larger crawls this is the recommended way to run Fetcher. Users that run 
 smaller crawls can still override it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008401#comment-13008401
 ] 

Markus Jelsma commented on NUTCH-958:
-

Hi Claudio. Is this the desired behaviour? Shouldn't the default be used as a 
fallback if the negotiated scheme fails, instead of forcing the default as the 
only scheme?

 Httpclient scheme priority order fix
 

 Key: NUTCH-958
 URL: https://issues.apache.org/jira/browse/NUTCH-958
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
 Fix For: 1.3

 Attachments: httpclient.diff


 Httpclient will try to authenticate in this order by default: ntlm, digest, 
 basic.
 If you set as default a scheme that comes in this list after a scheme that is 
 negotiated by the server, and this authentication fails, the default scheme 
 will not be tried.
 I.e. if you set digest as default scheme but the server negotiates ntlm, the 
 client will still try ntlm and fail.
 The fix sets the default scheme as the only possible scheme for 
 authentication for the given realm by setting the authentication priorities 
 of httpclient.
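
 The selection behaviour described above can be sketched with a small, 
 self-contained model. This is not the attached httpclient.diff and it does 
 not call the httpclient library; the class and method names are hypothetical, 
 and the model only assumes that the client picks the first scheme in its 
 priority list that the server also offers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AuthSchemeSelection {

    /**
     * Hypothetical model: returns the first scheme in the client's priority
     * list that the server offers, or null if there is no match.
     */
    static String select(List<String> clientPriority, List<String> serverOffers) {
        for (String scheme : clientPriority) {
            if (serverOffers.contains(scheme)) {
                return scheme;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Default order from the issue description: ntlm, digest, basic.
        List<String> defaultPriority = Arrays.asList("ntlm", "digest", "basic");
        List<String> serverOffers = Arrays.asList("ntlm", "digest");

        // With the default order, ntlm wins even if digest was configured.
        System.out.println(select(defaultPriority, serverOffers));

        // The workaround: restrict the priority list to the configured scheme.
        List<String> restricted = Collections.singletonList("digest");
        System.out.println(select(restricted, serverOffers));
    }
}
```

 In this model the first call selects ntlm and the second selects digest, 
 which is the effect the workaround aims for. The cost is that no other scheme 
 can be tried for that realm.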


[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-03-18 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008402#comment-13008402
 ] 

Markus Jelsma commented on NUTCH-963:
-

Julien, shouldn't the deduplication mechanism be kept separate from purging 
404s? I agree your proposal for finding dupes is better than the current one, 
but I believe it should be kept separate because:
- people may use a Solr update request processor for finding and deleting dupes 
(it has several hashing algorithms, incl. fuzzy matching)
- controlled environments where there are no dupes don't need a 404 purger that 
wastes cycles on finding dupes

If so, I believe this issue can be committed for 1.3 after further testing.

 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 
 urls)
 -

 Key: NUTCH-963
 URL: https://issues.apache.org/jira/browse/NUTCH-963
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.3, 2.0

 Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, 
 SolrClean.java


 When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
 that don't exist anymore and return 404).
 This patch creates a new command in the indexer that scans the crawldb 
 looking for these urls and issues delete commands to SOLR.
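
 The scan described above can be illustrated with a minimal, self-contained 
 sketch. It is not the attached SolrClean.java or Solr404Deleter.java: the 
 Status enum and the in-memory map are hypothetical stand-ins for the crawldb 
 and its STATUS_DB_GONE entries, and no Solr client is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GoneUrlScan {

    /** Hypothetical stand-in for a crawldb status; not Nutch's CrawlDatum. */
    enum Status { FETCHED, GONE, UNFETCHED }

    /** Collects the URLs whose status is GONE, in iteration order. */
    static List<String> urlsToDelete(Map<String, Status> crawlDb) {
        List<String> gone = new ArrayList<>();
        for (Map.Entry<String, Status> entry : crawlDb.entrySet()) {
            if (entry.getValue() == Status.GONE) {
                gone.add(entry.getKey());
            }
        }
        return gone;
    }

    public static void main(String[] args) {
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.com/a", Status.FETCHED);
        crawlDb.put("http://example.com/old", Status.GONE);
        crawlDb.put("http://example.com/b", Status.UNFETCHED);

        // Each collected URL would become one delete request to the index.
        System.out.println(urlsToDelete(crawlDb));
    }
}
```

 Only the entry marked GONE is collected here; pages that were fetched or not 
 yet fetched are left alone, so nothing else would be deleted from the index.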


[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Claudio Martella (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008408#comment-13008408
]

Claudio Martella commented on NUTCH-958:

That is the problem. Right now the system does not allow the default scheme to
be used as a fallback, which is the reason I wrote this patch. That is caused
by a bug in httpclient.

So, in order to have some control over the kind of authentication that is used,
which is the expected behavior you also describe, the only way is through this
workaround.

Httpclient scheme priority order fix

Key: NUTCH-958
URL: https://issues.apache.org/jira/browse/NUTCH-958
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
Fix For: 1.3

Attachments: httpclient.diff

Httpclient will try to authenticate in this order by default: ntlm, digest,
basic.
If you set as default a scheme that comes in this list after a scheme that is
negotiated by the server, and this authentication fails, the default scheme
will not be tried.
I.e. if you set digest as default scheme but the server negotiates ntlm, the
client will still try ntlm and fail.
The fix sets the default scheme as the only possible scheme for
authentication for the given realm by setting the authentication priorities
of httpclient.


[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-03-18 Thread Markus Jelsma (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008421#comment-13008421
]

Markus Jelsma commented on NUTCH-963:
-

Solr deduplication makes its own (fuzzy) hashes on one or more fields. Separate
algorithms on different fields can be combined. It does not take into account
the score of a document if you mean the index-time boost on the document. But
if there is a separate score (or boost) field then a combined signature on
body, title and boost will work.

All that aside, I agree we should go for a single Nutch command for cleaning an
index, doing dedup and/or 404 cleaning in one go.

I'll re-review this patch, do further testing, and won't forget CHANGES.txt.
After that I believe we can create a new related issue for the new
deduplication.

Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404
urls)
-

Key: NUTCH-963
URL: https://issues.apache.org/jira/browse/NUTCH-963
Project: Nutch
Issue Type: New Feature
Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.3, 2.0

Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java,
SolrClean.java

When issuing recrawls it can happen that certain urls have expired (i.e. URLs
that don't exist anymore and return 404).
This patch creates a new command in the indexer that scans the crawldb
looking for these urls and issues delete commands to SOLR.


[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008422#comment-13008422
 ] 

Markus Jelsma commented on NUTCH-958:
-

Claudio, I am not sure if this workaround should be committed at all. If the 
devs agree, then it should:
- be patched for 2.0 as well
- add a configuration option to enable your workaround, so as to prevent 
breaking other users' HTTP authentication methods

 Httpclient scheme priority order fix
 

 Key: NUTCH-958
 URL: https://issues.apache.org/jira/browse/NUTCH-958
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
 Fix For: 1.3

 Attachments: httpclient.diff


 Httpclient will try to authenticate in this order by default: ntlm, digest, 
 basic.
 If you set as default a scheme that comes in this list after a scheme that is 
 negotiated by the server, and this authentication fails, the default scheme 
 will not be tried.
 I.e. if you set digest as default scheme but the server negotiates ntlm, the 
 client will still try ntlm and fail.
 The fix sets the default scheme as the only possible scheme for 
 authentication for the given realm by setting the authentication priorities 
 of httpclient.


[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Claudio Martella (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008432#comment-13008432
]

Claudio Martella commented on NUTCH-958:

This workaround was necessary for my work and introduced the expected behavior.
I understand it's not clean, but the actual behavior of Nutch isn't correct
either. Maybe it can be useful for somebody else, and maybe it's enough to keep
it here so people can find it and apply the patch if they like, so that it
doesn't have to be committed.

The right fix would probably just be moving to httpclient4.

Httpclient scheme priority order fix

Key: NUTCH-958
URL: https://issues.apache.org/jira/browse/NUTCH-958
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
Fix For: 1.3

Attachments: httpclient.diff


[jira] Commented: (NUTCH-967) Upgrade to Tika 0.9

2011-03-18 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008453#comment-13008453
 ] 

Markus Jelsma commented on NUTCH-967:
-

That didn't show up in tests or in a crawl, but I'm not using parse-zip anyway. 
How do we proceed with a fix?

 Upgrade to Tika 0.9
 ---

 Key: NUTCH-967
 URL: https://issues.apache.org/jira/browse/NUTCH-967
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Julien Nioche
 Fix For: 1.3, 2.0





[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

2011-03-18 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008469#comment-13008469
 ] 

Markus Jelsma commented on NUTCH-963:
-

Committed for branch-1.3 in rev 1082944.
- new command bin/nutch solrclean crawldb solrurl
- added solrclean to log4j to allow output to stdout


 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 
 urls)
 -

 Key: NUTCH-963
 URL: https://issues.apache.org/jira/browse/NUTCH-963
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.3, 2.0

 Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, 
 SolrClean.java


 When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
 that don't exist anymore and return 404).
 This patch creates a new command in the indexer that scans the crawldb 
 looking for these urls and issues delete commands to SOLR.


Differences 1.x and trunk

2011-03-18 Thread Markus Jelsma

Hi all,

I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963 
to trunk after committing to 1.3. There are of course a lot of differences so 
I need a little advice on how to proceed:

- instead of using CrawlDB and CrawlDatum we now need WebTableReader?
- trunk uses slf4j instead of commons logging now?
- a page is now represented by storage.WebPage?

Any more good advice on this one? I need it ;)

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

[Nutch Wiki] Update of CommandLineOptions by MarkusJelsma

2011-03-18 Thread Apache Wiki

Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The CommandLineOptions page has been changed by MarkusJelsma.
http://wiki.apache.org/nutch/CommandLineOptions?action=diff&rev1=11&rev2=12

--

  ||[[bin/nutch_segslice]]||Divide data from one segment into several segments||
  ||[[bin/nutch_server]]||Run a search server of IPC connections||
  ||[[bin/nutch solrdedup]]||Deletes duplicate documents from solr||
+ ||[[bin/nutch solrclean]]||Deletes 404 documents from solr||
  ||[[bin/nutch_updatedb]]||Updates the web page and link db from the segment fetcher output||
  ||  ||   ||
  
@@ -37, +38 @@

  
  bin/nutch org.apache.nutch.util.domain.[[DomainStatistics]]
  
-

Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki


On 3/18/11 4:31 PM, Markus Jelsma wrote:
> Hi all,
>
> I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
> to trunk after committing to 1.3. There are of course a lot of differences so
> i need a little advice on how to proceed:
>
> - instead of using CrawlDB and CrawlDatum we now need WebTableReader?

Actually you need to use StorageUtils to set up Mapper or Reducer 
contexts. See other tools, e.g. Fetcher or Generator.

> - trunk uses slf4j instead of commons logging now?

Yes.

> - a page is now represented by storage.WebPage?

Yes. When you prepare a Job you also need to specify what fields from 
WebPage you are interested in (and only these fields will be pulled in 
from the storage). This is all handled by StorageUtils methods.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Differences 1.x and trunk

2011-03-18 Thread Markus Jelsma

Thanks! I'll try and come up with a working patch in the next few weeks or so.

On Friday 18 March 2011 16:57:20 Andrzej Bialecki wrote:
> On 3/18/11 4:31 PM, Markus Jelsma wrote:
> > Hi all,
> >
> > I'm giving it a try to patch
> > https://issues.apache.org/jira/browse/NUTCH-963 to trunk after
> > committing to 1.3. There are of course a lot of differences so i need a
> > little advice on how to proceed:
> >
> > - instead of using CrawlDB and CrawlDatum we now need WebTableReader?
>
> Actually you need to use StorageUtils to set up Mapper or Reducer
> contexts. See other tools, e.g. Fetcher or Generator.
>
> > - trunk uses slf4j instead of commons logging now?
>
> Yes.
>
> > - a page is now represented by storage.WebPage?
>
> Yes. When you prepare a Job you also need to specify what fields from
> WebPage you are interested in (and only these fields will be pulled in
> from the storage). This is all handled by StorageUtils methods.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008508#comment-13008508
 ] 

Julien Nioche commented on NUTCH-958:
-

I had a look at upgrading to a more recent version of httpclient, but it was a 
substantial job as most of the API had changed. We'll definitely do that for 
Nutch 2.0 at some point. 
What about marking this issue as Won't Fix and moving it out of 1.3? As you 
said, people will find your patch here if they have the same problem and can 
easily apply it. 

 Httpclient scheme priority order fix
 

 Key: NUTCH-958
 URL: https://issues.apache.org/jira/browse/NUTCH-958
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
 Fix For: 1.3

 Attachments: httpclient.diff


 Httpclient will try to authenticate in this order by default: ntlm, digest, 
 basic.
 If you set as default a scheme that comes in this list after a scheme that is 
 negotiated by the server, and this authentication fails, the default scheme 
 will not be tried.
 I.e. if you set digest as default scheme but the server negotiates ntlm, the 
 client will still try ntlm and fail.
 The fix sets the default scheme as the only possible scheme for 
 authentication for the given realm by setting the authentication priorities 
 of httpclient.


[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Claudio Martella (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008511#comment-13008511
 ] 

Claudio Martella commented on NUTCH-958:


yes, go on.

 Httpclient scheme priority order fix
 

 Key: NUTCH-958
 URL: https://issues.apache.org/jira/browse/NUTCH-958
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
 Fix For: 1.3

 Attachments: httpclient.diff


 Httpclient will try to authenticate in this order by default: ntlm, digest, 
 basic.
 If you set as default a scheme that comes in this list after a scheme that is 
 negotiated by the server, and this authentication fails, the default scheme 
 will not be tried.
 I.e. if you set digest as default scheme but the server negotiates ntlm, the 
 client will still try ntlm and fail.
 The fix sets the default scheme as the only possible scheme for 
 authentication for the given realm by setting the authentication priorities 
 of httpclient.


[jira] Resolved: (NUTCH-958) Httpclient scheme priority order fix

2011-03-18 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-958.
-

Resolution: Won't Fix

See comments. This patch fixes a bug in the underlying httpclient library, 
which will be upgraded later anyway.

 Httpclient scheme priority order fix
 

 Key: NUTCH-958
 URL: https://issues.apache.org/jira/browse/NUTCH-958
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
 Fix For: 1.3

 Attachments: httpclient.diff


 Httpclient will try to authenticate in this order by default: ntlm, digest, 
 basic.
 If you set as default a scheme that comes in this list after a scheme that is 
 negotiated by the server, and this authentication fails, the default scheme 
 will not be tried.
 I.e. if you set digest as default scheme but the server negotiates ntlm, the 
 client will still try ntlm and fail.
 The fix sets the default scheme as the only possible scheme for 
 authentication for the given realm by setting the authentication priorities 
 of httpclient.


Build failed in Jenkins: Nutch-trunk #1430

2011-03-18 Thread Apache Hudson Server

See https://hudson.apache.org/hudson/job/Nutch-trunk/1430/changes

Changes:

[markus] ASF license header was missing

--
[...truncated 1008 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A
