[jira] Commented: (NUTCH-872) Change the default fetcher.parse to FALSE
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008397#comment-13008397 ]

Markus Jelsma commented on NUTCH-872:
-------------------------------------

To all: Andrzej has committed this to 1.3 as well, in r1079746 on 2011-03-09.

Change the default fetcher.parse to FALSE
-----------------------------------------

Key: NUTCH-872
URL: https://issues.apache.org/jira/browse/NUTCH-872
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.2, 2.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 2.0

I propose to change this property to false. The reason is that it's a safer default: parsing issues don't lead to a loss of the downloaded content. For larger crawls this is the recommended way to run Fetcher. Users who run smaller crawls can still override it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008401#comment-13008401 ]

Markus Jelsma commented on NUTCH-958:
-------------------------------------

Hi Claudio. Is this the desired behaviour? Shouldn't the default be used as a fallback if the negotiated scheme fails, instead of forcing the default as the only scheme?

Httpclient scheme priority order fix
------------------------------------

Key: NUTCH-958
URL: https://issues.apache.org/jira/browse/NUTCH-958
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Claudio Martella
Fix For: 1.3
Attachments: httpclient.diff

Httpclient will try to authenticate in this order by default: ntlm, digest, basic. If you set as default a scheme that comes in this list after a scheme that is negotiated by the server, and this authentication fails, the default scheme will not be tried. I.e. if you set digest as the default scheme but the server negotiates ntlm, the client will still try ntlm and fail.
The fix sets the default scheme as the only possible scheme for authentication for the given realm by setting the authentication priorities of httpclient.
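The selection behaviour described in the NUTCH-958 report above can be modelled roughly as follows. This is only a toy sketch in plain Java, not the httpclient API and not the attached httpclient.diff: the class `AuthSchemeChooser` and its `choose` method are hypothetical names, and the model assumes the client simply takes the first scheme in its priority list that the server also offers, with no retry afterwards.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Toy model of auth scheme selection (hypothetical, not the httpclient API). */
final class AuthSchemeChooser {

    /** Returns the first scheme in the client's priority list that the server offers. */
    static Optional<String> choose(List<String> clientPriority, List<String> serverOffers) {
        for (String scheme : clientPriority) {
            if (serverOffers.contains(scheme)) {
                return Optional.of(scheme);
            }
        }
        // No scheme in common: nothing to authenticate with.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed scenario: the server offers ntlm and digest.
        List<String> serverOffers = Arrays.asList("ntlm", "digest");

        // Default order from the report: ntlm wins even if digest is the configured default.
        List<String> defaultPriority = Arrays.asList("ntlm", "digest", "basic");
        System.out.println(choose(defaultPriority, serverOffers)); // prints Optional[ntlm]

        // Workaround as described: the configured default is the only allowed scheme.
        List<String> restrictedPriority = Arrays.asList("digest");
        System.out.println(choose(restrictedPriority, serverOffers)); // prints Optional[digest]
    }
}
```

Under these assumptions the trade-off raised in the comment is visible: restricting the list gives control over the scheme, but there is no fallback left if that single scheme fails.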
[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
[ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008402#comment-13008402 ]

Markus Jelsma commented on NUTCH-963:
-------------------------------------

Julien, shouldn't the deduplication mechanism be kept separate from purging 404s? I agree your proposal for finding dupes is better than the current one, but I believe it should be kept separate because:
- people may use a Solr update request processor for finding and deleting dupes (it has several hashing algorithms, including fuzzy matching)
- controlled environments where there are no dupes don't need a 404 purger that wastes cycles on finding dupes

If so, I believe this issue can be committed for 1.3 after further testing.

Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
---------------------------------------------------------------------------------

Key: NUTCH-963
URL: https://issues.apache.org/jira/browse/NUTCH-963
Project: Nutch
Issue Type: New Feature
Components: indexer
Affects Versions: 2.0
Reporter: Claudio Martella
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.3, 2.0
Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, SolrClean.java

When issuing recrawls it can happen that certain urls have expired (i.e. URLs that don't exist anymore and return 404). This patch creates a new command in the indexer that scans the crawldb looking for these urls and issues delete commands to SOLR.
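The scan described in the NUTCH-963 report above (walk the crawldb, pick out the entries marked gone, delete them from the index) could be sketched as below. This is a minimal illustration only, not the attached SolrClean.java: the class `GoneUrlCollector`, the `Status` enum and the in-memory map standing in for the crawldb are all hypothetical, and no Solr request is actually sent; the method only returns the URLs a real tool would delete.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the 404 cleanup scan (hypothetical, not the Nutch or Solr API). */
final class GoneUrlCollector {

    /** Simplified stand-in for the crawldb status values; DB_GONE mirrors STATUS_DB_GONE. */
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE }

    /** Returns the URLs whose status is DB_GONE, in crawldb iteration order. */
    static List<String> collectGone(Map<String, Status> crawlDb) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, Status> entry : crawlDb.entrySet()) {
            if (entry.getValue() == Status.DB_GONE) {
                // A real tool would queue a delete request for this document id here.
                toDelete.add(entry.getKey());
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        // Example data only; the URLs are placeholders.
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.com/", Status.DB_FETCHED);
        crawlDb.put("http://example.com/old", Status.DB_GONE);
        crawlDb.put("http://example.com/new", Status.DB_UNFETCHED);
        System.out.println(collectGone(crawlDb)); // prints [http://example.com/old]
    }
}
```

Keeping the scan free of any duplicate detection, as in this sketch, matches the argument in the comment that dedup and 404 purging are separate concerns.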
[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008408#comment-13008408 ]

Claudio Martella commented on NUTCH-958:
----------------------------------------

That is the problem. Right now the system does not allow the default scheme to be used as a fallback, which is the reason I wrote this patch. That is caused by a bug in httpclient. So, in order to have some control over the kind of authentication that is used, which is the expected behavior you also describe, the only way is through this workaround.
[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
[ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008421#comment-13008421 ]

Markus Jelsma commented on NUTCH-963:
-------------------------------------

Solr deduplication makes its own (fuzzy) hashes on one or more fields. Separate algorithms on different fields can be combined. It does not take into account the score of a document, if you mean the index-time boost on the document. But if there is a separate score (or boost) field, then a combined signature on body, title and boost will work.

All that aside, I agree we should go for a single Nutch command for cleaning an index, doing dedup and/or 404 cleaning in one swift go.

I'll re-review this patch, do further testing, and won't forget CHANGES.txt. After that I believe we can create a new related issue for the new deduplication.
[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008422#comment-13008422 ]

Markus Jelsma commented on NUTCH-958:
-------------------------------------

Claudio, I am not sure if this workaround should be committed at all. If the devs agree, then it should:
- be patched for 2.0 as well
- add a configuration option to enable your workaround, so as to prevent breaking other users' HTTP authentication methods
[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008432#comment-13008432 ]

Claudio Martella commented on NUTCH-958:
----------------------------------------

This workaround was necessary for my work and introduced an expected behavior. I understand it's not clean, but the actual behavior of Nutch isn't correct either. Maybe it can be useful for somebody else, and maybe it's enough to keep it here so people can find it and apply the patch if they like, so that it doesn't have to be committed. The right way would probably be to move to httpclient4.
[jira] Commented: (NUTCH-967) Upgrade to Tika 0.9
[ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008453#comment-13008453 ]

Markus Jelsma commented on NUTCH-967:
-------------------------------------

That didn't show up in the tests or in a crawl, but I'm not using parse-zip anyway. How should we proceed with a fix?

Upgrade to Tika 0.9
-------------------

Key: NUTCH-967
URL: https://issues.apache.org/jira/browse/NUTCH-967
Project: Nutch
Issue Type: Task
Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Julien Nioche
Fix For: 1.3, 2.0
[jira] Commented: (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
[ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008469#comment-13008469 ]

Markus Jelsma commented on NUTCH-963:
-------------------------------------

Committed for branch-1.3 in rev 1082944.
- new command bin/nutch solrclean crawldb solrurl
- added solrclean to log4j to allow output to stdout
Differences 1.x and trunk
Hi all,

I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963 to trunk after committing to 1.3. There are of course a lot of differences, so I need a little advice on how to proceed:
- instead of using CrawlDB and CrawlDatum we now need WebTableReader?
- trunk uses slf instead of commons logging now?
- a page is now represented by storage.WebPage?

Any more good advice on this one? I need it ;)

Cheers,

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
[Nutch Wiki] Update of CommandLineOptions by MarkusJelsma
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The CommandLineOptions page has been changed by MarkusJelsma.
http://wiki.apache.org/nutch/CommandLineOptions?action=diff&rev1=11&rev2=12

--------------------------------------------------

  ||[[bin/nutch_segslice]]||Divide data from one segement into several segments||
  ||[[bin/nutch_server]]||Run a search server of IPC connections||
  ||[[bin/nutch solrdedup]]||Deletes duplicate documents from solr||
+ ||[[bin/nutch solrclean]]||Deletes 404 documents from solr||
  ||[[bin/nutch_updatedb]]||Updates the web page and link db from the segment fetcher output||
  || || ||

@@ -37, +38 @@

  bin/nutch org.apache.nutch.util.domain.[[DomainStatistics]]
-
Re: Differences 1.x and trunk
On 3/18/11 4:31 PM, Markus Jelsma wrote:
> Hi all,
>
> I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
> to trunk after committing to 1.3. There are of course a lot of differences,
> so I need a little advice on how to proceed:
> - instead of using CrawlDB and CrawlDatum we now need WebTableReader?

Actually you need to use StorageUtils to set up Mapper or Reducer contexts. See other tools, e.g. Fetcher or Generator.

> - trunk uses slf instead of commons logging now?

Yes.

> - a page is now represented by storage.WebPage?

Yes. When you prepare a Job you also need to specify what fields from WebPage you are interested in (and only these fields will be pulled in from the storage). This is all handled by StorageUtils methods.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Differences 1.x and trunk
Thanks! I'll try to come up with a working patch in the next few weeks or so.

On Friday 18 March 2011 16:57:20 Andrzej Bialecki wrote:
> On 3/18/11 4:31 PM, Markus Jelsma wrote:
> > Hi all,
> >
> > I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
> > to trunk after committing to 1.3. There are of course a lot of differences,
> > so I need a little advice on how to proceed:
> > - instead of using CrawlDB and CrawlDatum we now need WebTableReader?
>
> Actually you need to use StorageUtils to set up Mapper or Reducer contexts.
> See other tools, e.g. Fetcher or Generator.
>
> > - trunk uses slf instead of commons logging now?
>
> Yes.
>
> > - a page is now represented by storage.WebPage?
>
> Yes. When you prepare a Job you also need to specify what fields from
> WebPage you are interested in (and only these fields will be pulled in from
> the storage). This is all handled by StorageUtils methods.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008508#comment-13008508 ]

Julien Nioche commented on NUTCH-958:
-------------------------------------

I had a look at upgrading to a more recent version of httpclient, but it was a substantial job as most of the API had changed. We'll definitely do that for Nutch 2.0 at some point.
What about marking this issue as won't fix and moving it out of 1.3? As you said, people will find your patch here if they have the same problem and can easily apply it.
[jira] Commented: (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008511#comment-13008511 ]

Claudio Martella commented on NUTCH-958:
----------------------------------------

Yes, go on.
[jira] Resolved: (NUTCH-958) Httpclient scheme priority order fix
[ https://issues.apache.org/jira/browse/NUTCH-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-958.
---------------------------------

Resolution: Won't Fix

See comments. This patch fixes a bug in the underlying httpclient library, which will be upgraded later anyway.
Build failed in Jenkins: Nutch-trunk #1430
See https://hudson.apache.org/hudson/job/Nutch-trunk/1430/changes

Changes:

[markus] ASF licene header was missing

------------------------------------------
[...truncated 1008 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A