[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471480#comment-13471480 ]

Ferdy Galema commented on NUTCH-1457:

Included effort is resolving the conflict between the time the document was fetched and the time the document ought to be fetched.

Nutch2 Refactor the update process so that fetched items are only processed once
Key: NUTCH-1457
URL: https://issues.apache.org/jira/browse/NUTCH-1457
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1475:

Affects Version/s: (was: nutchgora)
                   1.5.1

This is an issue for the 1.x branch as well

Nutch 2.1 Index-More Plugin -- A better fall back value for date field
Key: NUTCH-1475
URL: https://issues.apache.org/jira/browse/NUTCH-1475
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1, 1.5.1
Environment: All
Reporter: James Sullivan
Priority: Minor
Labels: index-more, plugins
Attachments: index-more-2x.patch
Original Estimate: 1h
Remaining Estimate: 1h

Among other fields, the more plugin for Nutch 2.x provides a last modified field and a date field for the Solr index. The last modified field is the last modified date from the HTTP headers if available; if not available, it is left empty. Currently, the date field is the same as the last modified field unless that field is empty, in which case getFetchTime is used as a fall back. I think getFetchTime is not a good fall back, as it is the next fetch time and often a month or more in the future, which doesn't make sense for the date field. Users do not expect webpages/documents with future dates. A more sensible fallback would be the current date at the time the page is indexed. This is possible by simply changing line 97 of
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
from
    time = page.getFetchTime(); // use fetch time
to
    time = new Date().getTime();
Users interested in the getFetchTime value can still get it from the tstamp field.
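The fallback proposed in the description can be sketched in isolation. This is only an illustration, not the code in MoreIndexingFilter: the class DateFallback and the method resolveDate are hypothetical names, and it assumes that a missing Last-Modified header is represented by a value of 0 or less.

```java
import java.util.Date;

/** Hypothetical helper (not part of Nutch): picks the value for the "date" field. */
final class DateFallback {
    private DateFallback() {}

    /**
     * Returns lastModified when it is set, otherwise the supplied indexing time.
     * Assumption: a missing Last-Modified header shows up as a value <= 0.
     */
    static long resolveDate(long lastModified, long indexTime) {
        return lastModified > 0 ? lastModified : indexTime;
    }

    /** Convenience overload that falls back to the current time, as the issue proposes. */
    static long resolveDate(long lastModified) {
        return resolveDate(lastModified, new Date().getTime());
    }
}
```

Passing the indexing time as a parameter keeps the fallback testable; the single-argument overload mirrors the one-line change quoted in the description.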
[jira] [Commented] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471577#comment-13471577 ]

Max Dzyuba commented on NUTCH-827:

Hi Jasper,

Thanks, removing that line fixed the exception problem. At the moment, the log file doesn't have any errors related to HTTPclient plugin or authentication process. However, my tests show that the cookie can't be read by the test auth page I've set up. Is there an easy way to verify if the cookie was created by Nutch and stored as intended?

Thanks,
Max

HTTP POST Authentication
Key: NUTCH-827
URL: https://issues.apache.org/jira/browse/NUTCH-827
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 1.1, nutchgora
Reporter: Jasper van Veghel
Priority: Minor
Labels: authentication
Fix For: 1.6
Attachments: nutch-http-cookies.patch

I've created a patch against the trunk which adds support for very rudimentary POST-based authentication support. It takes a link from nutch-site.xml with a site to POST to and its respective parameters (username, password, etc.). It then checks upon every request whether any cookies have been initialized, and if none have, it fetches them from the given link.

This isn't perfect but Works For Me (TM) as I generally only need to retrieve results from a single domain and so have no cookie overlap (i.e. if the domain cookies expire, all cookies disappear from the HttpClient and I can simply re-fetch them). A natural improvement would be to be able to specify one particular cookie to check the expiration-date against. If anyone is interested in this beside me I'd be glad to put some more effort into making this more universally applicable.
[jira] [Commented] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471600#comment-13471600 ]

Jasper van Veghel commented on NUTCH-827:

{code}
+Http.LOG.trace("url: " + url +
+    "; status code: " + code +
+    "; cookies received: " + Http.getClient().getState().getCookies().length);
{code}

If you turn on TRACE logging, you should see messages like that.

HTTP POST Authentication
Key: NUTCH-827
URL: https://issues.apache.org/jira/browse/NUTCH-827
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 1.1, nutchgora
Reporter: Jasper van Veghel
Priority: Minor
Labels: authentication
Fix For: 1.6
Attachments: nutch-http-cookies.patch
[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1475:

Attachment: index-more-1xand2x.patch

Attaching new patch that patches both 1.x and 2.x

Nutch 2.1 Index-More Plugin -- A better fall back value for date field
Key: NUTCH-1475
URL: https://issues.apache.org/jira/browse/NUTCH-1475
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1, 1.5.1
Environment: All
Reporter: James Sullivan
Priority: Minor
Labels: index-more, plugins
Attachments: index-more-1xand2x.patch, index-more-2x.patch
Original Estimate: 1h
Remaining Estimate: 1h
[jira] [Created] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place
Sebastian Nagel created NUTCH-1476:

Summary: SegmentReader getStats should set parsed = -1 if no parsing took place
Key: NUTCH-1476
URL: https://issues.apache.org/jira/browse/NUTCH-1476
Project: Nutch
Issue Type: Bug
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Trivial
Fix For: 1.6
Attachments: NUTCH-1476.patch

The method getStats in SegmentReader sets the number of parsed documents (and also the number of parseErrors) to 0 if no parsing took place for a segment. The values should be set to -1 analogous to the number of fetched docs and fetchErrors.
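The convention asked for above (-1 meaning "this step did not run", 0 meaning "it ran and found nothing") can be illustrated with a small sketch. The class SegmentStatsSketch and its method recordParse are hypothetical and are not the SegmentReader code or the attached patch; only the field names are taken from the issue text.

```java
/** Hypothetical stand-in for a per-segment statistics holder (not the Nutch class). */
final class SegmentStatsSketch {
    // -1 marks "step did not take place", matching fetched and fetchErrors.
    long fetched = -1;
    long fetchErrors = -1;
    long parsed = -1;
    long parseErrors = -1;

    /**
     * Records one parse result. Assumed to be called only when parse data
     * exists for the segment, so the counters leave -1 on the first record.
     */
    void recordParse(boolean success) {
        if (parsed < 0) {
            parsed = 0;
            parseErrors = 0;
        }
        parsed++;              // every parse record is counted
        if (!success) {
            parseErrors++;     // failed parses are counted separately
        }
    }
}
```

With this layout a caller can tell an unparsed segment (parsed == -1) apart from a parsed segment that happened to contain no documents (parsed == 0).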
[jira] [Updated] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place
[ https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1476:

Attachment: NUTCH-1476.patch

SegmentReader getStats should set parsed = -1 if no parsing took place
Key: NUTCH-1476
URL: https://issues.apache.org/jira/browse/NUTCH-1476
Project: Nutch
Issue Type: Bug
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Trivial
Fix For: 1.6
Attachments: NUTCH-1476.patch
[jira] [Assigned] (NUTCH-1252) SegmentReader -get shows wrong data
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-1252:

Assignee: Sebastian Nagel

SegmentReader -get shows wrong data
Key: NUTCH-1252
URL: https://issues.apache.org/jira/browse/NUTCH-1252
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.6
Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch

The command/option -get of the SegmentReader may show wrong data associated with the given URL.
To reproduce:
{code}
% mkdir -p test_readseg/urls
% echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds
% nutch inject test_readseg/crawldb test_readseg/urls
Injector: starting at 2012-01-18 09:32:25
Injector: crawlDb: test_readseg/crawldb
Injector: urlDir: test_readseg/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
% nutch generate test_readseg/crawldb test_readseg/segments/
Generator: starting at 2012-01-18 09:32:30
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test_readseg/segments/20120118093232
Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
% nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' -nocontent -noparse -nofetch -noparsedata -noparsetext
SegmentReader: get 'http://nutch.apache.org/'
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Jan 18 09:32:26 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 10.0
Signature: null
Metadata: _ngt_: 1326875550401test: AbcTest
{code}
The metadata and the score indicate that the CrawlDatum shown is the wrong one (the one associated with http://abc.test/, not the one for http://nutch.apache.org/).
[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471915#comment-13471915 ]

Sebastian Nagel commented on NUTCH-1344:

Is there any reason why https should be treated differently from http (and ftp)?

BasicURLNormalizer to normalize https same as http
Key: NUTCH-1344
URL: https://issues.apache.org/jira/browse/NUTCH-1344
Project: Nutch
Issue Type: Bug
Affects Versions: nutchgora, 1.6
Reporter: Sebastian Nagel
Attachments: NUTCH-1344.patch

Most of the normalization done by BasicURLNormalizer (lowercasing host, removing default port, removal of page anchors, cleaning . and .. in the path) is not done for URLs with protocol https.
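The four normalization steps listed in the description can be applied uniformly to http, https and ftp, roughly as in the sketch below. It is built on java.net.URI and is not the BasicURLNormalizer implementation or the attached patch; the class name SchemeAgnosticNormalizer and its methods are hypothetical, and user info in the URL is deliberately ignored to keep the example short.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical sketch (not Nutch code): same normalization for http, https and ftp. */
final class SchemeAgnosticNormalizer {
    private SchemeAgnosticNormalizer() {}

    /** Default port per handled scheme, or -1 if the scheme is not handled. */
    static int defaultPort(String scheme) {
        switch (scheme) {
            case "http":  return 80;
            case "https": return 443;
            case "ftp":   return 21;
            default:      return -1;
        }
    }

    static String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url).normalize();          // cleans "." and ".." path segments
        if (uri.getScheme() == null || uri.getHost() == null) {
            return url;                              // not a hierarchical URL, leave as-is
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        int defaultPort = defaultPort(scheme);
        if (defaultPort == -1) {
            return url;                              // scheme not handled, leave as-is
        }
        StringBuilder sb = new StringBuilder(scheme).append("://")
                .append(uri.getHost().toLowerCase(Locale.ROOT));   // lowercase host
        if (uri.getPort() != -1 && uri.getPort() != defaultPort) {
            sb.append(':').append(uri.getPort());    // keep only non-default ports
        }
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();                        // fragment (page anchor) is dropped
    }
}
```

Under these assumptions, normalize("HTTPS://Example.COM:443/a/./b/../c#frag") yields "https://example.com/a/c", the same shape that the http variant of that URL produces.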
Jenkins build is back to normal : Nutch-nutchgora #373
See https://builds.apache.org/job/Nutch-nutchgora/373/