[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2012-10-08 Thread Ferdy Galema (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471480#comment-13471480
 ] 

Ferdy Galema commented on NUTCH-1457:
-

The effort includes resolving the conflict between the time the document was 
fetched and the time the document ought to be fetched.

 Nutch2 Refactor the update process so that fetched items are only processed 
 once
 

 Key: NUTCH-1457
 URL: https://issues.apache.org/jira/browse/NUTCH-1457
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1475:
-

Affects Version/s: (was: nutchgora)
   1.5.1

This is an issue for the 1.x branch as well.

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Attachments: index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Among other fields, the more plugin for Nutch 2.x provides a last modified 
 field and a date field for the Solr index. The last modified field holds the 
 last modified date from the HTTP headers if available; otherwise it is left 
 empty. Currently, the date field is the same as the last modified field 
 unless that field is empty, in which case getFetchTime is used as a fallback. 
 I think getFetchTime is not a good fallback, as it is the next fetch time and 
 is often a month or more in the future, which doesn't make sense for the date 
 field. Users do not expect webpages/documents with future dates. A more 
 sensible fallback would be the current date at the time of indexing. 
 This is possible by simply changing line 97 of 
 https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
  from
 time = page.getFetchTime(); // use fetch time
 to
 time = new Date().getTime();
 Users interested in the getFetchTime value can still get it from the tstamp 
 field.
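
The fallback described above can be sketched in isolation. This is only an illustration, not the actual MoreIndexingFilter code: the class name DateFallback, the method resolveDate, and the convention that a value of 0 or less means "no last modified date available" are all assumptions made for the example.

```java
public class DateFallback {
    /**
     * Hypothetical helper: picks the value for the date field.
     * Uses the last modified time when one is available (assumed here to
     * mean a value greater than 0), otherwise the time of indexing.
     */
    public static long resolveDate(long lastModifiedMillis, long indexTimeMillis) {
        return lastModifiedMillis > 0 ? lastModifiedMillis : indexTimeMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A page with a last modified header keeps that date.
        System.out.println(resolveDate(1349654400000L, now));
        // A page without one falls back to the indexing time, never a future fetch time.
        System.out.println(resolveDate(0L, now) == now);
    }
}
```

Passing the indexing time in as a parameter, rather than calling the clock inside the method, keeps the choice easy to test.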



[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2012-10-08 Thread Max Dzyuba (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471577#comment-13471577
 ] 

Max Dzyuba commented on NUTCH-827:
--

Hi Jasper,

Thanks, removing that line fixed the exception problem.
At the moment, the log file doesn't have any errors related to HTTPclient 
plugin or authentication process. However, my tests show that the cookie can't 
be read by the test auth page I've set up.

Is there an easy way to verify if the cookie was created by Nutch and stored as 
intended?


Thanks,
Max

 HTTP POST Authentication
 

 Key: NUTCH-827
 URL: https://issues.apache.org/jira/browse/NUTCH-827
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.1, nutchgora
Reporter: Jasper van Veghel
Priority: Minor
  Labels: authentication
 Fix For: 1.6

 Attachments: nutch-http-cookies.patch


 I've created a patch against the trunk which adds very rudimentary 
 POST-based authentication support. It takes a link from nutch-site.xml 
 with a site to POST to and its respective parameters (username, password, 
 etc.). It then checks upon every request whether any cookies have been 
 initialized, and if none have, it fetches them from the given link.
 This isn't perfect but Works For Me (TM) as I generally only need to retrieve 
 results from a single domain and so have no cookie overlap (i.e. if the 
 domain cookies expire, all cookies disappear from the HttpClient and I can 
 simply re-fetch them). A natural improvement would be to be able to specify 
 one particular cookie to check the expiration date against. If anyone 
 besides me is interested in this, I'd be glad to put some more effort into 
 making this more universally applicable.
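
The check-then-login flow described above could look roughly like the following. This is a sketch under stated assumptions and not the code in nutch-http-cookies.patch: the class LazyLogin, its methods, and the use of plain strings for cookies are hypothetical, and the Supplier merely stands in for the POST to the configured link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class LazyLogin {
    private final List<String> cookies = new ArrayList<>();
    // Stands in for the authentication POST; returns the cookies it yields.
    private final Supplier<List<String>> login;
    private int loginCalls = 0;

    public LazyLogin(Supplier<List<String>> login) {
        this.login = login;
    }

    /** Called before every request: authenticate only when no cookies are held. */
    public List<String> cookiesForRequest() {
        if (cookies.isEmpty()) {
            cookies.addAll(login.get());
            loginCalls++;
        }
        return new ArrayList<>(cookies);
    }

    /** Simulates all cookies of the single domain expiring at once. */
    public void expireAll() {
        cookies.clear();
    }

    public int loginCalls() {
        return loginCalls;
    }
}
```

With cookies from a single domain, an empty store is a sufficient signal to log in again. Checking one named cookie's expiration date, as suggested above, would replace the isEmpty() test.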



[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2012-10-08 Thread Jasper van Veghel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471600#comment-13471600
 ] 

Jasper van Veghel commented on NUTCH-827:
-

{code}
+Http.LOG.trace("url: " + url +
+"; status code: " + code +
+"; cookies received: " + Http.getClient().getState().getCookies().length);
{code}

If you turn on TRACE logging, you should see messages like that.




[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-08 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1475:
--

Attachment: index-more-1xand2x.patch

Attaching new patch that patches both 1.x and 2.x

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Attachments: index-more-1xand2x.patch, index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h




[jira] [Created] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1476:
--

 Summary: SegmentReader getStats should set parsed = -1 if no 
parsing took place
 Key: NUTCH-1476
 URL: https://issues.apache.org/jira/browse/NUTCH-1476
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.6
 Attachments: NUTCH-1476.patch

The method getStats in SegmentReader sets the number of parsed documents (and 
also the number of parseErrors) to 0 if no parsing took place for a segment. 
The values should be set to -1, analogous to the number of fetched docs and 
fetchErrors.
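
The point of the -1 sentinel is to distinguish "no parsing took place" from "parsing took place and nothing was parsed". A minimal sketch follows. It is not the SegmentReader code or the attached patch: the class SegmentStatsSketch, the Stats holder, and the boolean array input are assumptions for illustration, and the way successes and errors are counted here may differ from SegmentReader.

```java
public class SegmentStatsSketch {
    /** Hypothetical holder; -1 means "this step did not run for the segment". */
    public static class Stats {
        public long parsed = -1;
        public long parseErrors = -1;
    }

    /**
     * parseOutcomes is null when the segment has no parse data at all.
     * Otherwise each element is true for a successful parse and false
     * for a parse error.
     */
    public static Stats compute(boolean[] parseOutcomes) {
        Stats stats = new Stats();
        if (parseOutcomes == null) {
            return stats; // no parsing took place: both values stay at -1
        }
        stats.parsed = 0;
        stats.parseErrors = 0;
        for (boolean ok : parseOutcomes) {
            if (ok) {
                stats.parsed++;
            } else {
                stats.parseErrors++;
            }
        }
        return stats;
    }
}
```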



[jira] [Updated] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place

2012-10-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1476:
---

Attachment: NUTCH-1476.patch

 SegmentReader getStats should set parsed = -1 if no parsing took place
 --

 Key: NUTCH-1476
 URL: https://issues.apache.org/jira/browse/NUTCH-1476
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.6

 Attachments: NUTCH-1476.patch





[jira] [Assigned] (NUTCH-1252) SegmentReader -get shows wrong data

2012-10-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1252:
--

Assignee: Sebastian Nagel

 SegmentReader -get shows wrong data
 ---

 Key: NUTCH-1252
 URL: https://issues.apache.org/jira/browse/NUTCH-1252
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.6

 Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch


 The command/option -get of the SegmentReader may show wrong data associated 
 with the given URL. 
 To reproduce:
 {code}
 % mkdir -p test_readseg/urls
 % echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds
 % nutch inject test_readseg/crawldb test_readseg/urls
 Injector: starting at 2012-01-18 09:32:25
 Injector: crawlDb: test_readseg/crawldb
 Injector: urlDir: test_readseg/urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
 % nutch generate test_readseg/crawldb test_readseg/segments/
 Generator: starting at 2012-01-18 09:32:30
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: test_readseg/segments/20120118093232
 Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
 % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' 
 -nocontent -noparse -nofetch -noparsedata -noparsetext
 SegmentReader: get 'http://nutch.apache.org/'
 Crawl Generate::
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Wed Jan 18 09:32:26 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 10.0
 Signature: null
 Metadata: _ngt_: 1326875550401test: AbcTest
 {code}
 The metadata and the score indicate that the CrawlDatum shown is the wrong 
 one (the one associated with http://abc.test/, not with http://nutch.apache.org/).



[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-08 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471915#comment-13471915
 ] 

Sebastian Nagel commented on NUTCH-1344:


Is there any reason why https should be treated differently from http (and ftp)?

 BasicURLNormalizer to normalize https same as http 
 ---

 Key: NUTCH-1344
 URL: https://issues.apache.org/jira/browse/NUTCH-1344
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 1.6
Reporter: Sebastian Nagel
 Attachments: NUTCH-1344.patch


 Most of the normalization done by BasicURLNormalizer (lowercasing host, 
 removing default port, removal of page anchors, cleaning . and .. in the path) 
 is not done for URLs with protocol https.
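
To illustrate what treating https the same as http would mean, here is a minimal sketch built on java.net.URI. It is not BasicURLNormalizer and not the attached patch: the class SchemeAgnosticNormalizer and its methods are hypothetical, and rebuilding the URI from decoded parts can alter percent-encoding, which a real normalizer would have to handle.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class SchemeAgnosticNormalizer {
    /** Default port per handled scheme, or -1 if the scheme is left untouched. */
    static int defaultPort(String scheme) {
        switch (scheme) {
            case "http": return 80;
            case "https": return 443;
            case "ftp": return 21;
            default: return -1;
        }
    }

    /** Applies the same steps to http, https and ftp URLs. */
    public static String normalize(String url) {
        URI u = URI.create(url).normalize(); // resolves "." and ".." path segments
        String scheme = u.getScheme() == null ? "" : u.getScheme().toLowerCase(Locale.ROOT);
        if (defaultPort(scheme) < 0 || u.getHost() == null) {
            return url; // unknown scheme or no host: leave as-is
        }
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort() == defaultPort(scheme) ? -1 : u.getPort();
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        try {
            // A null fragment drops the page anchor.
            return new URI(scheme, u.getUserInfo(), host, port, path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Because the scheme is only used to look up the default port, none of the steps depends on whether the URL is http or https.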



Jenkins build is back to normal : Nutch-nutchgora #373

2012-10-08 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/373/