[jira] [Updated] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1388:
-

Attachment: NUTCH-1388-1.6-2.patch

Complete patch that actually builds against current trunk.

 Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
 ---

 Key: NUTCH-1388
 URL: https://issues.apache.org/jira/browse/NUTCH-1388
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch


 During injection a custom fetch interval can be configured but it is not 
 maintained with an AdaptiveFetchSchedule enabled. 
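
 As a rough illustration of the per-URL interval mentioned above: to my understanding the injector reads tab-separated key=value metadata from each seed line, with nutch.fetchInterval (in seconds) as the key for a custom interval. The sketch below only shows how such a line could be read. The class and method names are hypothetical and this is not the Injector's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: reads the metadata of one seed line.
class SeedLineSketch {
  // Metadata key assumed to carry the custom fetch interval (seconds).
  static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";

  // Splits "url<TAB>key=value<TAB>key=value" and returns the key/value pairs.
  static Map<String, String> parseMetadata(String line) {
    Map<String, String> metadata = new LinkedHashMap<>();
    String[] parts = line.split("\t");
    for (int i = 1; i < parts.length; i++) { // parts[0] is the URL itself
      int eq = parts[i].indexOf('=');
      if (eq > 0) {
        metadata.put(parts[i].substring(0, eq).trim(), parts[i].substring(eq + 1).trim());
      }
    }
    return metadata;
  }

  // Returns the custom interval if the line defines one, else the default.
  static int fetchInterval(String line, int defaultInterval) {
    String value = parseMetadata(line).get(FETCH_INTERVAL_KEY);
    return value == null ? defaultInterval : Integer.parseInt(value);
  }

  public static void main(String[] args) {
    String line = "http://www.example.org/\tnutch.fetchInterval=86400";
    System.out.println(fetchInterval(line, 2592000)); // prints 86400
  }
}
```

 The issue is that an interval set this way is later recalculated once an adaptive schedule is active, so the injected value does not survive.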

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1405) Allow to overwrite CrawlDatum's with injected entries

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1405:
-

Patch Info: Patch Available

 Allow to overwrite CrawlDatum's with injected entries
 -

 Key: NUTCH-1405
 URL: https://issues.apache.org/jira/browse/NUTCH-1405
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.5, 1.6
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1405-1.6-1.patch


 Injector's reducer does not permit overwriting existing CrawlDatum entries. 
 It is, however, useful to optionally overwrite so users can reset metadata 
 manually.
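
 The requested behaviour can be sketched roughly as follows. This is a simplified illustration with made-up types, not the code from the attached patch: Datum stands in for CrawlDatum, and the overwrite flag is a hypothetical name for whatever option the patch introduces.

```java
// Hypothetical sketch of an optional-overwrite merge, not Nutch's Injector reducer.
class InjectOverwriteSketch {
  // Minimal stand-in for a CrawlDatum: where it came from and its fetch interval.
  static final class Datum {
    final String origin;
    final int fetchInterval;

    Datum(String origin, int fetchInterval) {
      this.origin = origin;
      this.fetchInterval = fetchInterval;
    }
  }

  // Chooses which entry survives when a URL is both in the CrawlDb and injected.
  static Datum merge(Datum existing, Datum injected, boolean overwrite) {
    if (existing == null) return injected; // new URL: take the injected entry
    if (injected == null) return existing; // untouched URL: keep the old entry
    return overwrite ? injected : existing; // conflict: decided by the flag
  }

  public static void main(String[] args) {
    Datum existing = new Datum("crawldb", 2592000);
    Datum injected = new Datum("injected", 86400);
    System.out.println(merge(existing, injected, false).origin); // prints crawldb
    System.out.println(merge(existing, injected, true).origin);  // prints injected
  }
}
```

 With the flag off the existing entry always wins, which matches the current behaviour described above.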





[jira] [Updated] (NUTCH-1342) Read time out protocol-http

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1342:
-

Patch Info: Patch Available

 Read time out protocol-http
 ---

 Key: NUTCH-1342
 URL: https://issues.apache.org/jira/browse/NUTCH-1342
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4, 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.6

 Attachments: NUTCH-1342-1.6-1.patch


 For some reason some URLs always time out with protocol-http but not 
 with protocol-httpclient. The stack trace is always the same:
 {code}
 2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
 java.net.SocketTimeoutException: Read timed out
     at java.net.SocketInputStream.socketRead0(Native Method)
     at java.net.SocketInputStream.read(SocketInputStream.java:129)
     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
     at java.io.FilterInputStream.read(FilterInputStream.java:116)
     at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
     at java.io.FilterInputStream.read(FilterInputStream.java:90)
     at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
     at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
     at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
 {code}
 Some example URLs:
 * 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
 * 301 http://shop.fcgroningen.nl/aanbieding





[jira] [Updated] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1388:
-

Patch Info: Patch Available

 Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
 ---

 Key: NUTCH-1388
 URL: https://issues.apache.org/jira/browse/NUTCH-1388
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch


 During injection a custom fetch interval can be configured but it is not 
 maintained with an AdaptiveFetchSchedule enabled. 





[jira] [Updated] (NUTCH-1405) Allow to overwrite CrawlDatum's with injected entries

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1405:
-

Attachment: (was: NUTCH-1405-1.6-1.patch)

 Allow to overwrite CrawlDatum's with injected entries
 -

 Key: NUTCH-1405
 URL: https://issues.apache.org/jira/browse/NUTCH-1405
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.5, 1.6
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1405-1.6-2.patch


 Injector's reducer does not permit overwriting existing CrawlDatum entries. 
 It is, however, useful to optionally overwrite so users can reset metadata 
 manually.





[jira] [Updated] (NUTCH-1405) Allow to overwrite CrawlDatum's with injected entries

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1405:
-

Attachment: NUTCH-1405-1.6-2.patch

Correct patch.

 Allow to overwrite CrawlDatum's with injected entries
 -

 Key: NUTCH-1405
 URL: https://issues.apache.org/jira/browse/NUTCH-1405
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.5, 1.6
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1405-1.6-2.patch


 Injector's reducer does not permit overwriting existing CrawlDatum entries. 
 It is, however, useful to optionally overwrite so users can reset metadata 
 manually.





[jira] [Updated] (NUTCH-1405) Allow to overwrite CrawlDatum's with injected entries

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1405:
-

Attachment: (was: NUTCH-1405-1.6-2.patch)

 Allow to overwrite CrawlDatum's with injected entries
 -

 Key: NUTCH-1405
 URL: https://issues.apache.org/jira/browse/NUTCH-1405
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.5, 1.6
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1405-1.6-3.patch


 Injector's reducer does not permit overwriting existing CrawlDatum entries. 
 It is, however, useful to optionally overwrite so users can reset metadata 
 manually.





[jira] [Updated] (NUTCH-1405) Allow to overwrite CrawlDatum's with injected entries

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1405:
-

Attachment: NUTCH-1405-1.6-3.patch

This time without a debug log line!!!

 Allow to overwrite CrawlDatum's with injected entries
 -

 Key: NUTCH-1405
 URL: https://issues.apache.org/jira/browse/NUTCH-1405
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.5, 1.6
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1405-1.6-3.patch


 Injector's reducer does not permit overwriting existing CrawlDatum entries. 
 It is, however, useful to optionally overwrite so users can reset metadata 
 manually.





[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Attachment: index-metadata.patch

 Metatags-index/-parse plugin: conversion to Solr date format and prevents 
 parsing/indexing of empty tags
 

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format and prevents parsing/indexing of metatags that do not contain any 
 content.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. The example used is an extended Dublin Core 
 element dcterms.modified with the seed url http://www.cic.gc.ca/. 
 dcterms.modified must also be defined in the metatags.names property.
 {code}
 <property>
   <name>metatags.convert</name>
   <value>dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}
 I read that SimpleDateFormat is not a robust solution, so this 
 improvement might have some problems.
 So far it has worked well for me. Below are more details about the changes.
 Please note:
 The attached jar-file was originally taken from NUTCH-809 
 (https://issues.apache.org/jira/browse/NUTCH-809). The plugin and tutorial 
 there do not necessarily match the index-metadata plugin in subversion.





[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format and prevents parsing/indexing of metatags that do not contain any 
content.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. The example used is an extended Dublin Core 
element dcterms.modified with the seed url http://www.cic.gc.ca/. 
dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
  <name>index.dateconvert.md</name>
  <value>dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format and prevents parsing/indexing of metatags that do not contain any 
content.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. The example used is an extended Dublin Core 
element dcterms.modified with the seed url http://www.cic.gc.ca/. 
dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
  <name>metatags.convert</name>
  <value>dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}

I read that SimpleDateFormat is not a robust solution, so this 
improvement might have some problems.
So far it has worked well for me. Below are more details about the changes.

Please note:
The attached jar-file was originally taken from NUTCH-809 
(https://issues.apache.org/jira/browse/NUTCH-809). The plugin and tutorial 
there do not necessarily match the index-metadata plugin in subversion.


 Metatags-index/-parse plugin: conversion to Solr date format and prevents 
 parsing/indexing of empty tags
 

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format and prevents parsing/indexing of metatags that do not contain any 
 content.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. The example used is an extended Dublin Core 
 element dcterms.modified with the seed url http://www.cic.gc.ca/. 
 dcterms.modified must also be defined in the metatags.names property.
 {code}
 <property>
   <name>index.dateconvert.md</name>
   <value>dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}





[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Attachment: (was: index-metadata-plugin.patch)

 Metatags-index/-parse plugin: conversion to Solr date format and prevents 
 parsing/indexing of empty tags
 

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format and prevents parsing/indexing of metatags that do not contain any 
 content.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. The example used is an extended Dublin Core 
 element dcterms.modified with the seed url http://www.cic.gc.ca/. 
 dcterms.modified must also be defined in the metatags.names property.
 {code}
 <property>
   <name>metatags.convert</name>
   <value>dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}
 I read that SimpleDateFormat is not a robust solution, so this 
 improvement might have some problems.
 So far it has worked well for me. Below are more details about the changes.
 Please note:
 The attached jar-file was originally taken from NUTCH-809 
 (https://issues.apache.org/jira/browse/NUTCH-809). The plugin and tutorial 
 there do not necessarily match the index-metadata plugin in subversion.





[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. The example used is an extended Dublin Core 
element dcterms.modified with the seed url http://www.cic.gc.ca/. 
dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
  <name>index.dateconvert.md</name>
  <value>dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format and prevents parsing/indexing of metatags that do not contain any 
content.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. The example used is an extended Dublin Core 
element dcterms.modified with the seed url http://www.cic.gc.ca/. 
dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
  <name>index.dateconvert.md</name>
  <value>dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}


Summary: metadata-index plugin: conversion to Solr date format  (was: 
Metatags-index/-parse plugin: conversion to Solr date format and prevents 
parsing/indexing of empty tags)

 metadata-index plugin: conversion to Solr date format
 -

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. The example used is an extended Dublin Core 
 element dcterms.modified with the seed url http://www.cic.gc.ca/. 
 dcterms.modified must also be defined in the metatags.names property.
 {code}
 <property>
   <name>index.dateconvert.md</name>
   <value>dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}





[jira] [Commented] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Kristof (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399222#comment-13399222
 ] 

Kristof  commented on NUTCH-1406:
-

Thank you for the clarification. When I originally looked for a plugin to index 
metadata early this year, the index-metatags was the one available. Hence I 
developed based on this, only realizing after trying to get it working with 
trunk that something did not add up. Obviously building on the committed 
index-metadata version is the way to go. I attached the hopefully correct way 
to patch it, and removed the wrong version and any information that might be 
misleading. I was not able to make extensive tests though as this was done 
using the version initially posted in NUTCH-809.

 metadata-index plugin: conversion to Solr date format
 -

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. The example used is an extended Dublin Core 
 element dcterms.modified with the seed url http://www.cic.gc.ca/. 
 dcterms.modified must also be defined in the metatags.names property.
 {code}
 <property>
   <name>index.dateconvert.md</name>
   <value>dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}





[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
elements such as dcterms.modified, with the seed url http://www.cic.gc.ca. 
dcterms.modified must also be defined in the metatags.names and index.parse.md 
properties.
{code}
<property>
  <name>index.dateconvert.md</name>
  <value>metatag.dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. The example used is an extended Dublin Core 
element dcterms.modified with the seed url http://www.cic.gc.ca/. 
dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
  <name>index.dateconvert.md</name>
  <value>dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}



 metadata-index plugin: conversion to Solr date format
 -

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
 elements such as dcterms.modified, with the seed url http://www.cic.gc.ca. 
 dcterms.modified must also be defined in the metatags.names and 
 index.parse.md properties.
 {code}
 <property>
   <name>index.dateconvert.md</name>
   <value>metatag.dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}





[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
elements. One domain whose pages carry the dcterms.modified meta tag 
is cic.gc.ca. dcterms.modified must also be defined in the metatags.names 
and index.parse.md properties.

{code}
<property>
  <name>index.dateconvert.md</name>
  <value>metatag.dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
elements such as dcterms.modified, with the seed url http://www.cic.gc.ca. 
dcterms.modified must also be defined in the metatags.names and index.parse.md 
properties.
{code}
<property>
  <name>index.dateconvert.md</name>
  <value>metatag.dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}



 metadata-index plugin: conversion to Solr date format
 -

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
 elements. One domain whose pages carry the dcterms.modified meta tag 
 is cic.gc.ca. dcterms.modified must also be defined in 
 the metatags.names and index.parse.md properties.

 {code}
 <property>
   <name>index.dateconvert.md</name>
   <value>metatag.dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}





[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format. The main benefit of this conversion is the possibility to create 
range facets.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
elements. One domain whose pages carry the dcterms.modified meta tag 
is cic.gc.ca. dcterms.modified must also be defined in the metatags.names 
and index.parse.md properties.

{code}
<property>
  <name>index.dateconvert.md</name>
  <value>metatag.dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the 
parse-metatags plugin) allows for conversion of selected fields to the Solr 
date format.

In order to convert the values of selected metatags to Solr date format, you 
must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
elements. One domain whose pages carry the dcterms.modified meta tag 
is cic.gc.ca. dcterms.modified must also be defined in the metatags.names 
and index.parse.md properties.

{code}
<property>
  <name>index.dateconvert.md</name>
  <value>metatag.dcterms.modified</value>
  <description>For plugin index-metadata: Indicate here the name of the 
  html meta tag that should be converted to Solr date format.
  </description>
</property>
{code}



 metadata-index plugin: conversion to Solr date format
 -

 Key: NUTCH-1406
 URL: https://issues.apache.org/jira/browse/NUTCH-1406
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Kristof 
Priority: Minor
  Labels: conversion, date
 Attachments: index-metadata.patch


 This improvement to the index-metatags plugin (sometimes also referred to as the 
 parse-metatags plugin) allows for conversion of selected fields to the Solr 
 date format. The main benefit of this conversion is the possibility to create 
 range facets.
 In order to convert the values of selected metatags to Solr date format, you 
 must specify them in nutch-site.xml. This can be used, for example, with Dublin Core 
 elements. One domain whose pages carry the dcterms.modified meta tag 
 is cic.gc.ca. dcterms.modified must also be defined in 
 the metatags.names and index.parse.md properties.

 {code}
 <property>
   <name>index.dateconvert.md</name>
   <value>metatag.dcterms.modified</value>
   <description>For plugin index-metadata: Indicate here the name of the 
   html meta tag that should be converted to Solr date format.
   </description>
 </property>
 {code}





[jira] [Commented] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399228#comment-13399228
 ] 

Markus Jelsma commented on NUTCH-1406:
--

Hello, a few notes on your patch:
* Nutch uses two spaces for a single level of indentation, not tabs;
* convertIndicatior seems to be misspelled;
* yyyy-MM-dd doesn't look like a format supported by Solr's DateField, as it's 
missing the time and the timezone designator Z.
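
To illustrate the point about the date format, here is a minimal sketch (not the code from the attached patch) of turning a date-only value into the full form that Solr's DateField expects. The class name SolrDateSketch and the helper toSolrDate are made up for illustration, and pinning both formatters to UTC is an assumption about the desired behaviour.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class SolrDateSketch {

  // Hypothetical helper: parses a date-only value (yyyy-MM-dd) and renders it
  // in the full Solr form yyyy-MM-dd'T'HH:mm:ss'Z'.
  public static String toSolrDate(String dateOnly) {
    TimeZone utc = TimeZone.getTimeZone("UTC");

    SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
    in.setLenient(false);
    in.setTimeZone(utc); // assumption: the missing time means 00:00:00 UTC

    SimpleDateFormat out =
        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
    out.setTimeZone(utc);

    try {
      Date parsed = in.parse(dateOnly);
      return out.format(parsed);
    } catch (ParseException e) {
      throw new IllegalArgumentException("Not a yyyy-MM-dd date: " + dateOnly, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toSolrDate("2012-06-22")); // prints 2012-06-22T00:00:00Z
  }
}
```

Setting the time zone on the input formatter is what decides which instant "midnight" refers to; without it, SimpleDateFormat falls back to the JVM's default zone.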





[jira] [Commented] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399250#comment-13399250
 ] 

Julien Nioche commented on NUTCH-1406:
--

BTW we have formatting rules for Eclipse in the NutchGora branch (see 
eclipse-codeformat.xml). We could add this to trunk as well.





[jira] [Created] (NUTCH-1408) RobotRulesParser main doesn't take URL's

2012-06-22 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1408:


 Summary: RobotRulesParser main doesn't take URL's
 Key: NUTCH-1408
 URL: https://issues.apache.org/jira/browse/NUTCH-1408
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


lib-http's org.apache.nutch.protocol.http.api.RobotRulesParser main() takes a 
robots file and a URL file, according to its usage output. However, it expects 
URI paths rather than URLs and will therefore never work if the input contains 
URLs.





[jira] [Updated] (NUTCH-1408) RobotRulesParser main doesn't take URL's

2012-06-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1408:
-

Attachment: NUTCH-1408-1.6-1.patch

Patch turns the input into URL objects, which are handled properly.
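
For readers without the patch at hand, the difference between a URL and the path that robots rules are matched against can be sketched as follows. This is not taken from the patch: the class RobotsInputSketch and the helper pathOf are hypothetical, and falling back to the raw input when no protocol is present is an assumption made only for this illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RobotsInputSketch {

  // Hypothetical helper: returns the path (plus query string, if any) that
  // robots.txt rules would be compared against. Accepts a full URL or a path.
  public static String pathOf(String input) {
    try {
      URL url = new URL(input);
      String file = url.getFile(); // path plus "?query" when present
      return file.isEmpty() ? "/" : file;
    } catch (MalformedURLException e) {
      // No protocol given: assume the caller already passed a bare path.
      return input;
    }
  }

  public static void main(String[] args) {
    System.out.println(pathOf("http://example.com/a/b.html?x=1"));
    System.out.println(pathOf("/private/page.html"));
  }
}
```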





[jira] [Created] (NUTCH-1409) Remove deprecated properties in nutch-default.xml

2012-06-22 Thread Matthias Agethle (JIRA)
Matthias Agethle created NUTCH-1409:
---

 Summary: Remove deprecated properties in nutch-default.xml
 Key: NUTCH-1409
 URL: https://issues.apache.org/jira/browse/NUTCH-1409
 Project: Nutch
  Issue Type: Improvement
Reporter: Matthias Agethle
Priority: Minor
 Fix For: 1.6


1) Remove deprecated properties from nutch-default.xml (generate.max.per.host 
and db.default.fetch.interval).

2) The already removed properties generate.max.per.host.by.ip and 
db.max.fetch.interval are still used in source code.
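
One way to find the leftover usages mentioned in point 2 is a simple scan for the property names as string literals. The sketch below is only an illustration under that assumption; the class DeprecatedPropertySketch and the method findDeprecated are hypothetical and not part of Nutch. The four property names are the ones listed in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeprecatedPropertySketch {

  // Property names taken from the issue description.
  static final List<String> DEPRECATED = Arrays.asList(
      "generate.max.per.host.by.ip",
      "generate.max.per.host",
      "db.default.fetch.interval",
      "db.max.fetch.interval");

  // Returns the deprecated names that occur in the given source line as
  // complete, double-quoted string literals.
  public static List<String> findDeprecated(String sourceLine) {
    List<String> hits = new ArrayList<>();
    for (String name : DEPRECATED) {
      if (sourceLine.contains("\"" + name + "\"")) {
        hits.add(name);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String line = "int max = conf.getInt(\"generate.max.per.host\", -1);";
    System.out.println(findDeprecated(line));
  }
}
```

Matching the quoted literal rather than the bare name keeps generate.max.per.host from also matching inside generate.max.per.host.by.ip.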





[jira] [Updated] (NUTCH-1409) Remove deprecated properties in nutch-default.xml

2012-06-22 Thread Matthias Agethle (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Agethle updated NUTCH-1409:


Attachment: NUTCH-1409.patch

Patch for trunk (rev 1352896)





[jira] [Updated] (NUTCH-1409) Remove deprecated properties in nutch-default.xml

2012-06-22 Thread Matthias Agethle (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Agethle updated NUTCH-1409:


Patch Info: Patch Available





[jira] [Commented] (NUTCH-1408) RobotRulesParser main doesn't take URL's

2012-06-22 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399356#comment-13399356
 ] 

Lewis John McGibbney commented on NUTCH-1408:
-

+1





Re: 1.5.1 release

2012-06-22 Thread Mattmann, Chris A (388J)
Hey Guys,

(sorry for the top post)

There's no reason to freeze trunk during releases. In fact, during the RC, once 
the branch (or tag for that matter)
is created, trunk can continue on, no need to stop. Heck, we can always just 
tag or branch from a specific 
revision too so it's not really a biggie.

Cheers,
Chris

On Jun 21, 2012, at 2:43 PM, Lewis John Mcgibbney wrote:

 Hi Markus,
 
 On Thu, Jun 21, 2012 at 10:02 PM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 It's still not clear to me what 1.5.1 is going to look like. Will it be 
 current trunk incl. the script bugfix or just 1.5 plus the bugfix? I would 
 vote for the latter as it makes more sense for a bugfix release.
 
 I am easy on this one... I suggest we do it the normal way. Let's let
 folks chime in and see where we are on Saturday. It looks like 2.0 is
 going to be shifted with the new commits so do we wish to try and keep
 at least the minimal consistency between both releases?
 
 
 There is another debate behind this, in my opinion, about freezing trunk 
 prior to releases and thus stopping active development. This has been an 
 issue in the past. Is this something for another thread?
 
 
 Yeah I must also agree that we should branch trunk, keep the branch
 for the release then run the RC's from the branch regardless of how
 trunk comes on. My only suggestion for backporting patches from trunk
 to the release candidate branch is if it is a pretty critical bug fix
 as we've now discovered in 1.5!
 
 Additionally there is another note here as well w.r.t release
 managers. We've relied on the excellent work done by Chris (and
 others) as RM's for a number of releases but during the release period
 (on occasion, more recently) as you mention trunk has frozen
 temporarily. Of course it is the aim to prevent this happening should
 the RC not progress as we would all like. Hopefully we are moving
 towards a more adaptable and sustainable RM process within Nutch where
 the RM responsibility can be undertaken/overseen by more than one
 individual over the entire duration of the process. I think (and hope)
 we can consider the slight struggle we've had for 1.5 as an exception.
 As far back as I can remember RC's have always been efficient and
 smooth and I personally am committed to ensuring we return to the high
 precedent set by previous RM's.
 We've also seen an alternative (and in my opinion an improved)
 publication of Nutch artifacts for 1.5. For reference I direct you to
 Julien's commentary [0] on this topic. Due to this, we've had to run
 additional RC's which has taken a bit longer than usual and I must
 personally apologise to everyone for at least one RC cock up which
 could have been avoided had I been more familiar with the Nutch
 specific release process.
 
 I think I'm ranting here so I'm going to give it a bye now.
 
 Lewis
 
 [0] http://digitalpebble.blogspot.co.uk/2012/06/whats-new-in-nutch-15.html


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Nutch 2.0 Press Announcement

2012-06-22 Thread Sally Khudairi
Hello Lewis --great to hear from you, as always. Hello Nutch DevTeam!

Of course; I'm happy to help. What's your timeframe?

Traditionally, these sorts of announcements are usually something I work with 
the PMC on, vs. dev (no offense, folks, it's more of an issue of public 
exposure prior to the announcement being made). Whatever works best for you is 
fine...I'm flexible.

Having said that, what is your timeframe? In other words, has v2.0 already been 
released (I hope not!)? Also, if you would like to include supporting 
testimonial quotes from highly-visible users (organizations), we are going to 
have to plan to set aside at least a week for those to come in (some companies 
have strict vetting/clearance requirements by their legal teams).

And finally, in an ideal situation, we'll work on the announcement together 
(usually there's a point-person assigned to take the lead on this, and we'll 
run drafts by the list during the final editing stages) so I can get a better 
grasp of the project and be able to highlight what's new/important/sexy/*.

Thanks again. I look forward to working with y'all <g>

Chat soon,
Sally
 




 From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: Sally Khudairi s...@apache.org 
Cc: dev@nutch.apache.org 
Sent: Thursday, 21 June 2012, 16:49
Subject: Nutch 2.0 Press Announcement
 
Good Evening Sally,

First and foremost I hope you are keeping well and that the beginning
of the summer has been kind to you... all the good weather still to
come not to worry :0)

The reason I contact you is that we (the Apache Nutch community) are
nearly ready to release Nutch 2.0 which represents a pretty
significant milestone for Apache Nutch as a project. Although Nutch
2.0 is not considered mainstream development (a decision made by
the PMC some time ago) it still marks a real step forward for the
project as a whole and also pays serious merit to users, developers
and committers past and present. Due to these reasons I think it
would be excellent for the community if we could really get the
message out that the project is rocking in addition to the fact that
it is an excellent, well followed, vibrant TLP within the foundation.

I wonder if it would be possible for us to get a formal press
announcement constructed based on input from ourselves in
collaboration with your experience in this area?

I am coming into the official press releases from an almost blind
tangent so would really appreciate your guidance and input on this one
if possible.

Thanks in advance for any input you have.

Best

Lewis

N.B Please anyone from dev@ chime in on this thread. I personally feel
the better an announcement, the more our community grows. Thank you




[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Attachment: index-metadata_formatted.patch

Formatting done (correct?), spelling error corrected. Regarding the format: you 
are right that Solr uses the date format yyyy-mm-ddThh:mm:ss.mmmZ. The 
SimpleDateFormat pattern used, yyyy-MM-dd, correctly converts to 
yyyy-mm-ddThh:mm:ss.mmmZ, but for dates only. I did not consider time when 
using it, as the fields I am looking at only have dates. The conversion 
basically adds time information by interpreting the missing time as 00:00:00 
and converting it to UTC based on the time zone settings of the machine used in 
the process. I just tested with some altered files into which I included time 
information, and with several SimpleDateFormat patterns, trying to find one 
which works. So far I have not found any that works. With a pattern going 
beyond yyyy-MM-dd, the original field values that only have a date are not 
converted. So it seems this solution is limited to dates only.
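
A possible way around the single-pattern limitation is to try several patterns in turn, most specific first. The sketch below only illustrates that idea and is not part of the attached patch: the class MultiPatternDateSketch, the helper toSolrDate and the particular list of patterns are all assumptions, as is pinning the formatters to UTC instead of the machine's time zone.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class MultiPatternDateSketch {

  // Candidate input patterns, most specific first (illustrative choices).
  static final String[] PATTERNS = {
    "yyyy-MM-dd'T'HH:mm:ss",
    "yyyy-MM-dd HH:mm:ss",
    "yyyy-MM-dd"
  };

  // Returns the value as yyyy-MM-dd'T'HH:mm:ss'Z', or null if no pattern
  // matches the whole input.
  public static String toSolrDate(String raw) {
    TimeZone utc = TimeZone.getTimeZone("UTC");
    for (String pattern : PATTERNS) {
      SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ROOT);
      in.setLenient(false);
      in.setTimeZone(utc); // assumption: inputs without a zone are UTC
      ParsePosition pos = new ParsePosition(0);
      Date parsed = in.parse(raw, pos);
      // Accept the pattern only if it consumed the entire input.
      if (parsed != null && pos.getIndex() == raw.length()) {
        SimpleDateFormat out =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        out.setTimeZone(utc);
        return out.format(parsed);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(toSolrDate("2012-06-22"));
    System.out.println(toSolrDate("2012-06-22T14:30:00"));
  }
}
```

A pattern that includes time fields fails on a date-only value because the parser cannot find the time part, which matches the behaviour described above; falling through to yyyy-MM-dd covers that case.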





[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

2012-06-22 Thread Kristof (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristof  updated NUTCH-1406:


Attachment: (was: index-metadata.patch)





Build failed in Jenkins: Nutch-nutchgora #289

2012-06-22 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/289/

--
Started by timer
Building remotely on solaris1 in workspace https://builds.apache.org/job/Nutch-nutchgora/ws/
hudson.util.IOException2: remote file operation failed: https://builds.apache.org/job/Nutch-nutchgora/ws/ at hudson.remoting.Channel@30e3f2e6:solaris1
    at hudson.FilePath.act(FilePath.java:838)
    at hudson.FilePath.act(FilePath.java:824)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1242)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:589)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494)
    at hudson.model.Run.execute(Run.java:1460)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:239)
Caused by: java.io.IOException: Remote call on solaris1 failed
    at hudson.remoting.Channel.call(Channel.java:655)
    at hudson.FilePath.act(FilePath.java:831)
    ... 11 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
    at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    at java.lang.Class.getDeclaredFields0(Native Method)
    at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
    at java.lang.Class.getDeclaredField(Class.java:1852)
    at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
    at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
    at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:400)
    at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
    at hudson.remoting.UserRequest.perform(UserRequest.java:98)
    at hudson.remoting.UserRequest.perform(UserRequest.java:48)
    at hudson.remoting.Request$2.run(Request.java:287)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
    at 

Build failed in Jenkins: Nutch-trunk #1877

2012-06-22 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1877/

--
Started by timer
Building remotely on solaris1 in workspace https://builds.apache.org/job/Nutch-trunk/ws/
hudson.util.IOException2: remote file operation failed: https://builds.apache.org/job/Nutch-trunk/ws/ at hudson.remoting.Channel@30e3f2e6:solaris1
    at hudson.FilePath.act(FilePath.java:838)
    at hudson.FilePath.act(FilePath.java:824)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1242)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:589)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494)
    at hudson.model.Run.execute(Run.java:1460)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:239)
Caused by: java.io.IOException: Remote call on solaris1 failed
    at hudson.remoting.Channel.call(Channel.java:655)
    at hudson.FilePath.act(FilePath.java:831)
    ... 11 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
    at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    at java.lang.Class.getDeclaredFields0(Native Method)
    at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
    at java.lang.Class.getDeclaredField(Class.java:1852)
    at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
    at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
    at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:400)
    at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
    at hudson.remoting.UserRequest.perform(UserRequest.java:98)
    at hudson.remoting.UserRequest.perform(UserRequest.java:48)
    at hudson.remoting.Request$2.run(Request.java:287)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
    at