date:20130613

RE: [DISCUSS] Nutch 1.7 ready for release?

2013-06-13 Thread Markus Jelsma

+1

-Original message-
 From:Sebastian Nagel wastl.na...@googlemail.com
 Sent: Sun 09-Jun-2013 14:05
 To: dev@nutch.apache.org
 Subject: Re: [DISCUSS] Nutch 1.7 ready for release?

 +1 go ahead!

 Sebastian

 On 06/08/2013 11:53 PM, Lewis John Mcgibbney wrote:
  Thread says it all troops.
  Best
  Lewis

[jira] [Created] (NUTCH-1581) CrawlDB csv output to include metadata

2013-06-13 Thread Markus Jelsma (JIRA)

Markus Jelsma created NUTCH-1581:


 Summary: CrawlDB csv output to include metadata
 Key: NUTCH-1581
 URL: https://issues.apache.org/jira/browse/NUTCH-1581
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8


Dumping the CrawlDB to CSV should include the CrawlDatum's metadata.
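
As a rough illustration of the request (this is not the attached patch and not Nutch code), the sketch below assumes each CrawlDB record is available as a URL, a status, a score and a metadata map. The column names and the `key=value;...` encoding of the metadata column are assumptions made for this example only.

```python
import csv
import io


def crawldb_row(url, status, score, metadata):
    """Flatten one hypothetical CrawlDB record into a CSV row.

    The metadata dict is serialised as 'key=value' pairs joined by ';'
    (an encoding chosen for this sketch, not Nutch's actual format).
    """
    meta = ";".join(f"{k}={v}" for k, v in sorted(metadata.items()))
    return [url, status, score, meta]


def dump_csv(records):
    """Render (url, status, score, metadata) tuples as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["Url", "Status", "Score", "Metadata"])
    for rec in records:
        writer.writerow(crawldb_row(*rec))
    return out.getvalue()
```

Packing the metadata into a single column keeps the number of columns fixed even when records carry different metadata keys.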

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-1581) CrawlDB csv output to include metadata

2013-06-13 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1581:
-

Attachment: NUTCH-1581-1.8.patch

Patch for 1.8.


[jira] [Commented] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule

2013-06-13 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682145#comment-13682145
 ] 

Markus Jelsma commented on NUTCH-1430:
--

If there are no objections, I'd like to get this in for 1.7; this is a 
showstopper for everyone using the FreeGenerator. We've been using this patch 
in our distribution for many months now and are happy with it.

 Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
 --

 Key: NUTCH-1430
 URL: https://issues.apache.org/jira/browse/NUTCH-1430
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.8

 Attachments: NUTCH-1430-1.6-1.patch, NUTCH-1430-1.6-2.patch


 Steps to reproduce:
 Without AdaptiveFetchSchedule:
 {code}
 $ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
 URL: http://www.openindex.io/en/home.html
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Thu Aug 16 13:58:23 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 0.0
 Signature: c2601ca503f2fc5edcb286501d7fb271
 Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
 {code}
 With AdaptiveFetchSchedule:
 {code}
 $ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
 URL: http://www.openindex.io/en/home.html
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Tue Jul 17 13:56:33 CEST 2012
 Modified time: Tue Jul 17 13:55:33 CEST 2012
 Retries since fetch: 0
 Retry interval: 60 seconds (0 days)
 Score: 0.0
 Signature: 23567bb52ee8b905b8649c4305ed82ee
 Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
 {code}


[jira] [Commented] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule

2013-06-13 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682160#comment-13682160
 ] 

Sebastian Nagel commented on NUTCH-1430:


+1


[jira] [Resolved] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule

2013-06-13 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1430.
--

Resolution: Fixed

Committed for 1.7 in rev. 1492639.


[jira] [Commented] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule

2013-06-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682226#comment-13682226
 ] 

Hudson commented on NUTCH-1430:
---

Integrated in Nutch-trunk #2238 (See 
[https://builds.apache.org/job/Nutch-trunk/2238/])
NUTCH-1430 Freegenerator records overwrite CrawlDB records with 
AdaptiveFetchSchedule (Revision 1492639)

 Result = SUCCESS
markus : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1492639
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/tools/FreeGenerator.java



Jenkins build is back to normal : Nutch-trunk #2238

2013-06-13 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/2238/changes

[jira] [Updated] (NUTCH-1327) QueryStringNormalizer

2013-06-13 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1327:
-

Attachment: NUTCH-1327-1.8-1.patch

Patch for trunk. It rebuilds the URL with querystring parameters properly 
sorted.
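
The idea of rebuilding a URL with sorted query parameters can be sketched as follows. This is a minimal Python illustration using only the standard library, not the code in the attached patch; the function name `sort_query_string` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query_string(url):
    """Return url with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    # keep_blank_values=True so that a parameter such as 'a=' is not dropped
    params = parse_qsl(parts.query, keep_blank_values=True)
    params.sort()
    # Rebuild the URL with the re-encoded, sorted query string
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With this, `http://example.com/page?b=2&a=1` and `http://example.com/page?a=1&b=2` normalize to the same string, so they are no longer treated as two different pages. Note that re-encoding may also change how special characters in parameter values are escaped.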

 QueryStringNormalizer
 -

 Key: NUTCH-1327
 URL: https://issues.apache.org/jira/browse/NUTCH-1327
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1327-1.8-1.patch


 A normalizer for dealing with query strings. Sorting query strings is helpful 
 in preventing duplicates for some (bad) websites.


[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-13 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1527:
-

Attachment: NUTCH-1527.patch

Here's a new patch for trunk. I still need to actually test it against an ES 
instance, but there will probably be a working patch next week.

Perhaps it can still be released with 1.7.

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch, NUTCH-1527.patch


 The source repos for this can be found here [0].
 This issue should be inline with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer


[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-13 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1527:
-

Attachment: NUTCH-1527.patch

New patch. You do need to have a config/names.txt file in your runtime/local 
(I don't know why). I also had to update Solr's deps to make sure all Lucene 
jars are at 4.3.0, otherwise everything will fail! After adding indexer-elastic 
to plugin.includes you can index with:

bin/nutch index -Delastic.cluster=nutch crawl/crawldb/ crawl/segments/20130613162613/

There's one problem I can't figure out right now:
{code}
2013-06-13 17:51:40,205 INFO  elasticsearch.node - [nutch] {0.90.1}[1001]: initializing ...
2013-06-13 17:51:40,275 WARN  mapred.LocalJobRunner - job_local1865023617_0001
java.lang.LinkageError: loader constraint violation: loader (instance of sun/misc/Launcher$AppClassLoader) previously initiated loading for a different type with name org/elasticsearch/env/Environment
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:787)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:447)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    at org.elasticsearch.plugins.PluginsHelper.sitePlugins(PluginsHelper.java:39)
    at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:94)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:128)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
    at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.open(ElasticIndexWriter.java:73)
    at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:78)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:449)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:491)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2013-06-13 17:51:40,732 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
{code}

Any pointers are much appreciated!


[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-06-13 Thread lufeng (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682380#comment-13682380
 ] 

lufeng commented on NUTCH-1527:
---

Hi Markus,

1. Elasticsearch loads its configuration file first, so you need to add 
config/elasticsearch.yml in your runtime/local/config. But I haven't found any 
way to load that file through the configuration.

2. Do you still have lucene-core-3.4.jar in your runtime/local/lib directory? 
Or did you add this

{code:xml}
+  <dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.1"
+    conf="*->default"/>
{code}

code to the ivy/ivy.xml file?

Maybe Elasticsearch cannot load classes in the Nutch plugin system.
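
The check for a stale Lucene jar in the lib directory can be sketched with a small script. This is a hypothetical helper, not part of Nutch; it assumes jars follow the usual `name-version.jar` naming, and it only finds one possible cause of a LinkageError (the same library present in two versions).

```python
import re
from collections import defaultdict

# Matches names like 'lucene-core-3.4.jar' -> name 'lucene-core', version '3.4'.
# Assumption: the version is the part after the last '-' that starts with a digit.
JAR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def conflicting_jars(filenames):
    """Return {artifact name: sorted versions} for artifacts seen in more than one version."""
    versions = defaultdict(set)
    for filename in filenames:
        match = JAR_RE.match(filename)
        if match:
            versions[match.group("name")].add(match.group("version"))
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

It could be run as `conflicting_jars(os.listdir("runtime/local/lib"))` after `import os`; an empty result means no artifact appears in more than one version in that directory.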



[jira] [Resolved] (NUTCH-1560) index-metadata to add all values of multivalued metadata

2013-06-13 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1560.


   Resolution: Fixed
Fix Version/s: (was: 1.8)
   1.7

Committed to trunk (r1492832) together with NUTCH-1467. Thanks [~kiranch]!

 index-metadata to add all values of multivalued metadata
 

 Key: NUTCH-1560
 URL: https://issues.apache.org/jira/browse/NUTCH-1560
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.7

 Attachments: NUTCH-1560-trunk-v1.patch


 MetadataIndexer does not add all values of multivalued meta tags. This causes 
 the fix for NUTCH-1467 to be almost useless.


[jira] [Resolved] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2013-06-13 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1467.


Resolution: Fixed

Committed to trunk (r1492832) together with NUTCH-1560. Thanks [~kiranch]!

 nutch 1.5.1 not able to parse mutliValued metatags
 --

 Key: NUTCH-1467
 URL: https://issues.apache.org/jira/browse/NUTCH-1467
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.5.1
Reporter: kiran
Priority: Minor
 Fix For: 1.9

 Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, 
 NUTCH-1467-trunk_v1.patch, NUTCH-1467-trunk_v2.patch, 
 NUTCH-1467-trunk-v3.patch, Patch_HTMLMetaProcessor.patch, 
 Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, 
 Patch_MetaTagsParser.patch, patch.txt


 Hi,
 I have been able to parse metatags in an HTML page using 
 http://wiki.apache.org/nutch/IndexMetatags. It does not work well when 
 there are two metatags with the same name but different contents. 
 Has anyone encountered this kind of issue? 
 Are there any changes that need to be made to the config files to make it 
 work?
 When there are two tags with the same name and different content, it takes the 
 value of the later tag and saves it rather than creating a multiValue field.
 Edit: I have attached the patch for the file; it is provided by DLA 
 (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
 Many Thanks,
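
The behaviour asked for here, keeping every value when a metatag name occurs more than once, can be sketched as below. This is an illustration with Python's standard HTML parser, not the Nutch parse-metatags or index-metadata code; the names `MetaTagCollector` and `collect_metatags` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs, keeping all values per name."""

    def __init__(self):
        super().__init__()
        self.metatags = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name is not None and content is not None:
            # Append instead of assigning, so a later tag does not
            # overwrite an earlier one with the same name.
            self.metatags[name.lower()].append(content)


def collect_metatags(html):
    """Return {metatag name: [all content values in document order]}."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.metatags)
```

The single design point is the `append`: storing a list per name is what turns the last-one-wins behaviour described above into a multi-valued field.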


[jira] [Commented] (NUTCH-1560) index-metadata to add all values of multivalued metadata

2013-06-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682743#comment-13682743
 ] 

Hudson commented on NUTCH-1560:
---

Integrated in Nutch-trunk #2239 (See 
[https://builds.apache.org/job/Nutch-trunk/2239/])
NUTCH-1467 and NUTCH-1560: add all values of multi-valued metatags 
(Revision 1492856)

 Result = FAILURE
snagel : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1492856
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/HTMLMetaTags.java
* /nutch/trunk/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java
* /nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java
* /nutch/trunk/src/plugin/parse-metatags/build.xml
* /nutch/trunk/src/plugin/parse-metatags/sample/testMultivalueMetatags.html
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java



Build failed in Jenkins: Nutch-trunk #2239

2013-06-13 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/2239/changes

Changes:

[snagel] NUTCH-1467 and NUTCH-1560: add all values of multi-valued metatags

--
[...truncated 3261 lines...]

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlmeta/urlmeta.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlmeta

copy-generated-lib:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlmeta

init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test/data
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test/data

init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/urlnormalizer-host.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

copy-generated-lib:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 1 source file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/urlnormalizer-pass.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

copy-generated-lib:
 [copy] Copying 1 file to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data
 [copy] Copying 4 files to 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data

init:
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes
[mkdir] Created dir: 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-regex

[jira] [Updated] (NUTCH-1486) Upgrade to Solr 4.2.1

2013-06-13 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1486:


Issue Type: Improvement  (was: Bug)

 Upgrade to Solr 4.2.1
 -

 Key: NUTCH-1486
 URL: https://issues.apache.org/jira/browse/NUTCH-1486
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.6, 2.1
 Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT and probably 2.2-SNAPSHOT
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, 
 NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, 
 NUTCH-1486-trunk.v2.patch, NUTCH-1486-trunk.v3.patch


 When attempting to configure a multicore (4 cores) Solr 4.0 instance with the 
 Nutch schema-solr4.xml file, I get the following exceptions.
 This has been discussed previously. As I see it, we have two options:
 1. Keep maintaining both schema options
 2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml
 Thoughts?
 {code}
 SEVERE: Unable to create core: collection4
 org.apache.solr.common.SolrException: Unable to use updateLog: _version_field 
 must exist in schema, using indexed=true stored=true and 
 multiValued=false (_version_ does not exist)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:721)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:566)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850)
   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
   at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
   at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
   at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
   at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
   at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
   at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
   at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
   at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
   at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
   at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
   at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
   at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
   at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
   at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
   at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
   at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
   at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
   at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
   at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
   at org.eclipse.jetty.server.Server.doStart(Server.java:263)
   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
   at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
   at java.security.AccessController.doPrivileged(Native Method)
   at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at

[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2013-06-13 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1475:


Fix Version/s: 2.3

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Fix For: 2.3, 1.8

 Attachments: index-more-1xand2x.patch, index-more-2x.patch, 
 index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Among other fields, the more plugin for Nutch 2.x provides a last modified 
 and a date field for the Solr index. The last modified field holds the last 
 modified date from the HTTP headers if available; otherwise it is left 
 empty. Currently, the date field is the same as the last modified field 
 unless that field is empty, in which case getFetchTime is used as a fallback. 
 I think getFetchTime is not a good fallback, as it is the next fetch time and 
 is often a month or more in the future, which doesn't make sense for the date 
 field. Users do not expect webpages/documents with future dates. A more 
 sensible fallback would be the current date at the time the page is indexed. 
 This is possible by simply changing line 97 of 
 https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
  from
 time = page.getFetchTime(); // use fetch time
 to
 time = new Date().getTime();
 Users interested in the getFetchTime value can still get it from the tstamp 
 field.
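
 The fallback described above can be sketched in isolation as follows. This
 is only an illustration, not the code in MoreIndexingFilter: the class name
 DateFallbackSketch, the method resolveDate, and the convention that a
 non-positive value means "no Last-Modified header" are all assumptions made
 for the example.

```java
import java.util.Date;

// Hypothetical helper for illustration; not part of the Nutch code base.
final class DateFallbackSketch {

    // Picks the value for the "date" field, in milliseconds since the epoch.
    // Assumption: lastModified <= 0 means no Last-Modified header was available.
    // The fallback is the time of indexing, not the (future) next fetch time.
    static long resolveDate(long lastModified, long indexTime) {
        return lastModified > 0 ? lastModified : indexTime;
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        // No Last-Modified header: the date field falls back to the indexing time.
        System.out.println(new Date(resolveDate(0L, now)));
        // Last-Modified header present: it is used unchanged.
        System.out.println(new Date(resolveDate(1371081600000L, now)));
    }
}
```

 Passing the indexing time in as a parameter, rather than calling
 new Date() inside the method, keeps the choice testable; the one-line
 change proposed in the issue has the same effect.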


Re: right place to put wiki images

2013-06-13 Thread Lewis John Mcgibbney

Hi,

On Wed, Jun 12, 2013 at 2:01 AM, dev-digest-h...@nutch.apache.org wrote:


 As per suggestion by Seb, I have corrected wiki at several places.

 The images over Admin UI Proposal are lost as they were hosted somewhere
 else and the site is down now :(
 http://wiki.apache.org/nutch/NutchAdministrationUserInterface



You can actually check out the code that was proposed for the GUI from here

https://github.com/101tec/nutch/wiki

It is extremely dated, and a better proposal has since been suggested. Still,
purely for motivation and graphic content, it is useful to see what the
proposed GUI looked like.
Lewis

Jenkins build is back to normal : Nutch-nutchgora #645

2013-06-13 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-nutchgora/645/

Jenkins build is back to normal : Nutch-trunk #2240

2013-06-13 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/2240/
