RE: [DISCUSS] Nutch 1.7 ready for release?
+1

-----Original message-----
From: Sebastian Nagel wastl.na...@googlemail.com
Sent: Sun 09-Jun-2013 14:05
To: dev@nutch.apache.org
Subject: Re: [DISCUSS] Nutch 1.7 ready for release?

+1 go ahead!

Sebastian

On 06/08/2013 11:53 PM, Lewis John Mcgibbney wrote:
Thread says it all troops.
Best
Lewis
[jira] [Created] (NUTCH-1581) CrawlDB csv output to include metadata
Markus Jelsma created NUTCH-1581:

Summary: CrawlDB csv output to include metadata
Key: NUTCH-1581
URL: https://issues.apache.org/jira/browse/NUTCH-1581
Project: Nutch
Issue Type: Improvement
Components: crawldb
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8

Dumping the CrawlDB to CSV should include the CrawlDatum's metadata.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1581) CrawlDB csv output to include metadata
[ https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1581:

Attachment: NUTCH-1581-1.8.patch

Patch for 1.8.
[jira] [Commented] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682145#comment-13682145 ]

Markus Jelsma commented on NUTCH-1430:

If there are no objections, I'd like to get this in for 1.7; this is a show stopper for everyone using the FreeGenerator. We've been using this patch in our dist for many months now and are happy with it.

Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule

Key: NUTCH-1430
URL: https://issues.apache.org/jira/browse/NUTCH-1430
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
Fix For: 1.8
Attachments: NUTCH-1430-1.6-1.patch, NUTCH-1430-1.6-2.patch

Steps to reproduce:

Without AdaptiveFetchSchedule:
{code}
$ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
URL: http://www.openindex.io/en/home.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Thu Aug 16 13:58:23 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: c2601ca503f2fc5edcb286501d7fb271
Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
{code}

With AdaptiveFetchSchedule:
{code}
$ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
URL: http://www.openindex.io/en/home.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Tue Jul 17 13:56:33 CEST 2012
Modified time: Tue Jul 17 13:55:33 CEST 2012
Retries since fetch: 0
Retry interval: 60 seconds (0 days)
Score: 0.0
Signature: 23567bb52ee8b905b8649c4305ed82ee
Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
{code}
[jira] [Commented] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682160#comment-13682160 ]

Sebastian Nagel commented on NUTCH-1430:

+1
[jira] [Resolved] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1430.

Resolution: Fixed

Committed for 1.7 in rev. 1492639.
[jira] [Commented] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682226#comment-13682226 ]

Hudson commented on NUTCH-1430:

Integrated in Nutch-trunk #2238 (See [https://builds.apache.org/job/Nutch-trunk/2238/])
NUTCH-1430 Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule (Revision 1492639)

Result = SUCCESS
markus : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1492639
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/tools/FreeGenerator.java
Jenkins build is back to normal : Nutch-trunk #2238
See https://builds.apache.org/job/Nutch-trunk/2238/changes
[jira] [Updated] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1327:

Attachment: NUTCH-1327-1.8-1.patch

Patch for trunk. It rebuilds the URL with querystring parameters properly sorted.

QueryStringNormalizer

Key: NUTCH-1327
URL: https://issues.apache.org/jira/browse/NUTCH-1327
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.9
Attachments: NUTCH-1327-1.8-1.patch

A normalizer for dealing with query strings. Sorting query strings is helpful in preventing duplicates for some (bad) websites.
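
To illustrate what sorting query-string parameters means here, a minimal sketch in plain Java follows. It is not the attached patch: the class name QueryStringSortSketch and the method sortQuery are hypothetical, and it assumes parameters are separated by '&' and compared as raw strings, with no percent-decoding.

```java
import java.util.Arrays;

// Hypothetical sketch, not the NUTCH-1327 patch: sorts raw query parameters.
class QueryStringSortSketch {

  // Returns the URL with its query parameters in sorted order.
  // Assumes '&' separates parameters; no percent-decoding is attempted.
  static String sortQuery(String url) {
    int q = url.indexOf('?');
    int h = url.indexOf('#');
    if (q < 0 || (h >= 0 && h < q)) {
      return url; // no query string (a '?' inside the fragment does not count)
    }
    String query = (h < 0) ? url.substring(q + 1) : url.substring(q + 1, h);
    String fragment = (h < 0) ? "" : url.substring(h);

    String[] params = query.split("&");
    Arrays.sort(params);

    // Rebuild the query, skipping empty parameters such as those from "&&".
    StringBuilder sorted = new StringBuilder();
    for (String param : params) {
      if (param.isEmpty()) {
        continue;
      }
      if (sorted.length() > 0) {
        sorted.append('&');
      }
      sorted.append(param);
    }
    return url.substring(0, q + 1) + sorted + fragment;
  }

  public static void main(String[] args) {
    // Two spellings of the same page end up as one URL.
    System.out.println(sortQuery("http://example.com/page?b=2&a=1"));
    System.out.println(sortQuery("http://example.com/page?a=1&b=2"));
  }
}
```

Both calls in main yield the same string, which is the property that keeps such pages from being stored twice in the CrawlDB. A real normalizer would probably also have to decide how to treat repeated parameter names and encoded characters; this sketch leaves them untouched.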
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1527:

Attachment: NUTCH-1527.patch

Here's a new patch for trunk. I still need to actually test it against an ES instance, but there will probably be a working patch next week. Perhaps it can still be released with 1.7.

Port nutch-elasticsearch-indexer to Nutch

Key: NUTCH-1527
URL: https://issues.apache.org/jira/browse/NUTCH-1527
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Markus Jelsma
Priority: Minor
Fix For: 2.4
Attachments: NUTCH-1527.patch, NUTCH-1527.patch

The source repos for this can be found here [0]. This issue should be in line with the work already done by Julien and others over at NUTCH-1047.

[0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1527:

Attachment: NUTCH-1527.patch

New patch. You do need to have a config/names.txt file in your runtime/local (for whatever reason, I don't know). I also had to update Solr's deps to make sure all Lucene jars are at 4.3.0; otherwise everything fails! After adding indexer-elastic to plugin.includes you can index with:

bin/nutch index -Delastic.cluster=nutch crawl//crawdb/ crawl/segments/20130613162613/

There's one problem I can't figure out right now:

{code}
2013-06-13 17:51:40,205 INFO elasticsearch.node - [nutch] {0.90.1}[1001]: initializing ...
2013-06-13 17:51:40,275 WARN mapred.LocalJobRunner - job_local1865023617_0001
java.lang.LinkageError: loader constraint violation: loader (instance of sun/misc/Launcher$AppClassLoader) previously initiated loading for a different type with name "org/elasticsearch/env/Environment"
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:787)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:447)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
	at org.elasticsearch.plugins.PluginsHelper.sitePlugins(PluginsHelper.java:39)
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:94)
	at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:128)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
	at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
	at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.open(ElasticIndexWriter.java:73)
	at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:78)
	at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
	at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:449)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:491)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2013-06-13 17:51:40,732 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
{code}

Any pointers are much appreciated!
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682380#comment-13682380 ]

lufeng commented on NUTCH-1527:

Hi Markus,

1. Elasticsearch loads its configuration file first, so you need to add config/elasticsearch.yml in your runtime/local/config. But I have not found a way to load that file through the configuration.

2. Do you still have lucene-core-3.4.jar in your runtime/local/lib directory? Or did you add this
{code:xml}
+    <dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.1"
+      conf="*->default"/>
{code}
to the ivy/ivy.xml file? Maybe Elasticsearch cannot load classes inside the Nutch plugin system.
[jira] [Resolved] (NUTCH-1560) index-metadata to add all values of multivalued metadata
[ https://issues.apache.org/jira/browse/NUTCH-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-1560.

Resolution: Fixed
Fix Version/s: (was: 1.8) 1.7

Committed to trunk (r1492832) together with NUTCH-1467. Thanks [~kiranch]!

index-metadata to add all values of multivalued metadata

Key: NUTCH-1560
URL: https://issues.apache.org/jira/browse/NUTCH-1560
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Minor
Fix For: 1.7
Attachments: NUTCH-1560-trunk-v1.patch

MetadataIndexer does not add all values of multivalued meta tags. This causes the fix for NUTCH-1467 to be almost useless.
[jira] [Resolved] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-1467.

Resolution: Fixed

Committed to trunk (r1492832) together with NUTCH-1560. Thanks [~kiranch]!

nutch 1.5.1 not able to parse mutliValued metatags

Key: NUTCH-1467
URL: https://issues.apache.org/jira/browse/NUTCH-1467
Project: Nutch
Issue Type: Bug
Affects Versions: 1.5.1
Reporter: kiran
Priority: Minor
Fix For: 1.9
Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, NUTCH-1467-trunk_v1.patch, NUTCH-1467-trunk_v2.patch, NUTCH-1467-trunk-v3.patch, Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt

Hi,

I have been able to parse metatags in an HTML page using http://wiki.apache.org/nutch/IndexMetatags. It does not work well when there are two metatags with the same name but different contents. Has anyone encountered this kind of issue? Are there any changes that need to be made to the config files to make it work?

When there are two tags with the same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.

Edit: I have attached the patch for the file; it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech.

Many Thanks,
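
The difference between keeping only the later tag and collecting every value can be sketched roughly as follows. This is only an illustration under assumptions, not the committed change: MultiValuedMetaSketch and collect are hypothetical names, and the meta tags are assumed to arrive as simple name/content pairs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the NUTCH-1467 patch.
class MultiValuedMetaSketch {

  // Collects every content value per meta tag name instead of letting
  // a later tag overwrite an earlier one. Each tag is {name, content}.
  static Map<String, List<String>> collect(String[][] tags) {
    Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
    for (String[] tag : tags) {
      List<String> values = result.get(tag[0]);
      if (values == null) {
        values = new ArrayList<String>();
        result.put(tag[0], values);
      }
      values.add(tag[1]); // append; a plain put(name, content) would overwrite
    }
    return result;
  }

  public static void main(String[] args) {
    String[][] tags = {
      {"keywords", "nutch"},
      {"keywords", "crawler"},
      {"author", "example"}
    };
    // prints {keywords=[nutch, crawler], author=[example]}
    System.out.println(collect(tags));
  }
}
```

With a plain name-to-string map the second keywords tag would replace the first, which matches the behaviour reported in the issue.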
[jira] [Commented] (NUTCH-1560) index-metadata to add all values of multivalued metadata
[ https://issues.apache.org/jira/browse/NUTCH-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682743#comment-13682743 ]

Hudson commented on NUTCH-1560:

Integrated in Nutch-trunk #2239 (See [https://builds.apache.org/job/Nutch-trunk/2239/])
NUTCH-1467 and NUTCH-1560: add all values of multi-valued metatags (Revision 1492856)

Result = FAILURE
snagel : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1492856
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/HTMLMetaTags.java
* /nutch/trunk/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java
* /nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java
* /nutch/trunk/src/plugin/parse-metatags/build.xml
* /nutch/trunk/src/plugin/parse-metatags/sample/testMultivalueMetatags.html
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java
Build failed in Jenkins: Nutch-trunk #2239
See https://builds.apache.org/job/Nutch-trunk/2239/changes

Changes:

[snagel] NUTCH-1467 and NUTCH-1560: add all values of multi-valued metatags

--
[...truncated 3261 lines...]

jar:
[jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlmeta/urlmeta.jar

deps-test:

deploy:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlmeta

copy-generated-lib:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlmeta

init:
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
[echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
[jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test/data
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/test/data

init:
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
[echo] Compiling plugin: urlnormalizer-host
[javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
[jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-host/urlnormalizer-host.jar

deps-test:

deploy:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

copy-generated-lib:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host

init:
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
[echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 1 source file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
[jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/urlnormalizer-pass.jar

deps-test:

deploy:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass

copy-generated-lib:
[copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data
[copy] Copying 4 files to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data

init:
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes
[mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-regex
[jira] [Updated] (NUTCH-1486) Upgrade to Solr 4.2.1
[ https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1486:

Issue Type: Improvement (was: Bug)

Upgrade to Solr 4.2.1

Key: NUTCH-1486
URL: https://issues.apache.org/jira/browse/NUTCH-1486
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.6, 2.1
Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT, probably 2.2-SNAPSHOT
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.3, 1.8
Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, NUTCH-1486-trunk.v2.patch, NUTCH-1486-trunk.v3.patch

When attempting to configure a 4 multicore 4.0 instance with Nutch schema-solr4.xml file, I get the following exceptions. This has been discussed previously. As I see it we have two options:
1. Keep maintaining both schema options
2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml
Thoughts?

{code}
SEVERE: Unable to create core: collection4
org.apache.solr.common.SolrException: Unable to use updateLog: _version_field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)
	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:721)
	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:566)
	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850)
	at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
	at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
	at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
	at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
	at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
	at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
	at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
	at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
	at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
	at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
	at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
	at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
	at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
	at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
	at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
	at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
	at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
	at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
	at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
	at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
	at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
	at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
	at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
	at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
	at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
	at org.eclipse.jetty.server.Server.doStart(Server.java:263)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
	at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1475:

Fix Version/s: 2.3

Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Key: NUTCH-1475
URL: https://issues.apache.org/jira/browse/NUTCH-1475
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1, 1.5.1
Environment: All
Reporter: James Sullivan
Priority: Minor
Labels: index-more, plugins
Fix For: 2.3, 1.8
Attachments: index-more-1xand2x.patch, index-more-2x.patch, index-more-2x.patch
Original Estimate: 1h
Remaining Estimate: 1h

Among other fields, the more plugin for Nutch 2.x provides a last modified field and a date field for the Solr index. The last modified field is the last modified date from the HTTP headers if available; if not available, it is left empty. Currently, the date field is the same as the last modified field unless that field is empty, in which case getFetchTime is used as a fallback. I think getFetchTime is not a good fallback: it is the next fetch time, often a month or more in the future, which doesn't make sense for the date field. Users do not expect webpages/documents with future dates. A more sensible fallback would be the current date at the time the page is indexed. This is possible by simply changing line 97 of https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java from

    time = page.getFetchTime(); // use fetch time

to

    time = new Date().getTime();

Users interested in the getFetchTime value can still get it from the tstamp field.
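
The proposed fallback can be sketched in isolation as below. This is a hedged illustration, not the MoreIndexingFilter code: DateFallbackSketch and resolveDate are hypothetical names, and the sketch assumes that a missing Last-Modified header is represented by a value of 0 or less.

```java
import java.util.Date;

// Hypothetical sketch of the fallback proposed in NUTCH-1475.
class DateFallbackSketch {

  // Returns the value for the date field, in milliseconds since the epoch.
  // lastModified: Last-Modified from the HTTP headers; assumed <= 0 when absent.
  // now: the indexing time, used instead of the (future) next fetch time.
  static long resolveDate(long lastModified, long now) {
    return lastModified > 0 ? lastModified : now;
  }

  public static void main(String[] args) {
    long now = new Date().getTime();
    // Header present: the header value is kept.
    System.out.println(new Date(resolveDate(1371000000000L, now)));
    // Header missing: the current time is used, so the date is never in the future.
    System.out.println(new Date(resolveDate(0L, now)));
  }
}
```

Passing the current time in as a parameter is only a choice made for this sketch so that the behaviour can be checked with fixed values.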
Re: right place to put wiki images
Hi,

On Wed, Jun 12, 2013 at 2:01 AM, dev-digest-h...@nutch.apache.org wrote:
> As per suggestion by Seb, I have corrected wiki at several places. The
> images over Admin UI Proposal are lost as they were hosted somewhere else
> and the site is down now :(
> http://wiki.apache.org/nutch/NutchAdministrationUserInterface

You can actually check out the code that was proposed for the GUI from here
https://github.com/101tec/nutch/wiki
It is extremely dated, and better proposal has been suggested now. Purely for motivation and graphic content it is useful to see what the proposed GUI looked like.

Lewis
Jenkins build is back to normal : Nutch-nutchgora #645
See https://builds.apache.org/job/Nutch-nutchgora/645/
Jenkins build is back to normal : Nutch-trunk #2240
See https://builds.apache.org/job/Nutch-trunk/2240/