[jira] [Updated] (NUTCH-1278) Fetch Improvement in threads per host
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1278:
    Attachment: NUTCH-1278.zip

Fetch Improvement in threads per host
    Key: NUTCH-1278
    URL: https://issues.apache.org/jira/browse/NUTCH-1278
    Project: Nutch
    Issue Type: New Feature
    Components: fetcher
    Affects Versions: 1.4
    Reporter: behnam nikbakht
    Attachments: NUTCH-1278.zip

The value of maxThreads is equal to fetcher.threads.per.host and is constant for every host. It would be possible to use a dynamic value for each host, driven by the number of blocked requests: if the number of blocked requests for a host increases, we must decrease this value and increase http.timeout.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211294#comment-13211294 ]

behnam nikbakht commented on NUTCH-1278:

Here is an initial patch with changes to Fetcher.java, Protocol.java and its plugins such as lib-http. I use a file on the local file system to maintain a hashtable of hosts and their http.timeout values. Each blocked response increments the timeout for that host, and each success decrements it. The increment and decrement rates can differ, so we can balance the total time of the fetch job against the ratio of fetched to blocked requests. For example, it could be made configurable so that if 90% of the requests to a host succeed, the timeout is not increased.

Fetch Improvement in threads per host
    Key: NUTCH-1278
    URL: https://issues.apache.org/jira/browse/NUTCH-1278
    Project: Nutch
    Issue Type: New Feature
    Components: fetcher
    Affects Versions: 1.4
    Reporter: behnam nikbakht
    Attachments: NUTCH-1278.zip
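The bookkeeping described in this comment could look roughly like the sketch below. It is not taken from NUTCH-1278.zip: the class name HostTimeoutTable, its methods, and the in-memory map (instead of the local file the comment mentions) are all assumptions made for illustration, and it uses Java 8 map methods that Nutch 1.4 itself would not have had available.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not a Nutch class): one adaptive timeout per host, in milliseconds. */
class HostTimeoutTable {
    private final Map<String, Integer> timeouts = new ConcurrentHashMap<>();
    private final int baseMs;      // timeout for hosts with no recorded blocks
    private final int maxMs;       // upper bound so one bad host cannot stall the job
    private final int incrementMs; // added after each blocked response
    private final int decrementMs; // removed after each success

    HostTimeoutTable(int baseMs, int maxMs, int incrementMs, int decrementMs) {
        this.baseMs = baseMs;
        this.maxMs = maxMs;
        this.incrementMs = incrementMs;
        this.decrementMs = decrementMs;
    }

    /** Current timeout for a host; hosts never seen get the base value. */
    int timeoutFor(String host) {
        return timeouts.getOrDefault(host, baseMs);
    }

    /** A blocked response raises the host's timeout, capped at maxMs. */
    void recordBlocked(String host) {
        int firstValue = Math.min(maxMs, baseMs + incrementMs);
        timeouts.merge(host, firstValue,
                (old, unused) -> Math.min(maxMs, old + incrementMs));
    }

    /** A success lowers the timeout; once it is back at the base, the entry is dropped. */
    void recordSuccess(String host) {
        timeouts.computeIfPresent(host, (h, old) -> {
            int next = old - decrementMs;
            return next <= baseMs ? null : Integer.valueOf(next); // null removes the entry
        });
    }
}
```

Using separate increment and decrement sizes is what gives the trade-off mentioned in the comment: a large increment with a small decrement favours slow hosts, while the reverse keeps the total fetch time down.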
[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211296#comment-13211296 ]

Hudson commented on NUTCH-1246:

Integrated in Nutch-nutchgora #165 (See [https://builds.apache.org/job/Nutch-nutchgora/165/])
commit to address NUTCH-1246 and update to CHANGES.txt (Revision 1245921)

Result = SUCCESS
lewismc :
Files :
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/ivy/ivy.xml

Upgrade to Hadoop 1.0.0
    Key: NUTCH-1246
    URL: https://issues.apache.org/jira/browse/NUTCH-1246
    Project: Nutch
    Issue Type: Improvement
    Affects Versions: nutchgora, 1.5
    Reporter: Julien Nioche
[jira] [Created] (NUTCH-1282) linkdb scalability
linkdb scalability
    Key: NUTCH-1282
    URL: https://issues.apache.org/jira/browse/NUTCH-1282
    Project: Nutch
    Issue Type: Improvement
    Components: linkdb
    Affects Versions: 1.4
    Reporter: behnam nikbakht

As described in NUTCH-1054, the linkdb is optional in solrindex; it is used only for anchors and has no impact on scoring. The linkdb appears to grow very fast during incremental crawls, which makes it unscalable for very large sets of web sites. So there are two choices: first, drop invertlinks and the linkdb from the crawl; second, make it scalable.

invertlinks runs two jobs: the first builds a new linkdb from the newly parsed segments, and the second merges the new linkdb with the old one. The second job is the unscalable one, and we can skip it with these changes in solrindex: in the reduce method of IndexerMapReduce, if fetchDatum == null or dbDatum == null or parseText == null or parseData == null, add the anchors to the doc and update Solr (no insert). Some changes to NutchDocument are also required.
[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211314#comment-13211314 ]

Lewis John McGibbney commented on NUTCH-1281:

Hi behnam, there is a similar issue open, NUTCH-965, and a patch has been submitted for Nutchgora. I wonder if you could check it out and comment on the link between these two. Also, would it be possible for you to attach your code changes as a patch against trunk? Which I guess is what you are using. Thank you

tika parser not work properly with unwanted file types that passed from filters in nutch
    Key: NUTCH-1281
    URL: https://issues.apache.org/jira/browse/NUTCH-1281
    Project: Nutch
    Issue Type: Improvement
    Components: parser
    Reporter: behnam nikbakht

When this property is set in parse-plugins.xml:

    <mimeType name="*">
      <plugin id="parse-tika" />
    </mimeType>

all unwanted files that pass every filter are referred to Tika. For some file types, such as .flv, the Tika parser has problems and hangs, which causes the parse job to fail. So if these file types get past regex-urlfilter and the other filters, the parse job fails.

To address this, I suggest adding some properties for the valid file types and using code like this in TikaParser.java:

    public ParseResult getParse(Content content) {
      String mimeType = content.getContentType();
    + String[] validTypes = new String[]{"application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text", "text/plain", "application/rtf", "application/rss+xml", "application/x-bzip2", "application/x-gzip", "application/x-javascript", "application/javascript", "text/javascript", "application/x-shockwave-flash", "application/zip", "text/xml", "application/xml"};
    + boolean valid = false;
    + for (int k = 0; k < validTypes.length; k++) {
    +   if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
    +     valid = true;
    + }
    + if (!valid)
    +   return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype " + mimeType).getEmptyParseResult(content.getUrl(), getConf());
      URL base;
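The description asks for the valid types to come from properties rather than a hard-coded array. A minimal sketch of that idea follows, assuming the list arrives as one comma-separated string; the class name MimeAllowList and the way the string is obtained are hypothetical and not part of Nutch or of the proposed patch. Unlike the loop above, it also tolerates a null content type and strips parameters such as "; charset=UTF-8".

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): decides whether a MIME type may be parsed. */
class MimeAllowList {
    private final Set<String> allowed = new HashSet<>();

    /** Builds the list from a comma-separated value, e.g. one read from a config property. */
    MimeAllowList(String commaSeparatedTypes) {
        for (String type : commaSeparatedTypes.split(",")) {
            String normalized = type.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                allowed.add(normalized);
            }
        }
    }

    /** True if the content type, ignoring case and any parameters, is on the list. */
    boolean accepts(String contentType) {
        if (contentType == null) {
            return false; // nothing to match against, so refuse to parse
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return allowed.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

A parser could then return the NOTPARSED status whenever accepts(...) is false, which is the same decision the patch makes with its loop.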
[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211316#comment-13211316 ]

behnam nikbakht commented on NUTCH-1281:

The problem is that the actual MIME types cannot be filtered properly until the fetch or parse starts. There are also so many file types that we cannot filter all of them, and there may be bugs in the Tika parser for some of them. So we can filter in TikaParser against a list of valid file types.

tika parser not work properly with unwanted file types that passed from filters in nutch
    Key: NUTCH-1281
    URL: https://issues.apache.org/jira/browse/NUTCH-1281
    Project: Nutch
    Issue Type: Improvement
    Components: parser
    Reporter: behnam nikbakht
[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211320#comment-13211320 ]

Lewis John McGibbney commented on NUTCH-1278:

Behnam, this looks interesting but there are a few problems here.

1) It would be much, much easier for us to apply, test and comment on your contribution if you included it in a simple .patch file. This can be done like so
{code}
$ cd $NUTCH_HOME
$ svn diff > NUTCH-patch-name.patch
{code}
The current zip format for the patch(es), plus the fact that every class has been patched separately from its own directory, makes it really hard for us to work with this.

2) It doesn't appear that this patch actually applies against trunk. Maybe 1.4? You can check out trunk here [1]. I was getting errors when trying to apply HttpBase, then gave up and started writing this.

3) For a change to the fetcher of this scale, it would be really nice if you could provide a test within the test suite we already maintain [2].

As I said, this looks really great, and sorry for the rather lengthy initial response, but for us to consider this for integration it would be great for your contributions to meet this minimum requirement, as they are highly appreciated. Thank you

[1] https://svn.apache.org/repos/asf/nutch/trunk/
[2] https://svn.apache.org/viewvc/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java?view=markup

Fetch Improvement in threads per host
    Key: NUTCH-1278
    URL: https://issues.apache.org/jira/browse/NUTCH-1278
    Project: Nutch
    Issue Type: New Feature
    Components: fetcher
    Affects Versions: 1.4
    Reporter: behnam nikbakht
    Attachments: NUTCH-1278.zip
[jira] [Created] (NUTCH-1283) Ridically update all Solr configuration in Nutchgora
Ridically update all Solr configuration in Nutchgora
    Key: NUTCH-1283
    URL: https://issues.apache.org/jira/browse/NUTCH-1283
    Project: Nutch
    Issue Type: Improvement
    Components: indexer
    Affects Versions: nutchgora
    Reporter: Lewis John McGibbney
    Fix For: nutchgora

We're currently running with a schema which states it's 1.4 :0| There should be better support for the newer stuff going on over in Solrland. This issue should track those improvements entirely.
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=5&rev2=6 == Look and feel Admin Gui: == + This [[https://github.com/101tec/nutch/wiki|link]] provides the best working prototype of an example admin gui, it also provides a heap of material relating to what kind and level of functionality the Nutch webapp should support. - The following link provide a non working prototype of the admin gui created by Frank Henze (credits). - http://www.media-style.com/gfx/nutchadmin/index.html == Description Admin Gui: == There are three main functionalities of the admin gui
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=6&rev2=7 == Timetable: == - first beta version until end of Feburary. + TODO == people: == - Frank Hanze: jsp programming - Marko Bauhard: re-factoring nutchConf and tools api + The Apache Nutch Development team. - Stefan Groschupf: developing plugin Extension point and ground framework + Original developers working on this included Frank Hanze (jsp programming), Marko Bauhard (re-factoring nutchConf and tools api) Stefan Groschupf (developing plugin Extension point and ground framework) Please add yourself here. - If you wish to help but do not know how, please get in touch with Stefan. + If you wish to help but do not know how, please get in touch with the Nutch tean [[http://nutch.apache.org/mailing_lists.htmlhere]]. == Download: == - Here are some mirrors where you can download a version of nutch-0.8-dev bundled with the administration GUI: + The code base we are working on is Nutch 2.0, which you can checkout [[https://svn.apache.org/repos/asf/nutch/branches/nutchgora/|here]]. If you are unfamiliar with using SVN repositories and SVN, then please see [[http://nutch.apache.org/version_control.html|here]]. + + == Old Resources == + + Here are some mirrors where you can download a version of nutch-0.8-dev bundled with the administration GUI, some of these mirrors no longer exist, and are there merely to provide you with a look and feel for the GUI. * http://85.214.26.67/nutch-admingui/nutch-0.8-dev_guiBundle_05_02_06.tar.gz * http://jerome.charron.free.fr/nutch/nutch-0.8-dev_guiBundle_05_02_06.tar.gz
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=7&rev2=8 == Look and feel Admin Gui: == This [[https://github.com/101tec/nutch/wiki|link]] provides the best working prototype of an example admin gui, it also provides a heap of material relating to what kind and level of functionality the Nutch webapp should support. + {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}} + {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}} + {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}} + {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}} + {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}} + {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}} == Description Admin Gui: == There are three main functionalities of the admin gui
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=8&rev2=9 == Look and feel Admin Gui: == This [[https://github.com/101tec/nutch/wiki|link]] provides the best working prototype of an example admin gui, it also provides a heap of material relating to what kind and level of functionality the Nutch webapp should support. + + === A New Nutch Instance === + {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}} + + === Congiguration UI === + {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}} + + === URL Upload === + {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}} + + === Example Crawl === + {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}} + + === Example Crawl === + {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}} + + === Example Crawl === + {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}} + + /!\ '''Edit conflict - other version:''' {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}} {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}} {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}} {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}} {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}} {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}} + + /!\ '''Edit conflict - your version:''' + + /!\ '''End of edit conflict''' == Description Admin Gui: == There are three main functionalities of the admin gui
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=9&rev2=10 === URL Upload === {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}} - === Example Crawl === + === Example Crawl 1 === {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}} - === Example Crawl === + === Example Crawl 2 === {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}} - === Example Crawl === + === Example Crawl 3 === {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}} - - /!\ '''Edit conflict - other version:''' - {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}} - {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}} - {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}} - {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}} - {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}} - {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}} - - /!\ '''Edit conflict - your version:''' - - /!\ '''End of edit conflict''' == Description Admin Gui: == There are three main functionalities of the admin gui
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=11&rev2=12 == Summary: == - The goal is to extend nutch with a comfortable web based administration user interface to monitor, configure and manage one or a set of nutch search system instances. + The goal is to extend Apach Nutch with a comfortable [[https://issues.apache.org/jira/browse/NUTCH-929|web based administration user interface]] to monitor, configure and manage one or a set of Nutch search system instances through the [[https://issues.apache.org/jira/browse/NUTCH-880|REST-API]]. This will tie together a number of issues, ultimately resulting in a [[https://issues.apache.org/jira/browse/NUTCH-841|Nutch 2.0 Webapp]] == Vision: ==
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=12&rev2=13 - = proposal Nutch appliance / Nutch admin gui = + = Proposal Nutch appliance / Nutch admin gui = == Summary: ==
[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney
The NutchAdministrationUserInterface page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=14&rev2=15 == Summary: == - The goal is to extend Apach Nutch with a comfortable [[https://issues.apache.org/jira/browse/NUTCH-929|web based administration user interface]] to monitor, configure and manage one or a set of Nutch search system instances through the [[https://issues.apache.org/jira/browse/NUTCH-880|REST-API]]. This will tie together a number of issues, ultimately resulting in a [[https://issues.apache.org/jira/browse/NUTCH-841|Nutch 2.0 Webapp]] + The goal is to extend [[http://nutch.apache.org|Apach Nutch]] with a comfortable [[https://issues.apache.org/jira/browse/NUTCH-929|web based administration user interface]] to monitor, configure and manage one or a set of Nutch search system instances through the [[https://issues.apache.org/jira/browse/NUTCH-880|REST-API]]. This will tie together a number of issues, ultimately resulting in a [[https://issues.apache.org/jira/browse/NUTCH-841|Nutch 2.0 Webapp]] == Vision: ==
[jira] [Commented] (NUTCH-929) Create a REST-based admin UI for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211404#comment-13211404 ]

Lewis John McGibbney commented on NUTCH-929:

As we are using org.restlet as the underlying RESTlet framework, we will need to use the presentation technologies it supports, e.g. integration with three popular template technologies: XSLT, FreeMarker or Apache Velocity.

[1] http://wiki.restlet.org/docs_2.0/13-restlet/21-restlet/378-restlet/116-restlet.html

Create a REST-based admin UI for Nutch
    Key: NUTCH-929
    URL: https://issues.apache.org/jira/browse/NUTCH-929
    Project: Nutch
    Issue Type: New Feature
    Components: administration gui
    Affects Versions: nutchgora
    Reporter: Andrzej Bialecki

This is a follow up to NUTCH-880 - we need to expose the functionality of REST API in a user-friendly admin UI. Thanks to the nature of the API the UI can be implemented in any UI framework that speaks REST/JSON, so it could be a simple webapp (we already have jetty) or a Swing / Pivot / etc standalone application.
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1253:
    Attachment: NUTCH-1253-nutchgora.patch
                NUTCH-1253.patch

Trivial patches for both trunk and Nutchgora branch. Can you guys please test and get back on this issue. Thanks

Incompatible neko and xerces versions
    Key: NUTCH-1253
    URL: https://issues.apache.org/jira/browse/NUTCH-1253
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.4
    Environment: Ubuntu 10.04
    Reporter: Dennis Spathis
    Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch

The Nutch 1.4 distribution includes
- nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
- xercesImpl-2.9.1.jar (under .../runtime/local/lib)

These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.)

I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem. Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:

    <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
      <runtime>
        <library name="nekohtml-0.9.5.jar">
          <export name="*"/>
        </library>
      </runtime>
    </plugin>

Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?
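The note about adding a catch(Throwable) clause matters because AbstractMethodError is an Error, not an Exception, so an ordinary catch (Exception e) never sees it. The sketch below only illustrates that point: the class ThrowableLoggingDemo and the simulated brokenParse() are made up and do not reproduce the real HtmlParser code.

```java
/** Hypothetical demo (not Nutch code): why catch(Throwable) is needed to see linkage errors. */
class ThrowableLoggingDemo {

    /** Stand-in for a parse call that hits an incompatible jar at runtime. */
    static void brokenParse() {
        throw new AbstractMethodError("simulated neko/xerces mismatch");
    }

    /** Catches Exception only: an AbstractMethodError passes straight through. */
    static String runCatchingException(Runnable parseCall) {
        try {
            parseCall.run();
            return "parsed";
        } catch (Exception e) {
            return "exception: " + e.getClass().getName();
        }
    }

    /** Catches Throwable: the error is intercepted and can be logged with its stack trace. */
    static String runCatchingThrowable(Runnable parseCall) {
        try {
            parseCall.run();
            return "parsed";
        } catch (Throwable t) {
            t.printStackTrace(); // the real plugin would presumably use its own logger here
            return "throwable: " + t.getClass().getName();
        }
    }
}
```

Catching Throwable is reasonable as a temporary diagnostic like this; once the jar versions agree, the extra clause should no longer be needed.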
[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1253:
    Patch Info: Patch Available

Incompatible neko and xerces versions
    Key: NUTCH-1253
    URL: https://issues.apache.org/jira/browse/NUTCH-1253
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.4
    Environment: Ubuntu 10.04
    Reporter: Dennis Spathis
    Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch
[jira] [Updated] (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-728:
    Attachment: NUTCH-728-v2.patch
                NUTCH-728-nutchgora.patch

Updated patches for trunk and Nutchgora

Improve nutch release packaging
    Key: NUTCH-728
    URL: https://issues.apache.org/jira/browse/NUTCH-728
    Project: Nutch
    Issue Type: Improvement
    Reporter: Sami Siren
    Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, NUTCH-728.patch

see the discussion from http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact
[jira] [Closed] (NUTCH-1276) Fix [dep-ann]
[ https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney closed NUTCH-1276.

Fix [dep-ann]
    Key: NUTCH-1276
    URL: https://issues.apache.org/jira/browse/NUTCH-1276
    Project: Nutch
    Issue Type: Sub-task
    Components: build
    Affects Versions: nutchgora, 1.5
    Reporter: Lewis John McGibbney
    Fix For: nutchgora, 1.5

Generally speaking these are more straightforward than others as it should be a case of either annotating using
{code}
@Deprecated
{code}
or of course replacing the deprecated class method with another non-deprecated implementation. Hopefully most of these occurrences will be resolved within NUTCH-1273
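For context, javac reports [dep-ann] when a member carries the @deprecated Javadoc tag but lacks the @Deprecated annotation. A minimal sketch of the annotated form follows; LegacyApi and its methods are invented names for illustration, not Nutch classes.

```java
/** Hypothetical example class (not from Nutch) showing the fix for a [dep-ann] warning. */
class LegacyApi {

    /**
     * Old entry point.
     *
     * @deprecated use {@link #fetchAll()} instead.
     */
    @Deprecated // without this annotation, javac -Xlint:dep-ann warns about the tag above
    static int fetch() {
        return fetchAll();
    }

    /** Non-deprecated replacement that callers should move to. */
    static int fetchAll() {
        return 42;
    }

    /** Reports whether a no-argument method of this class carries @Deprecated at runtime. */
    static boolean isMarkedDeprecated(String methodName) {
        try {
            return LegacyApi.class.getDeclaredMethod(methodName)
                    .isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```

Keeping both the Javadoc tag and the annotation is the usual convention: the tag tells readers what to use instead, and the annotation is what the compiler and tools act on.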
[jira] [Resolved] (NUTCH-1276) Fix [dep-ann]
[ https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1276.
    Resolution: Fixed

Committed @ revision 1291030 in trunk
Committed @ revision 1291031 in Nutchgora branch

Fix [dep-ann]
    Key: NUTCH-1276
    URL: https://issues.apache.org/jira/browse/NUTCH-1276
    Project: Nutch
    Issue Type: Sub-task
    Components: build
    Affects Versions: nutchgora, 1.5
    Reporter: Lewis John McGibbney
    Fix For: nutchgora, 1.5
[jira] [Commented] (NUTCH-1273) Fix [deprecation] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211464#comment-13211464 ]

Lewis John McGibbney commented on NUTCH-1273:

With this issue, do we wish to simply suppress the warnings? What other options do we have? It makes me think that we could upgrade the use of classes within our library dependencies. Any ideas?

Fix [deprecation] javac warnings
    Key: NUTCH-1273
    URL: https://issues.apache.org/jira/browse/NUTCH-1273
    Project: Nutch
    Issue Type: Sub-task
    Components: build
    Affects Versions: nutchgora, 1.5
    Reporter: Lewis John McGibbney
    Priority: Minor
    Fix For: nutchgora, 1.5

As part of this task, these warnings should be resolved, however this particular strand of warnings can either be resolved by adding
{code}
@SuppressWarnings("deprecation")
{code}
or by actually upgrading our class usage to rely upon non-deprecated classes. Which option is more appropriate for the project?
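To make the two options concrete, here is a small side-by-side sketch. It uses java.util.Date.getYear() purely as a well-known deprecated JDK method; the class DeprecationChoices is hypothetical and has nothing to do with the actual warnings in the Nutch build.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

/** Hypothetical example (not Nutch code): two ways to silence a [deprecation] warning. */
class DeprecationChoices {

    /** Option 1: keep the deprecated call and suppress the warning for this method only. */
    @SuppressWarnings("deprecation")
    static int yearSuppressed(Date date) {
        return date.getYear() + 1900; // Date.getYear() is deprecated and counts from 1900
    }

    /** Option 2: move to the non-deprecated API, so no suppression is needed. */
    static int yearUpgraded(Date date) {
        Calendar calendar = new GregorianCalendar(); // default time zone, like Date.getYear()
        calendar.setTime(date);
        return calendar.get(Calendar.YEAR);
    }
}
```

Suppression is quick but leaves the old call in place; upgrading removes the warning at its source, at the cost of touching more code and retesting it.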
[jira] [Assigned] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint argument
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1249: --- Assignee: Lewis John McGibbney Resolve all issues flagged up by adding javac -Xlint argument -- Key: NUTCH-1249 URL: https://issues.apache.org/jira/browse/NUTCH-1249 Project: Nutch Issue Type: Improvement Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 There are a heap of issues flagged up by NUTCH-1237; I think over time it would be great to get these addressed and resolved. What is interesting is that adding the same arguments to /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail. Some of this stuff is documented in the link below http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options
[jira] [Resolved] (NUTCH-1271) Fix errors @ compile time
[ https://issues.apache.org/jira/browse/NUTCH-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1271. - Resolution: Duplicate This issue duplicates the more accurate NUTCH-1249 Fix errors @ compile time - Key: NUTCH-1271 URL: https://issues.apache.org/jira/browse/NUTCH-1271 Project: Nutch Issue Type: Improvement Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 After adding the -Xlint commands to build.xml, we see many errors when compiling. These should be fixed.
[jira] [Closed] (NUTCH-1271) Fix errors @ compile time
[ https://issues.apache.org/jira/browse/NUTCH-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-1271. --- Fix errors @ compile time - Key: NUTCH-1271 URL: https://issues.apache.org/jira/browse/NUTCH-1271 Project: Nutch Issue Type: Improvement Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 After adding the -Xlint commands to build.xml, we see many errors when compiling. These should be fixed.
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211470#comment-13211470 ] Lewis John McGibbney commented on NUTCH-978: Hi Chris, did you mentor this project through GSoC? I've downloaded the .zip available in the description (which I've also attached in case the link goes AWOL) and I'm going to play about with it. I'll attach it as a patch if I get anywhere. [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing. --- Key: NUTCH-978 URL: https://issues.apache.org/jira/browse/NUTCH-978 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.2 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9 Reporter: Ammar Shadiq Assignee: Chris A. Mattmann Priority: Minor Labels: gsoc2011, mentor Fix For: nutchgora Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png Original Estimate: 1,680h Remaining Estimate: 1,680h Nutch uses the parse-html plugin to parse web pages. It processes the contents of a web page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted text to be stored in the index. By default, Nutch cannot select particular atomic elements of an HTML page, such as certain tags, certain content, or some part of the page. An HTML page has a tree-like XML structure, with HTML tags as its branches and text as its nodes. These branches and nodes can be extracted using XPath. XPath allows us to select a particular branch or node of an XML document, and can therefore be used to extract specific information and treat it differently based on its content and the user's requirements. 
Furthermore, a web domain such as a news website usually uses the same HTML structure for the information on its pages. Pages sharing this structure can be parsed with the same XPath queries to retrieve the same content elements. All of the XPath queries for selecting various content could be stored in an XPath configuration file. Nutch is meant to handle many web sources, and not all pages retrieved from those sources have the same HTML structure, so they have to be treated differently, each using the correct XPath configuration. The correct XPath configuration could be selected automatically by matching the URL of the web page against a regex describing the valid URL pattern for that configuration. This automatic mechanism allows Nutch users to process various web pages and get only the information they want, making the index more accurate and its content more flexible. The components for this idea have been tested on Nutch 1.2 for selecting certain elements on various news websites for the purpose of document clustering. This includes a Configuration Editor application built using the NetBeans 6.9 Application Framework, though it needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
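The URL-driven selection of an XPath configuration described in this issue can be sketched roughly as follows. This is not the plugin's code: the class name, the rule format, and the example URL and query are assumptions, and the JDK's javax.xml.xpath API used here only accepts well-formed markup, so real-world HTML would first need a tolerant parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathConfigSketch {

    // URL pattern -> XPath query; insertion order decides which rule wins.
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    public void addRule(String urlRegex, String xpathQuery) {
        rules.put(Pattern.compile(urlRegex), xpathQuery);
    }

    /** Returns the query of the first rule whose regex matches the whole URL, or null. */
    public String selectQuery(String url) {
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            if (rule.getKey().matcher(url).matches()) {
                return rule.getValue();
            }
        }
        return null;
    }

    /** Evaluates the query against well-formed markup and returns its string value. */
    public static String extract(String xhtml, String query) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(query, new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad query or markup: " + query, e);
        }
    }

    public static void main(String[] args) {
        XPathConfigSketch config = new XPathConfigSketch();
        config.addRule("https?://news\\.example\\.com/.*", "//div[@id='article']/h1");
        String page = "<html><body><div id='article'><h1>Headline</h1><p>Body</p></div></body></html>";
        String query = config.selectQuery("http://news.example.com/2012/02/story.html");
        System.out.println(extract(page, query));
    }
}
```

Keeping the rules in insertion order makes the first matching pattern win, which is one simple way to resolve overlapping URL patterns; the issue itself does not say how such conflicts should be handled.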
[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-978: --- Attachment: for_GSoc.zip In its present form this is quite literally all over the place and is merely for safe keeping. [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing. --- Key: NUTCH-978 URL: https://issues.apache.org/jira/browse/NUTCH-978 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.2 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9 Reporter: Ammar Shadiq Assignee: Chris A. Mattmann Priority: Minor Labels: gsoc2011, mentor Fix For: nutchgora Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip Original Estimate: 1,680h Remaining Estimate: 1,680h Nutch uses the parse-html plugin to parse web pages. It processes the contents of a web page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted text to be stored in the index. By default, Nutch cannot select particular atomic elements of an HTML page, such as certain tags, certain content, or some part of the page. An HTML page has a tree-like XML structure, with HTML tags as its branches and text as its nodes. These branches and nodes can be extracted using XPath. XPath allows us to select a particular branch or node of an XML document, and can therefore be used to extract specific information and treat it differently based on its content and the user's requirements. Furthermore, a web domain such as a news website usually uses the same HTML structure for the information on its pages. Pages sharing this structure can be parsed with the same XPath queries to retrieve the same content elements. 
All of the XPath queries for selecting various content could be stored in an XPath configuration file. Nutch is meant to handle many web sources, and not all pages retrieved from those sources have the same HTML structure, so they have to be treated differently, each using the correct XPath configuration. The correct XPath configuration could be selected automatically by matching the URL of the web page against a regex describing the valid URL pattern for that configuration. This automatic mechanism allows Nutch users to process various web pages and get only the information they want, making the index more accurate and its content more flexible. The components for this idea have been tested on Nutch 1.2 for selecting certain elements on various news websites for the purpose of document clustering. This includes a Configuration Editor application built using the NetBeans 6.9 Application Framework, though it needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
[jira] [Created] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
Add site fetcher.max.crawl.delay as log output by default. -- Key: NUTCH-1284 URL: https://issues.apache.org/jira/browse/NUTCH-1284 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Trivial Fix For: nutchgora, 1.5 Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like: {code} 2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms) {code} This way we can easily and quickly determine whether the fetcher is having to use this functionality or not.
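A minimal sketch of how the message part of such a log line might be assembled. The helper name and the 5000 ms value are hypothetical; the timestamp, level, and logger name in the example above would come from the logging framework, and this is not code from Fetcher.java.

```java
public class CrawlDelayLogSketch {

    /** Builds the message part of the proposed log line (hypothetical helper). */
    public static String fetchingMessage(String url, long crawlDelayMs) {
        return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
    }

    public static void main(String[] args) {
        // A real fetcher would hand this string to its logger at INFO level.
        System.out.println(fetchingMessage("http://nutch.apache.org/", 5000L));
    }
}
```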
[jira] [Commented] (NUTCH-1276) Fix [dep-ann]
[ https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211482#comment-13211482 ] Hudson commented on NUTCH-1276: --- Integrated in Nutch-trunk #1762 (See [https://builds.apache.org/job/Nutch-trunk/1762/]) trivial commit to address NUTCH-1276 (Revision 1291030) Result = SUCCESS lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1291030 Files : * /nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java * /nutch/trunk/src/java/org/apache/nutch/net/protocols/ProtocolException.java * /nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java * /nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java Fix [dep-ann] - Key: NUTCH-1276 URL: https://issues.apache.org/jira/browse/NUTCH-1276 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Fix For: nutchgora, 1.5 Generally speaking these are more straightforward than others as it should be a case of either annotating using {code} @Deprecated {code} or of course replacing the deprecated class method with another non-deprecated implementation. Hopefully most of these occurrences will be resolved within NUTCH-1273
[jira] [Commented] (NUTCH-1276) Fix [dep-ann]
[ https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211484#comment-13211484 ] Hudson commented on NUTCH-1276: --- Integrated in nutch-trunk-maven #156 (See [https://builds.apache.org/job/nutch-trunk-maven/156/]) trivial commit to address NUTCH-1276 (Revision 1291030) Result = SUCCESS lewismc : Files : * /nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java * /nutch/trunk/src/java/org/apache/nutch/net/protocols/ProtocolException.java * /nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java * /nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java Fix [dep-ann] - Key: NUTCH-1276 URL: https://issues.apache.org/jira/browse/NUTCH-1276 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Fix For: nutchgora, 1.5 Generally speaking these are more straightforward than others as it should be a case of either annotating using {code} @Deprecated {code} or of course replacing the deprecated class method with another non-deprecated implementation. Hopefully most of these occurrences will be resolved within NUTCH-1273
[jira] [Commented] (NUTCH-1276) Fix [dep-ann]
[ https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211507#comment-13211507 ] Hudson commented on NUTCH-1276: --- Integrated in Nutch-nutchgora #166 (See [https://builds.apache.org/job/Nutch-nutchgora/166/]) trivial commit to address NUTCH-1276 (Revision 1291031) Result = SUCCESS lewismc : Files : * /nutch/branches/nutchgora/src/java/org/apache/nutch/net/protocols/ProtocolException.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/OutlinkExtractor.java * /nutch/branches/nutchgora/src/test/org/apache/nutch/util/CrawlTestUtil.java Fix [dep-ann] - Key: NUTCH-1276 URL: https://issues.apache.org/jira/browse/NUTCH-1276 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Fix For: nutchgora, 1.5 Generally speaking these are more straightforward than others as it should be a case of either annotating using {code} @Deprecated {code} or of course replacing the deprecated class method with another non-deprecated implementation. Hopefully most of these occurrences will be resolved within NUTCH-1273
[jira] [Commented] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211586#comment-13211586 ] Elisabeth Adler commented on NUTCH-809: --- I haven't tested the plugin in 1.4 myself, but I think a few guys on the mailing list already used it with 1.4. Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.4, nutchgora Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.5 Attachments: NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip h2. Parse-metatags plugin The parse-metatags plugin consists of a HTMLParserFilter which takes as parameter a list of metatag names with '*' as default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields 'keywords' and 'description'. Note that keywords is multivalued. The query-basic plugin is used to include these fields in the search e.g. in nutch-site.xml {code:xml} <property> <name>query.basic.description.boost</name> <value>2.0</value> </property> <property> <name>query.basic.keywords.boost</name> <value>2.0</value> </property> {code} This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com
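To illustrate how a ';'-separated value such as the metatags.names setting above might be interpreted, here is a small sketch. It is not the plugin's implementation: the class and method names are hypothetical, and the lower-casing and whitespace trimming are assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class MetatagNamesSketch {

    /** Splits a ';'-separated value such as "description;keywords" into names. */
    public static Set<String> parseNames(String value) {
        Set<String> names = new LinkedHashSet<>();
        for (String name : value.split(";")) {
            String trimmed = name.trim().toLowerCase(Locale.ROOT); // assumption: case-insensitive
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** '*' (the default value) keeps every metatag; otherwise only listed names are kept. */
    public static boolean accepts(Set<String> names, String metatag) {
        return names.contains("*") || names.contains(metatag.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> names = parseNames("description;keywords");
        System.out.println(accepts(names, "keywords") + " " + accepts(names, "author"));
    }
}
```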