[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245798#comment-13245798 ]

Lewis John McGibbney commented on NUTCH-1306:

Hi Dan. In trunk, we have a number of nice features which I would like to bring to your attention. Maybe you can comment on whether you would like to see some of them go into Nutchgora? Namely, NUTCH-1185, NUTCH-1000, NUTCH-996, NUTCH-991 and NUTCH-799. wdyt?

Commit after finished writing to solr index

Key: NUTCH-1306
URL: https://issues.apache.org/jira/browse/NUTCH-1306
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
Fix For: nutchgora
Attachments: NUTCH-1306.patch

Commit after finished writing to solr index - otherwise a bit confusing not seeing the number of docs we expect in solr

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241207#comment-13241207 ]

Lewis John McGibbney commented on NUTCH-1024:

I like this Markus. Although I need to be honest and say that I've not had time to give it a spin recently, so apologies for this. It looks like the process to date has been a bit frustrating, so I apologize for not chipping in earlier.

Anyway, we don't rely on commons for logging, could you please replace this with
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
{code}

One further point from me: you make reference to the following configuration properties
{code}
SCHEDULE_INC_RATE = "db.fetch.schedule.adaptive.inc_rate";
SCHEDULE_DEC_RATE = "db.fetch.schedule.adaptive.dec_rate";
SCHEDULE_MIME_FILE = "db.fetch.schedule.mime.file";
{code}
I don't see the new MIME_FILE property added in the patch, nor do I see the INC and DEC properties added to nutch-default.xml.

Thanks

Dynamically set fetchInterval by MIME-type

Key: NUTCH-1024
URL: https://issues.apache.org/jira/browse/NUTCH-1024
Project: Nutch
Issue Type: New Feature
Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.5
Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, Nutch.patch, adaptive-mimetypes.txt

Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between.
* simple key\tvalue\n configuration file
* only set fetchInterval for new documents
* keep max fetchInterval fixed by current config
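The "simple key\tvalue\n configuration file" from the NUTCH-1024 description could be read along the following lines. This is only a sketch and not the code from the attached patch: the class name `MimeIntervalTable`, the `#` comment syntax, the choice to skip malformed rows silently, and the assumption that values are whole numbers of seconds are all illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or of the NUTCH-1024 patch.
class MimeIntervalTable {

  /** Parses "mimeType<TAB>interval" rows, one per line, into a lookup map. */
  static Map<String, Integer> parse(String text) {
    Map<String, Integer> intervals = new HashMap<>();
    for (String line : text.split("\n")) {
      String trimmed = line.trim();
      // Assumption: blank lines and '#' comments are allowed and ignored.
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] parts = trimmed.split("\t");
      if (parts.length != 2) {
        continue; // malformed row: not exactly one key and one value
      }
      try {
        intervals.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
      } catch (NumberFormatException e) {
        // value is not an integer: skip the row
      }
    }
    return intervals;
  }

  /** Returns the configured interval, or the default for unlisted MIME types. */
  static int intervalFor(Map<String, Integer> table, String mimeType, int defaultInterval) {
    return table.getOrDefault(mimeType, defaultInterval);
  }
}
```

Following the second bullet of the description, a schedule built on such a table would presumably consult it only when a document is first seen, and leave later adjustments to the existing adaptive logic.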
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241212#comment-13241212 ]

Lewis John McGibbney commented on NUTCH-1320:

Nice Markus. +1. Is there scope for this to be applied elsewhere, or is parsechecker the only instance (so far) where you've encountered the problem?

IndexChecker and ParseChecker choke on IDN's

Key: NUTCH-1320
URL: https://issues.apache.org/jira/browse/NUTCH-1320
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5
Attachments: NUTCH-1320-1.5-1.patch

These handy debug tools do not handle IDN's and throw an NPE

bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
{code}
Exception in thread "main" java.lang.NullPointerException
    at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}
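One way to make a URL with an internationalized domain name (IDN) safe for tools that expect ASCII hosts is to convert the host to its Punycode form with the JDK's `java.net.IDN` before further processing. The sketch below illustrates that idea only; whether the attached patch takes this route is not stated in this thread, and the class and method names are hypothetical.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper illustrating IDN handling; not taken from the patch.
class IdnUrls {

  /** Converts a single host name to its ASCII (Punycode) form. */
  static String toAsciiHost(String host) {
    return IDN.toASCII(host);
  }

  /** Rebuilds the URL with an ASCII host, leaving port and path untouched. */
  static String toAsciiUrl(String url) {
    try {
      URL parsed = new URL(url);
      String asciiHost = IDN.toASCII(parsed.getHost());
      return new URL(parsed.getProtocol(), asciiHost, parsed.getPort(), parsed.getFile()).toString();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("not a valid URL: " + url, e);
    }
  }
}
```

Hosts that are already plain ASCII pass through `IDN.toASCII` unchanged, so the conversion can be applied unconditionally.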
[jira] [Commented] (NUTCH-366) Merge URLFilters and URLNormalizers
[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234349#comment-13234349 ]

Lewis John McGibbney commented on NUTCH-366:

Hi Apurv, this is great news :) I suggest that, if you have not already done so, you take a look at NUTCH-365. Try to put the material Andrzej mentioned into context. In parallel I would take a look at the way the current URLFilters and URLNormalizers are constructed with regard to option 1 in the issue description. It would be great to get this moving as a GSoC project.

Merge URLFilters and URLNormalizers

Key: NUTCH-366
URL: https://issues.apache.org/jira/browse/NUTCH-366
Project: Nutch
Issue Type: Improvement
Reporter: Andrzej Bialecki
Labels: gsoc2012

Currently Nutch uses two subsystems related to url validation and normalization:
* URLFilter: this interface checks if URLs are valid for further processing. Input URL is not changed in any way. The output is a boolean value.
* URLNormalizer: this interface brings URLs to their base (normal) form, or removes unneeded URL components, or performs any other URL mangling as necessary. Input URLs are changed, and are returned as result.

However, various Nutch tools run filters and normalizers in pre-determined order, i.e. normalizers first, and then filters. In some cases, where normalizers are complex and running them is costly (e.g. numerous regex rules, DNS lookups) it would make sense to run some of the filters first (e.g. prefix-based filters that select only certain protocols, or suffix-based filters that select only known extensions). This is currently not possible - we always have to run normalizers, only to later throw away urls because they failed to pass through filters.

I would like to solicit comments on the following two solutions, and work on implementation of one of them:

1) we could make URLFilters and URLNormalizers implement the same interface, and basically make them interchangeable. This way users could configure their order arbitrarily, even mixing filters and normalizers out of order. This is more complicated, but gives much more flexibility - and NUTCH-365 already provides sufficient framework to implement this, including the ability to define different sequences for different steps in the workflow.

2) we could use a property url.mangling.order ;) to define whether normalizers or filters should run first. This is simple to implement, but provides only limited improvement - because either all filters or all normalizers would run, they couldn't be mixed in arbitrary order.

Any comments?
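Option 1 from the NUTCH-366 description, where filters and normalizers share one interface and run in a user-defined order, can be pictured roughly as below. Everything here is hypothetical: `UrlProcessor`, `UrlChain` and the example rules are illustrative names, and the convention that `null` means "rejected" is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical unified contract: return the (possibly rewritten) URL, or null to reject it. */
interface UrlProcessor {
  String process(String url);
}

/** Runs processors in the configured order and stops at the first rejection. */
class UrlChain {
  private final List<UrlProcessor> steps = new ArrayList<>();

  UrlChain add(UrlProcessor step) {
    steps.add(step);
    return this;
  }

  String run(String url) {
    String current = url;
    for (UrlProcessor step : steps) {
      current = step.process(current);
      if (current == null) {
        return null; // rejected: later, possibly costly, steps are skipped
      }
    }
    return current;
  }
}

class UrlChainExample {
  static UrlChain build() {
    return new UrlChain()
        // cheap prefix filter first, so other schemes never reach the normalizer
        .add(u -> (u.startsWith("http://") || u.startsWith("https://")) ? u : null)
        // normalizer: strip the fragment part
        .add(u -> {
          int hash = u.indexOf('#');
          return hash < 0 ? u : u.substring(0, hash);
        })
        // suffix filter that relies on the normalized form
        .add(u -> u.endsWith(".exe") ? null : u);
  }
}
```

Because a filter is just a processor that returns its input or `null`, the cheap prefix check can run before an expensive normalizer, which is the ordering the description says is currently impossible.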
[jira] [Commented] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235108#comment-13235108 ]

Lewis John McGibbney commented on NUTCH-809:

Hi Julien,

Can you confirm what you would like to see added to the wiki? I will try my best to get this added. Are you referring to [0]?

Also, I thought the best thing to do regarding porting to Nutchgora is just to add it to the ever-growing NUTCH-1104 list, so I have done so. If and when this is required over there someone can duly oblige :)

Regarding adding fields to Solr, I assume you mean schema and solr-mapping.xml?

Finally, can you expand on 'activate by default'? What exactly is it that is not activated by default? I read your README.txt but I can't see any mention of it in there.

Thanks

Oh, and great patch; this is one which, as we know, is very much appreciated by everyone.

[0] http://wiki.apache.org/nutch/IndexStructure

Parse-metatags plugin

Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.5
Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip

h2. Parse-metatags plugin

The parse-metatags plugin consists of a HTMLParserFilter which takes as parameter a list of metatag names with '*' as default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 'keywords' and 'description'. Note that keywords is multivalued.

The query-basic plugin is used to include these fields in the search e.g. in nutch-site.xml
{code:xml}
<property>
  <name>query.basic.description.boost</name>
  <value>2.0</value>
</property>
<property>
  <name>query.basic.keywords.boost</name>
  <value>2.0</value>
</property>
{code}

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com
[jira] [Commented] (NUTCH-1317) Max content length by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233828#comment-13233828 ]

Lewis John McGibbney commented on NUTCH-1317:

Do you have any indication as to why this is, Markus? Which plugin are you using to parse your html?

Max content length by MIME-type

Key: NUTCH-1317
URL: https://issues.apache.org/jira/browse/NUTCH-1317
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5

The good old http.content.length directive is not sufficient in large internet crawls. For example, a 5MB PDF file may be parsed without issues but a 5MB HTML file may time out.
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232588#comment-13232588 ]

Lewis John McGibbney commented on NUTCH-978:

Great Ammar. Are you wanting to add this as a GSoC2012 project? I am already mentoring one project, and time/work restrictions mean that I can't step up to take on another mentoring role. If you don't wish to make this a project this year, at least the code is on here for guys to pick it up in the future.

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
Labels: gsoc2011, mentor
Fix For: nutchgora
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, version_alpha2.zip
Original Estimate: 1,680h
Remaining Estimate: 1,680h

Nutch uses the parse-html plugin to parse web pages. It processes the contents of a web page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted text to be stored in the index. By default Nutch cannot select particular elements of an HTML page, such as certain tags, certain content or some part of the page.

An HTML page has a tree-like XML structure with HTML tags as its branches and text as its nodes. These branches and nodes can be extracted using XPath. XPath allows us to select a particular branch or node of an XML document and can therefore be used to extract certain information and treat it differently based on its content and the user's requirements. Furthermore, a web domain such as a news website usually uses the same HTML structure for storing information on its pages. That structure can be parsed with the same XPath query to retrieve the same content element. All of the XPath queries for selecting various content can be stored in an XPath configuration file.

Nutch is intended for varied web sources, and not all pages retrieved from those sources have the same HTML structure, so they have to be treated differently using the correct XPath configuration. The correct XPath configuration can be selected automatically by matching the URL of the web page against a regex of valid URL patterns for that configuration.

This automatic mechanism allows Nutch users to process various web pages and get only the information they want, making the index more accurate and its content more flexible.

The components for this idea have been tested on Nutch 1.2 for selecting certain elements on various news websites for the purpose of document clustering. This includes a Configuration Editor application built using the NetBeans 6.9 Application Framework, though it needs some debugging.

http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
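The mechanism described in NUTCH-978, choosing an XPath query by matching the page URL against a regex and then evaluating it on the page, can be sketched with the JDK's `javax.xml.xpath` API. This is an illustration under assumptions, not the attached plugin: `XPathExtractor` and its methods are hypothetical names, and the sketch only accepts well-formed XHTML, whereas real-world HTML would first need a tolerant parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch of "URL regex selects the XPath configuration".
class XPathExtractor {
  // URL pattern -> XPath expression, checked in insertion order
  private final Map<Pattern, String> rules = new LinkedHashMap<>();

  XPathExtractor addRule(String urlRegex, String xpathExpression) {
    rules.put(Pattern.compile(urlRegex), xpathExpression);
    return this;
  }

  /** Returns the extracted text, or null when no rule matches the URL. */
  String extract(String url, String xhtml) {
    for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
      if (rule.getKey().matcher(url).matches()) {
        try {
          XPath xpath = XPathFactory.newInstance().newXPath();
          // evaluate(String, InputSource) parses the input and returns the string value
          return xpath.evaluate(rule.getValue(), new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
          throw new IllegalArgumentException("bad XPath or unparsable input", e);
        }
      }
    }
    return null;
  }
}
```

A rule such as `addRule("https?://news\\.example\\.org/.*", "//div[@id='headline']/text()")` would then apply only to pages of that (made-up) site, while pages from other sites fall through to their own rules or are left alone.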
[jira] [Commented] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?
[ https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232836#comment-13232836 ]

Lewis John McGibbney commented on NUTCH-1315:

Regarding your comment, e.g. does not turn on reduce speculation, my initial thought is no. I will try to confirm/iron out. Do you have any speculation settings configured for Hadoop at all?

reduce speculation on but ParseOutputFormat doesn't name output files correctly?

Key: NUTCH-1315
URL: https://issues.apache.org/jira/browse/NUTCH-1315
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: ubuntu 64bit, hadoop 1.0.1, 3 Node Cluster, segment size 1.5M urls
Reporter: Rafael
Labels: hadoop, hdfs

From time to time the Reducer log contains the following and one tasktracker gets blacklisted.

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-1/data for DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
    at org.apache.hadoop.ipc.Client.call(Client.java:1066)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy2.create(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy2.create(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3245)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
    at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
    at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1132)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
    at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:157)
    at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:134)
    at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:92)
    at org.apache.nutch.parse.ParseOutputFormat.getRecordWriter(ParseOutputFormat.java:110)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:448)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

I asked the hdfs-user mailing list and i got the following answer: Looks
[jira] [Commented] (NUTCH-1273) Fix [deprecation] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232309#comment-13232309 ]

Lewis John McGibbney commented on NUTCH-1273:

Still some work to be done with trunk. Rolling back changes with Nutchgora as I've broken it :( I'll try to pick this up again soon.

Fix [deprecation] javac warnings

Key: NUTCH-1273
URL: https://issues.apache.org/jira/browse/NUTCH-1273
Project: Nutch
Issue Type: Sub-task
Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: nutchgora, 1.5
Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch, NUTCH-1273-v2-trunk.patch

As part of this task, these warnings should be resolved, however this particular strand of warnings can either be resolved by adding
{code}
@SuppressWarnings("deprecation")
{code}
or by actually upgrading our class usage to rely upon non-deprecated classes. Which option is more appropriate for the project?
[jira] [Commented] (NUTCH-1310) Nutch to send HTTP-accept header
[ https://issues.apache.org/jira/browse/NUTCH-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230400#comment-13230400 ]

Lewis John McGibbney commented on NUTCH-1310:

Looks good to me Markus. +1

Nutch to send HTTP-accept header

Key: NUTCH-1310
URL: https://issues.apache.org/jira/browse/NUTCH-1310
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5
Attachments: NUTCH-1310-1.5-1.patch

Nutch does not send an HTTP Accept header with its requests. This is usually not a problem, but some firewalls do not like it and will reject the request.
[jira] [Commented] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227963#comment-13227963 ]

Lewis John McGibbney commented on NUTCH-882:

Mathijs, my opinion is that you have a clean sheet of paper to begin with certain aspects of this one (simply because you've stepped up to take it on). You obviously have your own idea about how you would like to see the new host table design and also have justification behind the eventual implementation (and API break/redesign) of NutchContext. I think it's wise to think sensibly about NOT breaking the plugin API at this stage, and an incremental approach to addressing this one is a suitable strategy. Feel free to open another issue for NutchContext, as quite rightly this appears to have now morphed into its own subdomain of the umbrella issue.

Design a Host table in GORA

Key: NUTCH-882
URL: https://issues.apache.org/jira/browse/NUTCH-882
Project: Nutch
Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: nutchgora
Attachments: NUTCH-882-v1.patch, hostdb.patch

Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:
* customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc...
* storing stats
* keeping metadata and possibly propagate them to the webpages
* keeping a copy of the robots.txt and possibly use that later to filter the webtable
* store sitemaps files and update the webtable accordingly

I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome
[jira] [Commented] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark
[ https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225159#comment-13225159 ]

Lewis John McGibbney commented on NUTCH-1304:

+1 for commit. I'll wait until this afternoon to hear back from anyone else before doing so. Thanks Dan.

GeneratorMapper.java dosen't return when skipping and already generated mark

Key: NUTCH-1304
URL: https://issues.apache.org/jira/browse/NUTCH-1304
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
Fix For: nutchgora
Attachments: NUTCH-1304.patch

GeneratorMapper.java doesn't return when skipping an already generated mark
[jira] [Commented] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries
[ https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225206#comment-13225206 ]

Lewis John McGibbney commented on NUTCH-1305:

+1

Domain(blacklist)URLFilter to trim entries

Key: NUTCH-1305
URL: https://issues.apache.org/jira/browse/NUTCH-1305
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.5
Attachments: NUTCH-1305-1.5-1.patch

Both filters should handle entries with trailing whitespace.
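The fix requested in NUTCH-1305 amounts to trimming each entry as the domain list is loaded. A minimal sketch of that idea follows; it is not the attached patch, the class name `DomainList` is hypothetical, and the handling of blank lines and `#` comments is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical loader showing the trimming step only.
class DomainList {

  /** Parses one domain per line, ignoring surrounding whitespace. */
  static Set<String> parse(String text) {
    Set<String> domains = new LinkedHashSet<>();
    for (String line : text.split("\n")) {
      String entry = line.trim(); // drops trailing spaces, tabs and a stray '\r'
      if (!entry.isEmpty() && !entry.startsWith("#")) {
        domains.add(entry);
      }
    }
    return domains;
  }
}
```

Without the `trim()` call an entry such as `"example.org "` would never equal the host `"example.org"`, so the filter would silently fail to match it.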
[jira] [Commented] (NUTCH-1304) GeneratorMapper.java dosen't return when skipping and already generated mark
[ https://issues.apache.org/jira/browse/NUTCH-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225270#comment-13225270 ]

Lewis John McGibbney commented on NUTCH-1304:

Please close this one off when you have time, Dan. Thank you.

GeneratorMapper.java dosen't return when skipping and already generated mark

Key: NUTCH-1304
URL: https://issues.apache.org/jira/browse/NUTCH-1304
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
Fix For: nutchgora
Attachments: NUTCH-1304.patch

GeneratorMapper.java doesn't return when skipping an already generated mark
[jira] [Commented] (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225389#comment-13225389 ]

Lewis John McGibbney commented on NUTCH-728:

Looking at this, then at what we have available on our mirrors, I don't really see the need at the moment (unless it would make the release process easier) to include this code. Chris already provides us with src.tar.gz with every release? I suppose this one's really down to the release manager's opinion.

Improve nutch release packaging

Key: NUTCH-728
URL: https://issues.apache.org/jira/browse/NUTCH-728
Project: Nutch
Issue Type: Improvement
Reporter: Sami Siren
Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, NUTCH-728.patch

see the discussion from http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact
[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223504#comment-13223504 ]

Lewis John McGibbney commented on NUTCH-1253:

Hi Ferdy, the patches I attached were identical for branch Nutchgora and trunk. I would have assumed if trunk was incorrect then Nutchgora would have shadowed this behaviour.

Incompatible neko and xerces versions

Key: NUTCH-1253
URL: https://issues.apache.org/jira/browse/NUTCH-1253
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch

The Nutch 1.4 distribution includes
- nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
- xercesImpl-2.9.1.jar (under .../runtime/local/lib)

These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stacktrace.)

I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.

Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:

<plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
  <runtime>
    <library name="nekohtml-0.9.5.jar">
      <export name="*"/>
    </library>
  </runtime>
</plugin>

Note the conflicting version numbers (version tag is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?
[jira] [Commented] (NUTCH-945) Indexing to multiple SOLR Servers
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221898#comment-13221898 ] Lewis John McGibbney commented on NUTCH-945: On user@, Julien made some excellent comments on this one [0]. I would like to see these incorporated, although admittedly I've not checked out the patch, Sujit (so please excuse me if these points are already addressed). My justification behind this is simply longevity. Markus stated {bq}If Solr 4.0 is released in the coming months (and that's what it looks like) i would suggest to patch Nutch to allow for a list of Solr server URL's instead of doing partitioning on the client site.{bq} I agree with this; however, until we witness a Solr 4.0 release (currently sitting at 348 issues [2]), I don't see why this can't be integrated into Nutchgora.
[0] http://www.mail-archive.com/user@nutch.apache.org/msg05664.html
[1] http://www.mail-archive.com/user@nutch.apache.org/msg05674.html
[2] https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+SOLR+AND+resolution+%3D+Unresolved+AND+fixVersion+%3D+%224.0%22+ORDER+BY+priority+DESC&mode=hide
Indexing to multiple SOLR Servers - Key: NUTCH-945 URL: https://issues.apache.org/jira/browse/NUTCH-945 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Charan Malemarpuram Attachments: MurmurHashPartitioner.java, NonPartitioningPartitioner.java, patch-NUTCH-945.txt It would be nice to have a default Indexer in Nutch which can submit docs to multiple SOLR servers. Partitioning is always the question when writing to multiple SOLR servers. Default partitioning can be a simple hashcode-based distribution with additional hooks for customization.
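The "simple hashcode based distribution" proposed in NUTCH-945 above could look roughly like the following. This is only a minimal sketch: the class name SolrServerPartitioner and the method serverFor are hypothetical, and it does not reflect the attached MurmurHashPartitioner.java or any Nutch API.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch or the attached patch.
final class SolrServerPartitioner {
    private final List<String> serverUrls;

    SolrServerPartitioner(List<String> serverUrls) {
        if (serverUrls.isEmpty()) {
            throw new IllegalArgumentException("need at least one server URL");
        }
        this.serverUrls = List.copyOf(serverUrls);
    }

    /** Maps a document id to one server; the same id always maps to the same server. */
    String serverFor(String docId) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(docId.hashCode(), serverUrls.size());
        return serverUrls.get(index);
    }
}
```

A "hook for customization" could then be as small as an interface with the same serverFor method, so that other partitioning schemes can be swapped in.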
[jira] [Commented] (NUTCH-1291) Fetcher to stringify exception on // unexpected exception
[ https://issues.apache.org/jira/browse/NUTCH-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219222#comment-13219222 ] Lewis John McGibbney commented on NUTCH-1291: - It's a plus 1 from me mate :) Fetcher to stringify exception on // unexpected exception - Key: NUTCH-1291 URL: https://issues.apache.org/jira/browse/NUTCH-1291 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.5 Attachments: NUTCH-1291-1.5-1.patch During development we sometimes saw a less than helpful exception, e.g. fetch of http://www.openindex.io/en/home.html failed with: java.lang.NullPointerException. This error must be a bit more descriptive.
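To illustrate what "stringifying" an exception means in NUTCH-1291 above, here is a minimal JDK-only sketch. The class name ExceptionText is hypothetical, and the attached patch may well use a different helper; the point is only that the full stack trace ends up in the log message instead of the bare exception class name.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper for illustration; the actual patch may use another utility.
final class ExceptionText {
    /** Renders the exception, its message and its stack trace as one string. */
    static String stringify(Throwable t) {
        StringWriter buffer = new StringWriter();
        try (PrintWriter out = new PrintWriter(buffer)) {
            t.printStackTrace(out);
        }
        return buffer.toString();
    }
}
```

A fetcher could then log something like "fetch of <url> failed with: " followed by ExceptionText.stringify(e), which shows where the NullPointerException was thrown.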
[jira] [Commented] (NUTCH-670) feed plugin does not parse RSS2 enclosures
[ https://issues.apache.org/jira/browse/NUTCH-670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218098#comment-13218098 ] Lewis John McGibbney commented on NUTCH-670: Sure is. Not to worry. Thanks feed plugin does not parse RSS2 enclosures -- Key: NUTCH-670 URL: https://issues.apache.org/jira/browse/NUTCH-670 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Todd Lipcon Priority: Minor Original Estimate: 1h Remaining Estimate: 1h The feed parser in plugins/feed does not count links found in RSS2 enclosure tags as Outlinks. It's a pretty simple patch - SyndEntry has a getEnclosures call. I'll submit the patch tomorrow.
[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned
[ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217155#comment-13217155 ] Lewis John McGibbney commented on NUTCH-1289: - Hi Dan, thanks for opening this issue and for the patch. Are you using trunk at all? If so, is it possible to confirm whether this functionality is already running in trunk? If not, then we can get a patch cooked up. In distributed mode URL's are not partitioned - Key: NUTCH-1289 URL: https://issues.apache.org/jira/browse/NUTCH-1289 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora Reporter: Dan Rosher Fix For: nutchgora Attachments: NUTCH-1289.patch In distributed mode URL's are not partitioned to a specific machine, which means the politeness policy is voided.
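The politeness problem in NUTCH-1289 above comes down to making sure that all URLs of one host land on the same machine, so a single fetcher can enforce the per-host delay. A minimal sketch of that idea follows; the class HostPartitioner is hypothetical and is not the code from NUTCH-1289.patch.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration; not the partitioner shipped with Nutch.
final class HostPartitioner {
    /** Returns a partition in [0, numPartitions) that depends only on the URL's host. */
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Fall back to the whole URL when no host can be extracted.
        String key = (host == null) ? url : host.toLowerCase(Locale.ROOT);
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because only the host feeds the hash, http://example.org/a and http://example.org/b?x=1 always end up in the same partition, whatever the number of partitions.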
[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned
[ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217159#comment-13217159 ] Lewis John McGibbney commented on NUTCH-1289: - Markus, what is your opinion as to which suits best? Or is it the case in Nutchgora that Dan's patch is more appropriate? In distributed mode URL's are not partitioned - Key: NUTCH-1289 URL: https://issues.apache.org/jira/browse/NUTCH-1289 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora Reporter: Dan Rosher Fix For: nutchgora Attachments: NUTCH-1289.patch In distributed mode URL's are not partitioned to a specific machine which means the politeness policy is voided
[jira] [Commented] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)
[ https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216710#comment-13216710 ] Lewis John McGibbney commented on NUTCH-1286: - For reference, a brief description from Marko regarding the UI which was designed here [0].
+ a new extension point that describes a ui component
+ a ui component is a plugin that uses backend classes from nutch to provide functionality (e.g. inject, fetch, configuration or whatever)
+ a ui component can be deployed to a webserver as a new webapp
+ an application that starts a webserver (e.g. jetty) and deploys all implemented ui components to the webserver
the goal was to use the plugin api to separately develop ui components that can be deployed to the webserver as a new context.
+ every ui component can have more than one instance
+ with this approach we were able to create different types of crawls (e.g. fast crawl, long running crawl ...)
+ every type has one instance of a ui component
+ an important ui component we implemented was a component to configure the Configuration object
+ with that you can configure your crawl instance with different plugins or different configurations for a fetcher or whatever
our ui components were directly using the nutch backend.
It would be nice to compile a diff list describing changes between implementations.
[0] https://github.com/101tec/nutch
Refactoring/reimplementing crawling API (NutchApp) -- Key: NUTCH-1286 URL: https://issues.apache.org/jira/browse/NUTCH-1286 Project: Nutch Issue Type: Improvement Components: administration gui, REST_api, web gui Reporter: Ferdy Galema This issue is to track changes we (Mathijs and I) have planned for the API and webapp in Nutchgora. We have a pretty good idea of how we want to be using the crawl API. It may involve some major refactoring or perhaps a side implementation next to the current NutchApp functionality. It depends on how much we can reuse the existing components.
The bottom line is that there will be a strictly defined Java API that provides everything from crawling/indexing to job control. (Listing jobs, tracking progress and aborting jobs being part of it). There will be no server or service for tracking crawling states; all will be persisted one way or the other and queryable from the API. The REST server shall be a very thin layer on top of the Java implementation. A rich web interface will be a very easy layer to add too, once we have a cleanly (but extensively) defined API. But we will start by making the API usable from a simple command-line interface. More details will be provided later on. Feel free to comment if you have suggestions/questions.
[jira] [Commented] (NUTCH-728) Improve nutch release packaging
[ https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216750#comment-13216750 ] Lewis John McGibbney commented on NUTCH-728: Ok to commit? Improve nutch release packaging --- Key: NUTCH-728 URL: https://issues.apache.org/jira/browse/NUTCH-728 Project: Nutch Issue Type: Improvement Reporter: Sami Siren Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, NUTCH-728.patch see the discussion from http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact
[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216751#comment-13216751 ] Lewis John McGibbney commented on NUTCH-1253: - Anyone had time to try this one out? Incompatible neko and xerces versions - Key: NUTCH-1253 URL: https://issues.apache.org/jira/browse/NUTCH-1253 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Environment: Ubuntu 10.04 Reporter: Dennis Spathis Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch
[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214598#comment-13214598 ] Lewis John McGibbney commented on NUTCH-965: Yeah, this is confirmed, Ferdy. I spun a build and you're right. Another headache to deal with :) Relentless! Skip parsing for truncated documents Key: NUTCH-965 URL: https://issues.apache.org/jira/browse/NUTCH-965 Project: Nutch Issue Type: Improvement Components: parser Reporter: Alexis Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch The issue you're likely to run into when parsing truncated FLV files is described here: http://www.mail-archive.com/user@nutch.apache.org/msg01880.html The parser library gets stuck in an infinite loop when it encounters corrupted data caused by, for example, truncating big binary files at fetch time.
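One way to detect the truncation described in NUTCH-965 above is to compare the length announced by the server with the number of bytes actually stored. The sketch below assumes that the Content-Length header value is available as a string; the class TruncationCheck is hypothetical and is not taken from the attached patches.

```java
// Hypothetical helper for illustration; not the check used in the NUTCH-965 patches.
final class TruncationCheck {
    /** Returns true when the server announced more bytes than were actually stored. */
    static boolean isTruncated(String contentLengthHeader, int fetchedBytes) {
        if (contentLengthHeader == null) {
            return false; // no header: truncation cannot be determined
        }
        try {
            long announced = Long.parseLong(contentLengthHeader.trim());
            return announced > fetchedBytes;
        } catch (NumberFormatException e) {
            return false; // malformed header: treat as unknown
        }
    }
}
```

A parse job could skip any document for which this returns true, so the parser library never sees the corrupted tail of a cut-off binary file.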
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212576#comment-13212576 ] Lewis John McGibbney commented on NUTCH-978: No bother Chris. So far the questions that have been asked are:
1. Provide a quick rundown on the issue, summarizing all of the above.
2. What were the motivations, purpose and technical challenges encountered whilst working on it?
3. Why did the issue drop away and what do you think is required to get it back on track and possibly in the codebase?
[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing. --- Key: NUTCH-978 URL: https://issues.apache.org/jira/browse/NUTCH-978 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.2 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9 Reporter: Ammar Shadiq Assignee: Chris A. Mattmann Priority: Minor Labels: gsoc2011, mentor Fix For: nutchgora Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip Original Estimate: 1,680h Remaining Estimate: 1,680h
Nutch uses the parse-html plugin to parse web pages. It processes the contents of a web page by removing HTML tags and components like JavaScript and CSS, leaving the extracted text to be stored in the index. By default, Nutch doesn't have the capability to select certain atomic elements on an HTML page, like certain tags, certain content, some part of the page, etc. An HTML page has a tree-like XML pattern, with HTML tags as its branches and text as its nodes. These branches and nodes could be extracted using XPath.
XPath allows us to select a certain branch or node of an XML document and therefore could be used to extract certain information and treat it differently based on its content and the user requirements. Furthermore, a web domain like a news website usually has the same HTML code structure for storing the information on its web pages. This same HTML code structure could be parsed using the same XPath query to retrieve the same content information element. All of the XPath queries for selecting various content could be stored in an XPath configuration file. Nutch is meant to handle various web sources, and not all of the web pages retrieved from those sources have the same HTML code structure, so they have to be treated differently using the correct XPath configuration. The selection of the correct XPath configuration could be done automatically using regex, by matching the URL of the web page against a valid URL pattern for that XPath configuration. This automatic mechanism allows the user of Nutch to process various web pages and get only the information that the user wants, therefore making the index more accurate and its content more flexible. The component for this idea has been tested on Nutch 1.2 for selecting certain elements on various news websites for the purpose of document clustering. This includes a Configuration Editor Application built using the NetBeans 6.9 Application Framework, though it needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
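The mechanism described in NUTCH-978 above (pick an XPath expression by matching the page URL against a regex, then evaluate it against the page) can be sketched with the JDK's own javax.xml.xpath API. This is an illustration only: the class XPathExtractor, its rule format and the example URLs are assumptions and are not taken from for_GSoc.zip. It also assumes well-formed XHTML input; real-world HTML would first need a tolerant parser to produce a DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch; the rule format is an assumption, not the plugin's configuration file.
final class XPathExtractor {
    // Each rule pairs a URL pattern with the XPath expression to use for matching pages.
    private final List<Map.Entry<Pattern, String>> rules = new ArrayList<>();

    void addRule(String urlRegex, String xpath) {
        rules.add(Map.entry(Pattern.compile(urlRegex), xpath));
    }

    /** Returns the text selected by the first rule whose pattern matches the URL, or null. */
    String extract(String url, String xhtml) {
        for (Map.Entry<Pattern, String> rule : rules) {
            if (rule.getKey().matcher(url).matches()) {
                try {
                    Document doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder()
                            .parse(new InputSource(new StringReader(xhtml)));
                    return XPathFactory.newInstance().newXPath().evaluate(rule.getValue(), doc);
                } catch (Exception e) {
                    throw new IllegalStateException("could not evaluate XPath for " + url, e);
                }
            }
        }
        return null; // no configuration applies to this URL
    }
}
```

For example, a rule added with the pattern "https?://news\\.example\\.org/.*" and the expression "/html/body/h1" would pull only the headline from pages of that (made-up) site and ignore pages from any other host.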
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212582#comment-13212582 ] Lewis John McGibbney commented on NUTCH-978: Replies:
1 and 2. The main motivation for this issue was processing the news documents required for my undergrad thesis on Bahasa Indonesia news text clustering. It needed preprocessing to extract the title, news content, date and related news links separately.
2. The biggest technical challenge for me was processing the web page so that it could be parsed as an XML document and queried with XPath.
3. The issue dropped away because, with a small tweak, I could get it working for my thesis requirements only. I haven't tested it with web pages other than the ones I used for my thesis, so I think it's not anywhere near finished yet. And since the proposal was not accepted as a GSoC project, I lost motivation to continue working on this issue and decided to work on my thesis instead.
Related issue: https://issues.apache.org/jira/browse/NUTCH-185
[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing. --- Key: NUTCH-978 URL: https://issues.apache.org/jira/browse/NUTCH-978
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212584#comment-13212584 ] Lewis John McGibbney commented on NUTCH-978: Generally speaking, the plugin sounds really useful. The only problem I see is that it is very specific. For it to be integrated into the code base, we usually need to make it specific enough to address some given task fully and in a well defined and well justified manner, but we also need to make it general enough to be used in many different contexts. This increases usability and user feedback as well as engagement.
4. With regards to the biggest technical challenge being the processing of web pages, how far did you get with this? Were you able to process it with enough precision to satisfy your requirements?
5. How were you querying it with XPath? You cannot query with XPath, but instead with XQuery. Do you maybe mean that this enabled you to navigate the document and address various parts of it with XPath?
6. Ok, I understand why it has crumbled slightly, but I think that if the code is there it would be a huge waste if we didn't try to revive it, possibly getting it integrated into the code base, or maybe getting it added as a contrib component without shipping it within the core codebase if the former is not a viable option.
I've had a look at NUTCH-185, but I think we can discard this as it was addressed a very long time ago; it's also already integrated into the codebase. I was referring more to Jira issues which are currently open, which we could maybe merge or combine to give this a more general and possibly more justified argument for inclusion in the codebase... what do you think? Does NUTCH-585 fit this?
[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing. --- Key: NUTCH-978 URL: https://issues.apache.org/jira/browse/NUTCH-978
[jira] [Commented] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory
[ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211779#comment-13211779 ] Lewis John McGibbney commented on NUTCH-1001: - Great :0) bin/nutch fetch/parse handle crawl/segments directory - Key: NUTCH-1001 URL: https://issues.apache.org/jira/browse/NUTCH-1001 Project: Nutch Issue Type: Improvement Reporter: Gabriele Kahlout Priority: Minor Fix For: 1.5 Attachments: NUTCH-1001.patch I'm having issues porting scripts across different systems to support the step of extracting the latest/only segments resulting from the generate phase. Variants include:
$ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
$ s1=`ls -d crawl/segments/2* | tail -1` #[2]
$ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
$ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
And I'm not sure what Windows users would have to do. Some users may also make do with: bin/nutch fetch with crawl/segments/2* But I don't see a need for having the user extract or worry about the latest/only segment, and for having it as a described step in every Nutch tutorial. Moreover, only fetch and parse expect a segment, while other commands are fine with the directory of segments. Therefore, I think it's beneficial if fetch and parse also handle directories of segments.
[1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
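What the shell variants in NUTCH-1001 above try to do is pick the newest segment out of a segments directory. Done inside the tool, it could look like the sketch below. It assumes a local file system and timestamp-style segment names, so that the lexicographically greatest name is the latest one (the same assumption the `ls ... | tail -1` variants make); the class SegmentPicker is hypothetical and is not the code in NUTCH-1001.patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; local file system only, no HDFS support.
final class SegmentPicker {
    /** Picks the greatest name, which is the newest segment for timestamp-style names. */
    static Optional<String> latest(Collection<String> segmentNames) {
        return segmentNames.stream().max(Comparator.naturalOrder());
    }

    /** Lists the sub-directories of segmentsDir and returns the newest one, if any. */
    static Optional<Path> latestSegment(Path segmentsDir) {
        try (Stream<Path> children = Files.list(segmentsDir)) {
            List<String> names = children
                    .filter(Files::isDirectory)
                    .map(p -> p.getFileName().toString())
                    .collect(Collectors.toList());
            return latest(names).map(segmentsDir::resolve);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this in fetch and parse, a user could pass crawl/segments directly and the same script would behave identically on every platform.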
[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211314#comment-13211314 ] Lewis John McGibbney commented on NUTCH-1281: - Hi behnam, there is a similar issue open (NUTCH-965) and a patch has been submitted for Nutchgora. I wonder if you can check it out and comment on the link between these two. Also, would it be possible for you to attach your code changes as a patch against trunk, which I guess is what you are using? Thank you tika parser not work properly with unwanted file types that passed from filters in nutch Key: NUTCH-1281 URL: https://issues.apache.org/jira/browse/NUTCH-1281 Project: Nutch Issue Type: Improvement Components: parser Reporter: behnam nikbakht When this property is set in parse-plugins.xml:
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
all unwanted files that pass all filters are referred to Tika, but for some file types like .flv the Tika parser has problems and hangs, causing the parse job to fail. If these file types pass regex-urlfilter and the other filters, the parse job fails.
For this problem I suggest adding some properties for valid file types, and using code like this in TikaParser.java:
  public ParseResult getParse(Content content) {
    String mimeType = content.getContentType();
+   String[] validTypes = new String[] {"application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text", "text/plain", "application/rtf", "application/rss+xml", "application/x-bzip2", "application/x-gzip", "application/x-javascript", "application/javascript", "text/javascript", "application/x-shockwave-flash", "application/zip", "text/xml", "application/xml"};
+   boolean valid = false;
+   for (int k = 0; k < validTypes.length; k++) {
+     if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
+       valid = true;
+   }
+   if (!valid)
+     return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype " + mimeType).getEmptyParseResult(content.getUrl(), getConf());
    URL base;
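The whitelist check proposed in NUTCH-1281 above can be written more compactly with a set lookup. The sketch below is only an illustration of that idea: the class MimeTypeWhitelist is hypothetical, the allowed types would presumably come from configuration rather than code, and entries are assumed to be given in lower case.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of TikaParser or any Nutch plugin.
final class MimeTypeWhitelist {
    private final Set<String> allowed; // entries are expected in lower case

    MimeTypeWhitelist(Set<String> allowed) {
        this.allowed = Set.copyOf(allowed);
    }

    /** Returns true when the base MIME type (parameters stripped) is in the whitelist. */
    boolean accepts(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = mimeType.indexOf(';');
        String base = (semicolon >= 0) ? mimeType.substring(0, semicolon) : mimeType;
        return allowed.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

Unlike the loop in the issue description, this version also tolerates a missing content type, which would otherwise cause a NullPointerException on mimeType.toLowerCase().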
[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211320#comment-13211320 ] Lewis John McGibbney commented on NUTCH-1278: - Behnam, this looks interesting but there are a few problems here.
1) It would be much easier for us to apply, test and comment on your contribution if you included it in a simple .patch file. This can be done like so
{code}
$ cd $NUTCH_HOME
$ svn diff > NUTCH-patch-name.patch
{code}
The current zip format for the patch(es), plus the fact that every class has been patched separately from its own respective directory, makes it really hard for us to work with this.
2) It doesn't appear that this patch actually applies against trunk. Maybe 1.4? You can check out trunk here [1]. I was getting errors when trying to apply HttpBase, then gave up and started writing this.
3) For a change to the fetcher of this scale, it would be really nice if you could provide a test within the test suite we already maintain [2].
As I said, this looks really great, and sorry for the rather lengthy initial response, but for us to consider this for integration it would be great for your contributions to meet this minimum requirement, as they are highly appreciated. Thank you
[1] https://svn.apache.org/repos/asf/nutch/trunk/
[2] https://svn.apache.org/viewvc/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java?view=markup
Fetch Improvement in threads per host - Key: NUTCH-1278 URL: https://issues.apache.org/jira/browse/NUTCH-1278 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Attachments: NUTCH-1278.zip The value of maxThreads is equal to fetcher.threads.per.host and is constant for every host. It would be possible to use dynamic values for every host, influenced by the number of blocked requests.
This means that if the number of blocked requests for one host increases, then we must decrease this value and increase http.timeout.
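The idea in NUTCH-1278 above (fewer threads and a longer timeout for a host as its blocked requests pile up) could be modelled per host roughly as follows. The class HostFetchSettings and both adjustment formulas are assumptions made up for this sketch; the contents of NUTCH-1278.zip may work quite differently.

```java
// Hypothetical per-host state for illustration; the formulas are arbitrary choices.
final class HostFetchSettings {
    private final int configuredMaxThreads; // e.g. the value of fetcher.threads.per.host
    private final long baseTimeoutMs;       // e.g. the value of http.timeout
    private int blockedRequests;

    HostFetchSettings(int configuredMaxThreads, long baseTimeoutMs) {
        this.configuredMaxThreads = configuredMaxThreads;
        this.baseTimeoutMs = baseTimeoutMs;
    }

    /** Call once for every request to this host that was blocked. */
    void recordBlocked() {
        blockedRequests++;
    }

    /** One thread fewer per blocked request, but never below one. */
    int maxThreads() {
        return Math.max(1, configuredMaxThreads - blockedRequests);
    }

    /** The timeout grows linearly with the number of blocked requests. */
    long timeoutMs() {
        return baseTimeoutMs * (1 + blockedRequests);
    }
}
```

A real implementation would also need a way to recover (for instance, resetting the counter after a run of successful fetches) and a unit test along the lines Lewis asks for in point 3.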
[jira] [Commented] (NUTCH-929) Create a REST-based admin UI for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211404#comment-13211404 ] Lewis John McGibbney commented on NUTCH-929: As we are using org.restlet as the underlying RESTlet framework, we will need to utilise the presentation technologies it supports, e.g. integration with three popular template technologies: XSLT, FreeMarker or Apache Velocity. [1] http://wiki.restlet.org/docs_2.0/13-restlet/21-restlet/378-restlet/116-restlet.html Create a REST-based admin UI for Nutch -- Key: NUTCH-929 URL: https://issues.apache.org/jira/browse/NUTCH-929 Project: Nutch Issue Type: New Feature Components: administration gui Affects Versions: nutchgora Reporter: Andrzej Bialecki This is a follow up to NUTCH-880 - we need to expose the functionality of the REST API in a user-friendly admin UI. Thanks to the nature of the API, the UI can be implemented in any UI framework that speaks REST/JSON, so it could be a simple webapp (we already have jetty) or a Swing / Pivot / etc standalone application.
[jira] [Commented] (NUTCH-1273) Fix [deprecation] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211464#comment-13211464 ] Lewis John McGibbney commented on NUTCH-1273: - With this issue, do we wish to simply suppress the warnings? What other options do we have? It makes me think that we could upgrade the use of classes within our library dependencies. Any ideas? Fix [deprecation] javac warnings Key: NUTCH-1273 URL: https://issues.apache.org/jira/browse/NUTCH-1273 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 As part of this task, these warnings should be resolved. However, this particular strand of warnings can be resolved either by adding {code} @SuppressWarnings("deprecation") {code} or by actually upgrading our class usage to rely upon non-deprecated classes. Which option is more appropriate for the project?
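The two options raised in NUTCH-1273 can be illustrated with a minimal sketch. It uses the JDK's deprecated Date.getYear() as a stand-in for whatever deprecated classes Nutch actually touches; the class and method names are hypothetical and not taken from the Nutch codebase.

```java
import java.util.Calendar;
import java.util.Date;

final class DeprecationOptions {

    // Option 1: keep the deprecated call and silence javac for this method only.
    @SuppressWarnings("deprecation")
    static int yearSuppressed(Date date) {
        return date.getYear() + 1900; // Date.getYear() is deprecated and 1900-based
    }

    // Option 2: move to the non-deprecated replacement, so no suppression is needed.
    static int yearUpgraded(Date date) {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        return calendar.get(Calendar.YEAR);
    }
}
```

Suppression keeps the diff small but leaves the old call in place; upgrading removes the warning at its source, at the cost of touching more code.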
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211470#comment-13211470 ] Lewis John McGibbney commented on NUTCH-978: Hi Chris, did you mentor this project through GSoC? I've downloaded the .zip available in the description (which I've also attached in case the link goes AWOL) and I'm going to play about with it. I'll attach it as a patch if I get anywhere. [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing. --- Key: NUTCH-978 URL: https://issues.apache.org/jira/browse/NUTCH-978 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.2 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9 Reporter: Ammar Shadiq Assignee: Chris A. Mattmann Priority: Minor Labels: gsoc2011, mentor Fix For: nutchgora Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png Original Estimate: 1,680h Remaining Estimate: 1,680h Nutch uses the parse-html plugin to parse web pages. It processes the contents of a web page by removing HTML tags and components like JavaScript and CSS, leaving the extracted text to be stored in the index. By default Nutch doesn't have the capability to select certain atomic elements on an HTML page, like certain tags, certain content, some part of the page, etc. An HTML page has a tree-like XML pattern with HTML tags as its branches and text as its nodes. These branches and nodes can be extracted using XPath. XPath allows us to select a certain branch or node of an XML document and can therefore be used to extract certain information and treat it differently based on its content and the user requirements. 
Furthermore, a web domain like a news website usually has the same HTML code structure for storing the information on its web pages. Pages with this same structure can be parsed using the same XPath query to retrieve the same content element. All of the XPath queries for selecting various content can be stored in an XPath configuration file. Nutch is meant to handle various web sources, and not all of the web pages retrieved from those sources have the same HTML code structure, so they have to be treated differently using the correct XPath configuration. The selection of the correct XPath configuration can be done automatically using regex, by matching the URL of the web page against the valid URL pattern for that XPath configuration. This automatic mechanism allows the Nutch user to process various web pages and get only the information the user wants, making the index more accurate and its content more flexible. The components for this idea have been tested on Nutch 1.2 for selecting certain elements on various news websites for the purpose of document clustering. This includes a Configuration Editor Application built using the NetBeans 6.9 Application Framework, though it still needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
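The mechanism described in NUTCH-978 (pick an XPath query by matching the page URL against a regex, then extract that element) might look roughly like the sketch below. It is not the GSoC plugin's code: the class, the rule list and the example URLs are hypothetical, and it assumes the page is well-formed XHTML so that the JDK's own XPath support can parse it. Real-world HTML would first need a tolerant HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathByUrlSketch {

    // One hypothetical configuration entry: URL pattern -> XPath query.
    private static final class Rule {
        final Pattern urlPattern;
        final String xpathQuery;

        Rule(Pattern urlPattern, String xpathQuery) {
            this.urlPattern = urlPattern;
            this.xpathQuery = xpathQuery;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    void addRule(String urlRegex, String xpathQuery) {
        rules.add(new Rule(Pattern.compile(urlRegex), xpathQuery));
    }

    // Returns the text selected by the first rule whose regex matches the
    // whole URL, or null when no rule applies to this URL.
    String extract(String url, String xhtml) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                try {
                    return XPathFactory.newInstance().newXPath()
                            .evaluate(rule.xpathQuery, new InputSource(new StringReader(xhtml)));
                } catch (XPathExpressionException e) {
                    throw new IllegalArgumentException("Bad XPath or unparsable page", e);
                }
            }
        }
        return null;
    }
}
```

For instance, a rule mapping `http://news\.example\.com/.*` to `//div[@id='headline']` would pull only the headline text from pages on that (made-up) site and leave pages from other sites untouched.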
[jira] [Commented] (NUTCH-1079) StringBuffer converted to StringBuilder
[ https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210905#comment-13210905 ] Lewis John McGibbney commented on NUTCH-1079: - I kinda got this feeling Julien. Thanks. Well, I think based on the discussion above, there seems to be no overwhelming reason for changing all of this. You did however begin to make a point of sorts Markus, any thoughts now that this one has had a bit of time to settle in? StringBuffer converted to StringBuilder --- Key: NUTCH-1079 URL: https://issues.apache.org/jira/browse/NUTCH-1079 Project: Nutch Issue Type: Improvement Components: fetcher, indexer Reporter: Karthik K Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch The codebase uses StringBuffer throughout, even where thread-safety is probably not intended. This patch replaces StringBuffer with StringBuilder, as applicable.
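As a small illustration of the kind of change NUTCH-1079 proposes, the sketch below builds a string in a local variable that never leaves the calling thread, which is the case where StringBuilder can stand in for StringBuffer. The class and method are hypothetical and not taken from the patch.

```java
final class BuilderSketch {

    // The builder is a local variable and never shared between threads, so the
    // synchronization that StringBuffer performs on every call buys nothing here.
    static String join(String[] parts, char separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            sb.append(parts[i]); // chained appends rather than string concatenation
        }
        return sb.toString();
    }
}
```

Where a buffer really is shared between threads, StringBuffer (or explicit locking) would still be the safer choice.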
[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210937#comment-13210937 ] Lewis John McGibbney commented on NUTCH-1246: - Committed @ revision 1245921 in nutchgora, thanks Julien. There was one small change where the jackson dependency was related to org.restlet instead of org.codehaus; it is integral to some nutchgora functionality so couldn't be removed. Also the hadoop-test dependencies have been upgraded to 0.20.205. Upgrade to Hadoop 1.0.0 --- Key: NUTCH-1246 URL: https://issues.apache.org/jira/browse/NUTCH-1246 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora, 1.5 Reporter: Julien Nioche
[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210188#comment-13210188 ] Lewis John McGibbney commented on NUTCH-1210: - Hey Markus. In /conf we also have .template files for current filters of this nature. I don't know if you want to include one of those :0| DomainBlacklistFilter - Key: NUTCH-1210 URL: https://issues.apache.org/jira/browse/NUTCH-1210 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1210-1.5-1.patch The current DomainFilter acts as a white list. We also need a filter that acts as a black list so we can allow TLDs and/or domains with DomainFilter but blacklist specific subdomains. If we patched the current DomainFilter for this behaviour it would break current semantics such as its precedence. Therefore I would propose a new filter instead.
[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210312#comment-13210312 ] Lewis John McGibbney commented on NUTCH-1246: - How is this issue? Upgrade to Hadoop 1.0.0 --- Key: NUTCH-1246 URL: https://issues.apache.org/jira/browse/NUTCH-1246 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora, 1.5 Reporter: Julien Nioche -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210498#comment-13210498 ] Lewis John McGibbney commented on NUTCH-585: I like this contribution Elisabeth. Is there any way it could be updated to trunk with the following suggestions? 1) Please rename the package names to org.apache.nutch.blah.blah 2) In your ivy.xml please change the ivy-configurations.xml include to {code} <configurations> <include file="../../..//ivy/ivy-configurations.xml"/> </configurations> {code} This is Eclipse specific. 3) Would it be possible to change the CHANGES.txt to package.html and store it in the lowest-level folder within the java directory? 4) It would really put the cherry on top if we could get a test case scenario, this would be a big +1. 5) I think the name is maybe a bit long... but I am fine keeping it if you think it is appropriate, as it is your patch after all. Thank you for the contribution. [PARSE-HTML plugin] Block certain parts of HTML code from being indexed --- Key: NUTCH-585 URL: https://issues.apache.org/jira/browse/NUTCH-585 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: All operating systems Reporter: Andrea Spinelli Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch We are using nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches. We have modified the plugin so that it ignores HTML code between certain HTML comments, like <!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE --> We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml). 
We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
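The ignore-marker idea from the NUTCH-585 description can be sketched as a plain text transformation. This is not the code from any of the attached patches: the class is hypothetical, and the two marker strings are passed in as arguments on the assumption that they would come from configuration properties such as the parser.html.ignore.start and parser.html.ignore.stop names floated in the description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IgnoreMarkersSketch {

    // Removes every region that begins with startMarker and ends with
    // stopMarker (markers included), so that text never reaches the indexer.
    static String stripIgnored(String html, String startMarker, String stopMarker) {
        Pattern region = Pattern.compile(
                Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker),
                Pattern.DOTALL); // DOTALL lets an ignored region span several lines
        Matcher matcher = region.matcher(html);
        return matcher.replaceAll("");
    }
}
```

The reluctant quantifier keeps two separate ignored regions from being merged into one, which would otherwise swallow the wanted content between them.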
[jira] [Commented] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory
[ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210506#comment-13210506 ] Lewis John McGibbney commented on NUTCH-1001: - Hi Gabriele are you interested in incorporating the comments into this patch? It was unfortunate not to get in to 1.4, but we have no immediate plan for 1.5 so it would be great to revive this issue? bin/nutch fetch/parse handle crawl/segments directory - Key: NUTCH-1001 URL: https://issues.apache.org/jira/browse/NUTCH-1001 Project: Nutch Issue Type: Improvement Reporter: Gabriele Kahlout Priority: Minor Fix For: 1.5 Attachments: NUTCH-1001.patch I'm having issues porting scripts across different systems to support the step of extracting the latest/only segments resulting from the generate phase. Variants include: $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1] $ s1=`ls -d crawl/segments/2* | tail -1` #[2] $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1` $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1` And I'm not sure what windows users would have to do. Some users may also do with: bin/nutch fetch with crawl/segments/2* But I don't see a need in having the user extract/worry-about the latest/only segment, and have it a described step in every nutch tutorial. More over only fetch and parse expect a segment while other commands are fine with the directory of segments. Therefore, I think it's beneficial if fetch and parse also handle directories of segments. [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching -- This message is automatically generated by JIRA. 
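The shell variants listed in NUTCH-1001 all boil down to "take the newest entry under crawl/segments". A portable version of that selection could be sketched as below. It assumes segment directories are named by their creation timestamp (for example 20120216120000), which is what makes a plain string comparison sufficient; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

final class LatestSegmentSketch {

    // With fixed-width timestamp names, the lexicographically greatest name is
    // also the most recent segment. Returns null for an empty segments directory.
    static String latest(List<String> segmentNames) {
        if (segmentNames.isEmpty()) {
            return null;
        }
        return Collections.max(segmentNames);
    }
}
```

Doing the selection inside fetch and parse themselves, as the issue proposes, would remove the need for every tutorial to carry its own `ls | tail -1` variant.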
[jira] [Commented] (NUTCH-1079) StringBuffer converted to StringBuilder
[ https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210516#comment-13210516 ] Lewis John McGibbney commented on NUTCH-1079: - How is this guys? It seems that there was a level of agreement wrt appends over concats, but the patch/issue never seemed to get updated and has now stagnated slightly. Any chance of reviving the patient? StringBuffer converted to StringBuilder --- Key: NUTCH-1079 URL: https://issues.apache.org/jira/browse/NUTCH-1079 Project: Nutch Issue Type: Improvement Components: fetcher, indexer Reporter: Karthik K Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch All across the codebase, it contains StringBuffer, when thread-safety is probably not intended. This patch replaces StringBuffer to StringBuilder, as applicable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210529#comment-13210529 ] Lewis John McGibbney commented on NUTCH-1210: - One last thing, I think your patch requires {code} <ant dir="urlfilter-domainblacklist" target="deploy/test/clean"/> {code} in src/plugin/build.xml (i.e. one such entry for each of the deploy, test and clean targets). Thanks DomainBlacklistFilter - Key: NUTCH-1210 URL: https://issues.apache.org/jira/browse/NUTCH-1210 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1210-1.5-1.patch The current DomainFilter acts as a white list. We also need a filter that acts as a black list so we can allow TLDs and/or domains with DomainFilter but blacklist specific subdomains. If we patched the current DomainFilter for this behaviour it would break current semantics such as its precedence. Therefore I would propose a new filter instead.
[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210534#comment-13210534 ] Lewis John McGibbney commented on NUTCH-1246: - Removal of jackson library in ivy/ivy.xml committed @ revision 1245749 in trunk Upgrade to Hadoop 1.0.0 --- Key: NUTCH-1246 URL: https://issues.apache.org/jira/browse/NUTCH-1246 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora, 1.5 Reporter: Julien Nioche -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1193) Incorrect url transform to lowercase: parameter solr
[ https://issues.apache.org/jira/browse/NUTCH-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210542#comment-13210542 ] Lewis John McGibbney commented on NUTCH-1193: - Committed @ revision 1245753 in trunk. Thank you Eduardo for reporting. Incorrect url transform to lowercase: parameter solr Key: NUTCH-1193 URL: https://issues.apache.org/jira/browse/NUTCH-1193 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.3 Reporter: Eduardo dos Santos Leggiero Priority: Trivial Labels: crawling Fix For: 1.5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1279) Check if limit has been reached in GeneraterReducer must be the first check performance-wise.
[ https://issues.apache.org/jira/browse/NUTCH-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208341#comment-13208341 ] Lewis John McGibbney commented on NUTCH-1279: - Hi Ferdy, have you checked whether this is the case in trunk as well? I know the fetcher architecture is slightly different. Check if limit has been reached in GeneraterReducer must be the first check performance-wise. - Key: NUTCH-1279 URL: https://issues.apache.org/jira/browse/NUTCH-1279 Project: Nutch Issue Type: Improvement Components: generator Reporter: Ferdy Galema Priority: Minor Fix For: nutchgora Attachments: NUTCH-1279.txt The (count = limit) should be put up front in the reduce method of the generator, because that way when the limit is reached the reduce method will return faster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
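The reordering suggested in NUTCH-1279 can be shown with a toy reducer. This is not the GeneratorReducer code: the class is hypothetical, the counter stands in for whatever per-key work the real reduce method does, and the comparison is assumed to be `count >= limit` (the issue text only shows "count = limit").

```java
final class LimitFirstSketch {

    private final long limit;
    private long count = 0;
    int expensiveChecks = 0; // how often the costly per-key work actually ran

    LimitFirstSketch(long limit) {
        this.limit = limit;
    }

    // Returns true if the key was emitted.
    boolean reduce(String key) {
        if (count >= limit) {
            return false; // limit reached: return before doing any other work
        }
        expensiveChecks++; // stand-in for the remaining per-key checks
        if (key.isEmpty()) {
            return false;
        }
        count++;
        return true;
    }
}
```

With the limit test first, every call made after the limit is hit costs a single comparison, which is the speed-up the issue describes.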
[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208413#comment-13208413 ] Lewis John McGibbney commented on NUTCH-1278: - Hi Behnam. Do you have a patch for trunk? Thank you Fetch Improvement in threads per host - Key: NUTCH-1278 URL: https://issues.apache.org/jira/browse/NUTCH-1278 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht The value of maxThreads is equal to fetcher.threads.per.host and is constant for every host. It would be possible to use a dynamic value for each host, influenced by the number of blocked requests. This means that if the number of blocked requests for one host increases, then we must decrease this value and increase http.timeout
[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208459#comment-13208459 ] Lewis John McGibbney commented on NUTCH-1210: - This looks really nice Markus. I like the documentation and test as well. I would like to try it out with another couple of test scenarios before giving my full opinion, which I will be able to do this afternoon. DomainBlacklistFilter - Key: NUTCH-1210 URL: https://issues.apache.org/jira/browse/NUTCH-1210 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1210-1.5-1.patch The current DomainFilter acts as a white list. We also need a filter that acts as a black list so we can allow TLDs and/or domains with DomainFilter but blacklist specific subdomains. If we patched the current DomainFilter for this behaviour it would break current semantics such as its precedence. Therefore I would propose a new filter instead.
[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208881#comment-13208881 ] Lewis John McGibbney commented on NUTCH-1210: - Hi Markus. 1) I would ask one tiny change in ivy.xml from {code} <configurations> <include file="${nutch.root}/ivy/ivy-configurations.xml"/> </configurations> {code} to {code} <configurations> <include file="../../..//ivy/ivy-configurations.xml"/> </configurations> {code} This is purely for consistency, as I think it's easier to configure in Eclipse, where the ${nutch.root} variable hasn't been specified. 2) Also domainblacklist-urlfilter.txt is not included in the patch under /conf. Would it be possible to have a file there with some commented out documentation so users at least have something to go on? 3) Your documentation in the main class also mentions that the property can be overridden in nutch-*.xml, however no property exists in nutch-default for people to go on, meaning that people are likely to become confused when trying to set the property from nutch-site.xml. My tests seem to be failing with trunk, so there is something up with my trunk checkout. I'll go get that sorted then test a bit more. Thanks DomainBlacklistFilter - Key: NUTCH-1210 URL: https://issues.apache.org/jira/browse/NUTCH-1210 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1210-1.5-1.patch The current DomainFilter acts as a white list. We also need a filter that acts as a black list so we can allow TLDs and/or domains with DomainFilter but blacklist specific subdomains. If we patched the current DomainFilter for this behaviour it would break current semantics such as its precedence. Therefore I would propose a new filter instead.
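To make the black-list behaviour in the NUTCH-1210 description concrete, here is a minimal sketch, not the code from NUTCH-1210-1.5-1.patch. The class is hypothetical and self-contained. It borrows one convention on the assumption that it matches Nutch's URL filters: returning the URL means "accept" and returning null means "reject".

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class DomainBlacklistSketch {

    private final Set<String> blacklist = new HashSet<String>();

    void add(String domainOrHost) {
        blacklist.add(domainOrHost.toLowerCase(Locale.ROOT));
    }

    // Returns the URL unchanged when allowed, or null when its host or any
    // parent domain of that host is on the black list.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URLs are rejected in this sketch
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Walk "a.ads.example.com" -> "ads.example.com" -> "example.com" -> "com".
        while (true) {
            if (blacklist.contains(host)) {
                return null;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return url;
            }
            host = host.substring(dot + 1);
        }
    }
}
```

Blacklisting ads.example.com here rejects that host and anything beneath it while leaving www.example.com alone, which is the "allow the domain, block specific subdomains" combination the issue asks for.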
[jira] [Commented] (NUTCH-1222) Upgrade to new Hadoop 0.22.0
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207730#comment-13207730 ] Lewis John McGibbney commented on NUTCH-1222: - Is this necessary anymore Markus now that we are using 1.0.0? Upgrade to new Hadoop 0.22.0 Key: NUTCH-1222 URL: https://issues.apache.org/jira/browse/NUTCH-1222 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Critical Fix For: 1.5 Attachments: NUTCH-1222-1.5-1.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1222) Upgrade to new Hadoop 0.22.0
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207754#comment-13207754 ] Lewis John McGibbney commented on NUTCH-1222: - Hey Markus it was more a question rather than anything else really. I personally don't have much of a preference as I'm not following the Hadoop project decisions as closely as others, therefore I don't know intricate differences in development tracks if I'm honest. Maybe you may wish to keep this open or something. Up to you I guess :0) Upgrade to new Hadoop 0.22.0 Key: NUTCH-1222 URL: https://issues.apache.org/jira/browse/NUTCH-1222 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Critical Fix For: 1.5 Attachments: NUTCH-1222-1.5-1.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206933#comment-13206933 ] Lewis John McGibbney commented on NUTCH-1205: - OK when I apply the patch, I'm seeing {code} [ivy:resolve] :: problems summary :: [ivy:resolve] WARNINGS [ivy:resolve] [FAILED ] maven-plugins#maven-cobertura-plugin;1.3!maven-cobertura-plugin.plugin: (0ms) [ivy:resolve] local: tried [ivy:resolve] /home/lewis/.ivy2/local/maven-plugins/maven-cobertura-plugin/1.3/plugins/maven-cobertura-plugin.plugin [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/maven-plugins/maven-cobertura-plugin/1.3/maven-cobertura-plugin-1.3.plugin [ivy:resolve] apache-snapshot: tried [ivy:resolve] http://repository.apache.org/content/groups/snapshots-group/maven-plugins/maven-cobertura-plugin/1.3/maven-cobertura-plugin-1.3.plugin [ivy:resolve] [FAILED ] maven-plugins#maven-findbugs-plugin;1.3.1!maven-findbugs-plugin.plugin: (0ms) [ivy:resolve] local: tried [ivy:resolve] /home/lewis/.ivy2/local/maven-plugins/maven-findbugs-plugin/1.3.1/plugins/maven-findbugs-plugin.plugin [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/maven-plugins/maven-findbugs-plugin/1.3.1/maven-findbugs-plugin-1.3.1.plugin [ivy:resolve] apache-snapshot: tried [ivy:resolve] http://repository.apache.org/content/groups/snapshots-group/maven-plugins/maven-findbugs-plugin/1.3.1/maven-findbugs-plugin-1.3.1.plugin [ivy:resolve] :: [ivy:resolve] :: FAILED DOWNLOADS:: [ivy:resolve] :: ^ see resolution messages for details ^ :: [ivy:resolve] :: [ivy:resolve] :: maven-plugins#maven-cobertura-plugin;1.3!maven-cobertura-plugin.plugin [ivy:resolve] :: maven-plugins#maven-findbugs-plugin;1.3.1!maven-findbugs-plugin.plugin [ivy:resolve] :: {code} There is a really weird extension for the plugins e.g. 
{code} maven-cobertura-plugin.plugin {code} I've tried excluding these both as individual exclusions for the Gora artifacts and as global exclusions for maven-plugins, but none of this works. I've been doing some reading on ivysettings on the ant/ivy website, but there is a bit of documentation so it's taking a while. Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml --- Key: NUTCH-1205 URL: https://issues.apache.org/jira/browse/NUTCH-1205 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205.patch Although gora trunk is unstable, work is ongoing to get this fixed. For the time being, I think Nutchgora should use gora trunk as this will identify more vulnerabilities. I'll get the trivial patch submitted shortly.
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206939#comment-13206939 ] Lewis John McGibbney commented on NUTCH-1205: - To add to this, I can confirm that we are pulling the most up to date maven artifacts from the apache snapshots repository, so at least we are using bleeding edge here. Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml --- Key: NUTCH-1205 URL: https://issues.apache.org/jira/browse/NUTCH-1205 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205.patch Although gora trunk is unstable, work is ongoing to get this fixed. For the time being, I think Nutchgora should use gora trunk as this will identify more vulnerabilities. I'll get the trivial patch submitted shortly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206961#comment-13206961 ] Lewis John McGibbney commented on NUTCH-1205: - Yeah it's another kettle of fish altogether. I'll get on it and hopefully get it sorted out. I'll ensure that the final patch includes the hsqldb upgrade as well Ferdy. Thanks for now. Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml --- Key: NUTCH-1205 URL: https://issues.apache.org/jira/browse/NUTCH-1205 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, NUTCH-1205-v4.patch, NUTCH-1205.patch Although gora trunk is unstable, work is ongoing to get this fixed. For the time being, I think Nutchgora should use gora trunk as this will identify more vulnerabilities. I'll get the trivial patch submitted shortly.
[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin
[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205750#comment-13205750 ] Lewis John McGibbney commented on NUTCH-1129: - Hi Markus. I'm really gutted about this one, I've not had time to sort it out. I want to say the following things though.
- Any23 is now available on repository.apache.org [1], however I think we need to change our ivy resolver to fetch these 0.7.0-snapshots. Should be pretty trivial though.
- Any23 already has a crawler plugin implementation (nothing like the stuff we offer in Nutch ;0)) I'm not familiar with the code, but it might be worth a swatch? [2] Unfortunately the documentation is not great at all as I'm sure you'll agree.
[1] https://repository.apache.org/index.html#nexus-search;quick~org.apache.any23
[2] https://svn.apache.org/viewvc/incubator/any23/trunk/plugins/basic-crawler/
Any23 Nutch plugin -- Key: NUTCH-1129 URL: https://issues.apache.org/jira/browse/NUTCH-1129 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.5 This plugin should build on the Any23 library to provide us with a plugin which extracts RDF data from HTTP and file resources. Although as of writing Any23 is not part of the ASF, the project is working towards integration into the Apache Incubator. Once the project proves its value, this would be an excellent addition to the Nutch 1.X codebase.
[jira] [Commented] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203453#comment-13203453 ] Lewis John McGibbney commented on NUTCH-1269: - Hi Behnam. Can you please package the above code as a patch against 1.5 (trunk)? That way we can try it if we get time. Thank you
Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments
There are some problems with the current Generate method when the maxNumSegments and maxHostCount options are used:
1. the generated segments differ in size
2. with the maxHostCount option, it is unclear whether it was applied or not
3. URLs from one host are distributed non-uniformly between segments
We changed Generator.java as described below, in the Selector class:

private int maxNumSegments;
private int segmentSize;
private int maxHostCount;

public void config
...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
...

public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  int limit2 = (int) ((limit * 3) / 2);
  while (values.hasNext()) {
    if (count == limit)
      break;
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1) {
        currentsegmentnum++;
      } else
        currentsegmentnum = 0;
    }
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      if (segCounts[jk] < segmentSize) {
        full = false;
      }
    }
    if (full) {
      break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    //logWrite("Generated3:"+limit+"-"+count+"-"+url.toString());
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
      hostordomain = hostordomain.toLowerCase();
      boolean countLimit = true;
      // only filter if we are counting hosts or domains
      int[] hostCount = hostCounts.get(hostordomain);
      // host count: {a,b,c,d} means that from this host there are a urls in segment 0 and b urls in seg 1 and ...
      if (hostCount == null) {
        hostCount = new int[maxNumSegments];
        for (int kl = 0; kl < hostCount.length; kl++)
          hostCount[kl] = 0;
        hostCounts.put(hostordomain, hostCount);
      }
      int selectedSeg = currentsegmentnum;
      int minCount = hostCount[selectedSeg];
      for (int jk = 0; jk < maxNumSegments; jk++) {
        if (hostCount[jk] < minCount) {
          minCount = hostCount[jk];
          selectedSeg = jk;
        }
      }
      if (hostCount[selectedSeg] <= maxHostCount) {
        count++;
        entry.segnum = new IntWritable(selectedSeg);
        hostCount[selectedSeg]++;
        output.collect(key, entry);
      }
    } catch (Exception e) {
      //logWrite("Generate-malform:"+hostordomain+"-"+url.toString());
      LOG.warn("Malformed URL: '" + urlString + "', skipping ("
          + StringUtils.stringifyException(e) + ")");
      //continue;
    }
  }
}
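The per-host balancing idea in the proposal above (send each URL to the segment that currently holds the fewest URLs from the same host, and drop it once every segment has reached the cap) can be sketched on its own. This is only an illustration under stated assumptions: `SegmentBalancerSketch` and `assign` are hypothetical names, not Nutch classes, and the cap check here uses a strict limit, whereas the quoted code compares with `<=`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: spreads one host's URLs across segments. */
final class SegmentBalancerSketch {
  private final int maxNumSegments;
  private final int maxHostCount;
  // host -> number of URLs already placed in each segment
  private final Map<String, int[]> hostCounts = new HashMap<>();

  SegmentBalancerSketch(int maxNumSegments, int maxHostCount) {
    this.maxNumSegments = maxNumSegments;
    this.maxHostCount = maxHostCount;
  }

  /**
   * Picks the segment with the fewest URLs for this host, starting from the
   * preferred segment. Returns -1 when the host has reached the cap everywhere.
   */
  int assign(String host, int preferredSegment) {
    int[] counts = hostCounts.computeIfAbsent(host.toLowerCase(), h -> new int[maxNumSegments]);
    int selected = preferredSegment;
    for (int seg = 0; seg < maxNumSegments; seg++) {
      if (counts[seg] < counts[selected]) {
        selected = seg;
      }
    }
    if (counts[selected] >= maxHostCount) {
      return -1; // even the emptiest segment is full for this host
    }
    counts[selected]++;
    return selected;
  }

  public static void main(String[] args) {
    SegmentBalancerSketch balancer = new SegmentBalancerSketch(3, 2);
    for (int i = 0; i < 7; i++) {
      System.out.println(balancer.assign("example.org", 0));
    }
  }
}
```

With three segments and a cap of two per host, the seven calls in `main` return 0, 1, 2, 0, 1, 2 and then -1, so the host's URLs end up evenly spread before the cap rejects the seventh.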
[jira] [Commented] (NUTCH-1270) some of Deflate encoded pages not fetched
[ https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203515#comment-13203515 ] Lewis John McGibbney commented on NUTCH-1270: - Hi Behnam, again thanks for opening this ticket, but could you possibly patch this against trunk (1.5)? Thank you
some of Deflate encoded pages not fetched - Key: NUTCH-1270 URL: https://issues.apache.org/jira/browse/NUTCH-1270 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: fetch, processDeflateEncoded
There is a problem with some web pages that are fetched but whose content cannot be retrieved. After this change, the error is fixed. We changed lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:

  public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {
    if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating"); }
    byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
+   if(content==null)
+     content = DeflateUtils.inflateBestEffort(compressed, 20);
    if (content == null)
      throw new IOException("inflateBestEffort returned null");
    if (LOGGER.isTraceEnabled()) {
      LOGGER.trace("fetched " + compressed.length + " bytes of compressed content (expanded to " + content.length + " bytes) from " + url);
    }
    return content;
  }
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202603#comment-13202603 ] Lewis John McGibbney commented on NUTCH-1259: - Hey Markus. I'm literally up to my eye balls with stuff the now so sorry for not having the time to look through your work. The best I can do is have a look tomorrow, I'll give it my all then. Thanks TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata -- Key: NUTCH-1259 URL: https://issues.apache.org/jira/browse/NUTCH-1259 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1259-1.5-1.patch The MIME-type detected by Tika's Detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this bad Content-Types will end up in the documents. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200716#comment-13200716 ] Lewis John McGibbney commented on NUTCH-1140: - Hi Joe. This one seems to have slipped under the radar somewhat! Can you please attach a patch against 1.5 (trunk)? Thank you if this is possible.
index-more plugin, resetTitle method creates multiple values in the Title field --- Key: NUTCH-1140 URL: https://issues.apache.org/jira/browse/NUTCH-1140 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.3 Reporter: Joe Liedtke Priority: Minor Fix For: 1.5 Attachments: MoreIndexingFilter.093011.patch
From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title regardless of whether one exists or not, which can cause issues down the line with the Solr Indexing process, and based on a thread in the nutch user list it appears that this is causing some users to mark the title as multi-valued in the schema: http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
The following patch removes the title field before adding a new one, which has resolved the issue for me:

--- MoreIndexingFilter.old 2011-09-30 11:44:35.0 +
+++ MoreIndexingFilter.java 2011-09-30 09:58:48.0 +
@@ -276,6 +276,7 @@
       for (int i=0; i<patterns.length; i++) {
         if (matcher.contains(contentDisposition,patterns[i])) {
           result = matcher.getMatch();
+          doc.removeField("title");
           doc.add("title", result.group(1));
           break;
         }
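To see why removing the field first matters, here is a minimal sketch of the behaviour the ticket describes. The `Doc` class below is a hypothetical stand-in for a multi-valued index document (it is not the Nutch document class, and only mirrors the `add` and `removeField` calls that appear in the patch); `resetTitle` is likewise an illustrative name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TitleResetSketch {
  /** Hypothetical multi-valued document: add() appends, it never replaces. */
  static final class Doc {
    private final Map<String, List<String>> fields = new HashMap<>();

    void add(String name, String value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    void removeField(String name) {
      fields.remove(name);
    }

    List<String> values(String name) {
      return fields.getOrDefault(name, new ArrayList<>());
    }
  }

  /** Mirrors the patched logic: drop any existing title, then add the new one. */
  static void resetTitle(Doc doc, String newTitle) {
    doc.removeField("title");
    doc.add("title", newTitle);
  }

  public static void main(String[] args) {
    Doc unpatched = new Doc();
    unpatched.add("title", "Original page title");
    unpatched.add("title", "report.pdf"); // second value, field is now multi-valued
    System.out.println(unpatched.values("title").size()); // 2

    Doc patched = new Doc();
    patched.add("title", "Original page title");
    resetTitle(patched, "report.pdf");
    System.out.println(patched.values("title")); // [report.pdf]
  }
}
```

A single-valued `title` field in the Solr schema would reject the first document and accept the second, which matches the symptom reported on the user list.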
[jira] [Commented] (NUTCH-1256) WebGraph to dump host + score
[ https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196894#comment-13196894 ] Lewis John McGibbney commented on NUTCH-1256: - I like this Markus. I wonder if it is possible for you to add some in-line documentation? Or Javadoc, depending on what you prefer. Also, if you get time, could this be added here: http://wiki.apache.org/nutch/bin/nutch%20nodedumper WebGraph to dump host + score - Key: NUTCH-1256 URL: https://issues.apache.org/jira/browse/NUTCH-1256 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1256-1.5-1.patch WebGraph's NodeDumper tool can dump url,score information but a host|domain,score output can also be put to good use. This is likely to require a new MapReduce job as the NodeDumper's anatomy is not suited to return max or summed scores. Code could also be merged with the tool.
[jira] [Commented] (NUTCH-1081) ant tests fail
[ https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197288#comment-13197288 ] Lewis John McGibbney commented on NUTCH-1081: - Hi Ferdy. Have you noticed anything dodgy with this?
ant tests fail --- Key: NUTCH-1081 URL: https://issues.apache.org/jira/browse/NUTCH-1081 Project: Nutch Issue Type: Bug Components: fetcher, generator, injector, storage Affects Versions: nutchgora Environment: Ubuntu release 11.04 (natty) Kernel Linux 2.6.38-10-generic GNOME 2.32.1 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: nutchgora
The following tests fail when running ant test on trunk 2.0
{code}
[junit] Running org.apache.nutch.api.TestAPI
[junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 11.028 sec
[junit] Test org.apache.nutch.api.TestAPI FAILED
[junit] Running org.apache.nutch.crawl.TestGenerator
[junit] Tests run: 4, Failures: 0, Errors: 4, Time elapsed: 0.478 sec
[junit] Test org.apache.nutch.crawl.TestGenerator FAILED
[junit] Running org.apache.nutch.crawl.TestInjector
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.474 sec
[junit] Test org.apache.nutch.crawl.TestInjector FAILED
[junit] Running org.apache.nutch.fetcher.TestFetcher
[junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.526 sec
[junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
[junit] Running org.apache.nutch.storage.TestGoraStorage
[junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.468 sec
[junit] Test org.apache.nutch.storage.TestGoraStorage FAILED
{code}
[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190545#comment-13190545 ] Lewis John McGibbney commented on NUTCH-965: Hi can anyone advise if I should be looking @ ParseUtil class in trunk? I'm a bit confused and Eclipse doesn't seem to be helping out much. Skip parsing for truncated documents Key: NUTCH-965 URL: https://issues.apache.org/jira/browse/NUTCH-965 Project: Nutch Issue Type: Improvement Components: parser Reporter: Alexis Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-965-v2.patch, parserJob.patch The issue you're likely to run into when parsing truncated FLV files is described here: http://www.mail-archive.com/user@nutch.apache.org/msg01880.html The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1254) NTLMv2 is not supported and HttpClient returns error code 500
[ https://issues.apache.org/jira/browse/NUTCH-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188491#comment-13188491 ] Lewis John McGibbney commented on NUTCH-1254: - Hi Remi, are you able to provide a patch for trunk which either recognizes dynamically whether to use NTLMv2 or maybe sets this as a boolean value in nutch-default.xml? NTLMv2 is not supported and HttpClient returns error code 500 - Key: NUTCH-1254 URL: https://issues.apache.org/jira/browse/NUTCH-1254 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Remi Tassing Priority: Minor When trying to access some SharePoint (IIS) websites using NTLMv2 authentication, Nutch fails and gets an error code 500. HttpClient only supports an early version of NTLM but not NTLMv2. HttpUrlConnection can be used instead. [1] http://oaklandsoftware.com/papers/ntlm.html [2] http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html
[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188505#comment-13188505 ] Lewis John McGibbney commented on NUTCH-1086: - When trying to access some SharePoint(IIS) website using NTLMv2 authentication, Nutch fails and gets an error code 500. HttpClient only supports an early version of NTLM but not NTLMv2. HttpUrlConnection can be used instead. [1]http://oaklandsoftware.com/papers/ntlm.html [2]http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html Rewrite protocol-httpclient --- Key: NUTCH-1086 URL: https://issues.apache.org/jira/browse/NUTCH-1086 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Markus Jelsma There are several issues about protocol-httpclient and several comments about rewriting the plugin with the new http client libraries. There is, however, not yet an issue for rewriting/reimplementing protocol-httpclient. http://hc.apache.org/httpcomponents-client-ga/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185562#comment-13185562 ] Lewis John McGibbney commented on NUTCH-1246: - Subtask? Upgrade to Hadoop 1.0.0 --- Key: NUTCH-1246 URL: https://issues.apache.org/jira/browse/NUTCH-1246 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora, 1.5 Reporter: Julien Nioche -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185844#comment-13185844 ] Lewis John McGibbney commented on NUTCH-1247: - Where in CrawlDatum is the CrawlDBReader map method on line 159 getting the RetriesSinceFetch() from? {code} output.collect(new Text(retry + value.getRetriesSinceFetch()), COUNT_1); {code} Also, excuse my naivety but can you be more verbose about why the byte value for CrawlDatum.retries goes bad? CrawlDatum.retries should be int Key: NUTCH-1247 URL: https://issues.apache.org/jira/browse/NUTCH-1247 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.5 CrawlDatum.retries is a byte and goes bad with larger values. 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
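On the question of why a byte-sized retry counter "goes bad": a Java `byte` is a signed 8-bit value, so incrementing past 127 wraps around to -128, which is consistent with the negative retry values in the log excerpt above. The sketch below shows only that language behaviour; `RetryCounterSketch` and its methods are hypothetical names and do not reproduce the CrawlDatum code.

```java
final class RetryCounterSketch {
  /** Increments a counter that is stored in a byte, as a byte field would be. */
  static byte nextAsByte(byte retries) {
    return (byte) (retries + 1); // the narrowing cast wraps past Byte.MAX_VALUE
  }

  /** The same increment with an int keeps counting upwards. */
  static int nextAsInt(int retries) {
    return retries + 1;
  }

  public static void main(String[] args) {
    byte asByte = Byte.MAX_VALUE; // 127
    int asInt = Byte.MAX_VALUE;   // 127

    System.out.println("byte: " + nextAsByte(asByte)); // byte: -128
    System.out.println("int:  " + nextAsInt(asInt));   // int:  128
  }
}
```

Any record retried more than 127 times would therefore show up under a negative retry bucket when the value is held in a byte, while an int has ample headroom.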
[jira] [Commented] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184880#comment-13184880 ] Lewis John McGibbney commented on NUTCH-809: Hi Elisabeth, although I haven't had time to look through your zip yet, a big thank you must be aimed your way. If you have time and are willing, please create a new page on the Nutch wiki under plugin central [1]. As you can see, this issue is closely linked to some others of a similar nature so it may or may not change in the future; however, community driven documentation is exactly what we are after and it is greatly welcomed. Please contact me off list or @ dev@ with your wiki username and I will add you to the wiki contributors page. Thank you [1] http://wiki.apache.org/nutch/PluginCentral
Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.4, nutchgora Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.5 Attachments: NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip
h2. Parse-metatags plugin
The parse-metatags plugin consists of a HTMLParserFilter which takes as parameter a list of metatag names with '*' as default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}
The MetatagIndexer uses the output of the parsing above to create two fields 'keywords' and 'description'. Note that keywords is multivalued. The query-basic plugin is used to include these fields in the search e.g. in nutch-site.xml
{code:xml}
<property>
  <name>query.basic.description.boost</name>
  <value>2.0</value>
</property>
<property>
  <name>query.basic.keywords.boost</name>
  <value>2.0</value>
</property>
{code}
This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com
[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185266#comment-13185266 ] Lewis John McGibbney commented on NUTCH-797: This has been committed but the issue is still open and marked as unresolved. I've just spent around 30 mins looking through the three open issues closely surrounding this problem area with constructing outlinks beginning with ?'s. I think that we need to have a close look to try and sort the three issues out.
parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Fix For: nutchgora Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch
This is my first bug and patch on nutch, so apologies if I have not provided enough detail.
In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>
In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of "?co=0&sk=0&p=2&pi=1"
The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining.
While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part.
I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks
Here is the patch info:

Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
===
--- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
+++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
@@ -299,6 +299,50 @@
     return false;
   }
+  private URL fixURL(URL base, String target) throws MalformedURLException
+  {
+    // handle params that are embedded into the base url - move them to target
+    // so URL class constructs the new url class properly
+    if (base.toString().indexOf(';') > 0)
+      return fixEmbeddedParams(base, target);
+
+    // handle the case that there is a target that is a pure query.
+    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
+    // URLs but I've seen this in numerous places, for example at
+    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
+    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
+    // URL constructs the base+target combo as
+    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
+    // dropping the Search.aspx target
+    //
+    // Browsers handle these just fine, they must have an exception similar to this
+    if (target.startsWith("?"))
+    {
+      return fixPureQueryTargets(base, target);
+    }
+
+    return new URL(base, target);
+  }
+
+  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
+
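The patch text is cut off above, so the following is not the author's `fixPureQueryTargets`. It is a minimal, self-contained sketch of the workaround as the description explains it: when the target starts with `?`, copy the rightmost path component of the base in front of it, then resolve as usual. The class name `PureQueryTargetSketch` is hypothetical, and `java.net.URI` is used here in place of `java.net.URL` purely for illustration. As far as I can tell, RFC 3986 section 5.2.2 keeps the full base path for a query-only reference, which would match the browser behaviour the reporter observed, while the older RFC 2396 algorithm drops the last segment.

```java
import java.net.URI;

final class PureQueryTargetSketch {
  /**
   * Resolves target against base. For a pure-query target ("?a=b") the last
   * path segment of the base is prepended first, so the resolver cannot drop it.
   * Assumption: the last base segment contains no ':' (otherwise it would be
   * read as a scheme); a real implementation would need to guard against that.
   */
  static URI resolve(URI base, String target) {
    String path = base.getRawPath();
    if (target.startsWith("?") && path != null) {
      // "/Careers/ASPX/Search.aspx" -> "Search.aspx"
      String lastSegment = path.substring(path.lastIndexOf('/') + 1);
      target = lastSegment + target;
    }
    return base.resolve(target);
  }

  public static void main(String[] args) {
    URI base = URI.create("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
    System.out.println(resolve(base, "?co=0&sk=0&p=2&pi=1"));
    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
  }
}
```

Targets that do not start with `?` go through the normal resolution path unchanged, which keeps the workaround narrow and limits the risk of the side effects the reporter was worried about.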
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185285#comment-13185285 ] Lewis John McGibbney commented on NUTCH-1031: - Hi Julien, out of sheer curiosity, how do we currently parse robots.txt? I found some files (which don't do parsing) in o.a.n.protocol but I've never known what we use for robots.txt. Delegate parsing of robots.txt to crawler-commons - Key: NUTCH-1031 URL: https://issues.apache.org/jira/browse/NUTCH-1031 Project: Nutch Issue Type: Task Reporter: Julien Nioche Assignee: Julien Nioche Priority: Minor Labels: robots.txt Fix For: 1.5 We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly
[jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output
[ https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180357#comment-13180357 ] Lewis John McGibbney commented on NUTCH-1237: - Any problems with committing this one? All local tests pass as per Julien's commits to fix the nightly build. It's providing us with a wealth of info on where the code can be improved. Improve javac arguements for more verbose output - Key: NUTCH-1237 URL: https://issues.apache.org/jira/browse/NUTCH-1237 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch, NUTCH-1237-trunk.patch When trying to fix another problem I stumbled across this one. I think it is important to ensure that javac outputs info regarding deprecation and unchecked operations.
[jira] [Commented] (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180556#comment-13180556 ] Lewis John McGibbney commented on NUTCH-926: Hey guys, just looking at our critical issues and hadn't noticed this one previously. Did anyone have a look at this issue, and can we reproduce it?
Nutch follows wrong url in META http-equiv=refresh tag - Key: NUTCH-926 URL: https://issues.apache.org/jira/browse/NUTCH-926 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: gnu/linux centOs Reporter: Marco Novo Priority: Critical Attachments: ParseOutputFormat.java.patch
We have nutch set to crawl a domain urllist and we want to fetch only the passed domains (hosts), not subdomains. So
WWW.DOMAIN1.COM .. .. .. WWW.RIGHTDOMAIN.COM .. .. .. .. WWW.DOMAIN.COM
We set nutch to: NOT FOLLOW EXTERNAL LINKS
During crawling of WWW.RIGHTDOMAIN.COM, if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
<META http-equiv="refresh" content="0; url=http://WRONG.RIGHTDOMAIN.COM">
</head>
<body>
</body>
</html>

Nutch continues to crawl the WRONG subdomains! But it should not do this!!
During crawling of WWW.RIGHTDOMAIN.COM, if a page contains

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
<META http-equiv="refresh" content="0; url=http://WWW.WRONGDOMAIN.COM">
</head>
<body>
</body>
</html>

Nutch continues to crawl the WRONG domain! But it should not do this! Otherwise we will spider all the web.
We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have done a patch so we will attach it
[jira] [Commented] (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
[ https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180562#comment-13180562 ] Lewis John McGibbney commented on NUTCH-874: I know the heat has kind of shifted away from Nutchgora but it would be great to clarify what this issue actually encapsulates. Was/is it the case that some plugins in Nutchgora are not actually working with the Nutchgora API? I'm kinda confused with this one! Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora -- Key: NUTCH-874 URL: https://issues.apache.org/jira/browse/NUTCH-874 Project: Nutch Issue Type: Bug Components: parser Environment: Nutch 2.0 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Critical Fix For: nutchgora I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin to make sure they all work with Gora/Nutchbase now.
[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178719#comment-13178719 ] Lewis John McGibbney commented on NUTCH-1138: - Hey Markus, it's been committed in trunk but I was wanting to get on with the nutchgora patch asap. Leave it with me and I'll commit and close shortly. Thank you remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078.
[jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output
[ https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176605#comment-13176605 ] Lewis John McGibbney commented on NUTCH-1237: - Revised patch for trunk; I forgot something in the first one. Thanks Improve javac arguements for more verbose output - Key: NUTCH-1237 URL: https://issues.apache.org/jira/browse/NUTCH-1237 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch, NUTCH-1237-trunk.patch When trying to fix another problem I stumbled across this one. I think it is important to ensure that javac outputs info regarding deprecation and unchecked operations.
[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176192#comment-13176192 ] Lewis John McGibbney commented on NUTCH-1138: - Partly committed @ revision 1224917 in trunk Fully committed @ revision 1224919 in trunk (second commit removed LogUtil altogether). remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078.
[jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output
[ https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176220#comment-13176220 ] Lewis John McGibbney commented on NUTCH-1237: - If I can get a +1 I'll commit. Thank you Improve javac arguements for more verbose output - Key: NUTCH-1237 URL: https://issues.apache.org/jira/browse/NUTCH-1237 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch When trying to fix another problem I stumbled across this one. I think it is important to ensure that the javac outputs info regarding deprecation and unchecked operations.
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-incubating in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176224#comment-13176224 ] Lewis John McGibbney commented on NUTCH-1205: - For reference, to fix the above problems see http://stackoverflow.com/questions/197986/what-causes-javac-to-issue-the-uses-unchecked-or-unsafe-operations-warning Upgrade gora modules to 0.2-incubating in ivy/ivy.xml - Key: NUTCH-1205 URL: https://issues.apache.org/jira/browse/NUTCH-1205 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora Attachments: NUTCH-1205-v2.patch, NUTCH-1205.patch Although gora trunk is unstable, work is ongoing to get this fixed. For the time being, I think Nutchgora should use gora trunk as this will identify more vulnerabilities. I'll get the trivial patch submitted shortly.
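As a rough illustration of the warning the linked question is about: javac prints its "uses unchecked or unsafe operations" note when generic collections are used through raw types, and compiling with -Xlint:unchecked shows the offending lines. The class below is a made-up example (UncheckedDemo is not Nutch or Gora code) contrasting the raw form with the parameterized form that compiles cleanly.

```java
import java.util.ArrayList;
import java.util.List;

// Made-up example: the raw-typed method triggers the unchecked warning,
// the parameterized one does not.
class UncheckedDemo {

    /** Raw types: javac -Xlint:unchecked flags the add() call below. */
    static List rawNames() {
        List names = new ArrayList();
        names.add("nutch"); // unchecked call to add(E) as a member of the raw type List
        return names;
    }

    /** Parameterized types: the compiler can check the element type, so no warning. */
    static List<String> typedNames() {
        List<String> names = new ArrayList<String>();
        names.add("nutch");
        return names;
    }

    public static void main(String[] args) {
        System.out.println(rawNames().size() + " " + typedNames().get(0));
    }
}
```

In most cases the fix is mechanical: add the type parameter the code already assumes at runtime.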
[jira] [Commented] (NUTCH-1217) Update NOTICE.txt to drop some copyrights
[ https://issues.apache.org/jira/browse/NUTCH-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175949#comment-13175949 ] Lewis John McGibbney commented on NUTCH-1217: - Hi Guys, having looked deeper into this, the first patch is a load of drivel. As we are pulling the overwhelming majority of our dependencies from upstream repositories using Ivy, there is no need to include them in the NOTICE.txt declarations. We only ship with the JavaSWF and Automaton libraries (both of which are used by plugins). I'll commit this, do the same for Nutchgora, then shut this one off. One last question: is anyone aware whether our licences for the above two packages are OK? I am not aware but I am more than happy to have a word with the authors to find out. Thanks Update NOTICE.txt to drop some copyrights - Key: NUTCH-1217 URL: https://issues.apache.org/jira/browse/NUTCH-1217 Project: Nutch Issue Type: Improvement Components: documentation Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-1217-trunk.patch We have many references to software copyrights which should be dropped. Most of these relate to the Lucene legacy days. -Carrot2 -saxpath -jaxen -jdom -snowball -violinstrings -Jena -bouncycastle -fontbox -jempbox -pdfbox -rome Also some need to be added -slf4j -activation -mortbay (jetty) -jline -junit -stax -wstx As I am unfamiliar with most of these, and it is important to include all references to software outside of the ASF, I would appreciate it if this list could act as a beginning for completing this issue.
[jira] [Commented] (NUTCH-1081) ant tests fail
[ https://issues.apache.org/jira/browse/NUTCH-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175956#comment-13175956 ] Lewis John McGibbney commented on NUTCH-1081: - Hi Ferdy. There have been almost no problems within the CI testing environment for a number of weeks/months. Any failures seem to have been down to the project building on Ubuntu slaves as opposed to Solaris slaves; the failures are a result of incorrect envars being specified. I've added some more functionality to the nutchgora build characteristics, e.g. publishing the JUnit test result report and publishing Javadoc. So as agreed we will keep an eye on this. ant tests fail --- Key: NUTCH-1081 URL: https://issues.apache.org/jira/browse/NUTCH-1081 Project: Nutch Issue Type: Bug Components: fetcher, generator, injector, storage Affects Versions: nutchgora Environment: Ubuntu release 11.04 (natty) Kernel Linux 2.6.38-10-generic GNOME 2.32.1 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: nutchgora The following tests fail when running ant test on trunk 2.0 {code} [junit] Running org.apache.nutch.api.TestAPI [junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 11.028 sec [junit] Test org.apache.nutch.api.TestAPI FAILED [junit] Running org.apache.nutch.crawl.TestGenerator [junit] Tests run: 4, Failures: 0, Errors: 4, Time elapsed: 0.478 sec [junit] Test org.apache.nutch.crawl.TestGenerator FAILED [junit] Running org.apache.nutch.crawl.TestInjector [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.474 sec [junit] Test org.apache.nutch.crawl.TestInjector FAILED [junit] Running org.apache.nutch.fetcher.TestFetcher [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.526 sec [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED [junit] Running org.apache.nutch.storage.TestGoraStorage [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 0.468 sec [junit] Test org.apache.nutch.storage.TestGoraStorage FAILED {code}
[jira] [Commented] (NUTCH-1218) Improve trunk API documentation
[ https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175966#comment-13175966 ] Lewis John McGibbney commented on NUTCH-1218: - Does anyone have any objections to me hacking away at this, making commits when I can? The intention is to work my way through the core classes, providing a description of each package, then get into more detail with individual classes within the 'core' bunch of classes. After this I'll move on to the plugins. After that I'll move on to Nutchgora!!! Improve trunk API documentation --- Key: NUTCH-1218 URL: https://issues.apache.org/jira/browse/NUTCH-1218 Project: Nutch Issue Type: Sub-task Components: documentation Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.5 Attachments: NUTCH-1218.patch The trunk API Java documentation could do with some improving. This issue should track that. It should however not seek to change any functionality within the codebase, only to substantiate and improve the existing documentation.
[jira] [Commented] (NUTCH-1218) Improve trunk API documentation
[ https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175967#comment-13175967 ] Lewis John McGibbney commented on NUTCH-1218: - Another thing: even if I make the changes to trunk, it would be great to view them dynamically on the trunk Javadoc site [1], e.g. publish them after every commit to see the actual changes at incremental stages. Any advice on this? From looking at build.xml, it appears that we only fully publish the Javadoc when releasing... Is this the case? If not then can someone please advise me otherwise? Thanks guys Improve trunk API documentation --- Key: NUTCH-1218 URL: https://issues.apache.org/jira/browse/NUTCH-1218 Project: Nutch Issue Type: Sub-task Components: documentation Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.5 Attachments: NUTCH-1218.patch The trunk API Java documentation could do with some improving. This issue should track that. It should however not seek to change any functionality within the codebase, only to substantiate and improve the existing documentation.
[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176046#comment-13176046 ] Lewis John McGibbney commented on NUTCH-1138: - Looking for logging irregularities in hadoop.log after running a medium-sized crawl over a mini MR cluster, I'm struggling to see any adverse behaviour produced as a result of applying this patch. Most WARNs can be attributed to the new MR API, and I've a couple of java.net.SocketException: Connection reset ERRORs, which we must expect from time to time :0) remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078.
[jira] [Commented] (NUTCH-1216) Add trivial comment to lib/native/README.txt
[ https://issues.apache.org/jira/browse/NUTCH-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163775#comment-13163775 ] Lewis John McGibbney commented on NUTCH-1216: - If you guys are happy to add this then please say. Add trivial comment to lib/native/README.txt Key: NUTCH-1216 URL: https://issues.apache.org/jira/browse/NUTCH-1216 Project: Nutch Issue Type: Improvement Components: documentation Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Trivial Fix For: nutchgora, 1.5 Attachments: NUTCH-1216-nutchgora.patch, NUTCH-1216-trunk.patch This trivial issue simply adds missing comments to the above file. The WARN logging which is churned out has caused a small degree of confusion in the past, therefore this sorts that out :0)
[jira] [Commented] (NUTCH-1200) Resolving Ivy dependencies in several plugins
[ https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13160847#comment-13160847 ] Lewis John McGibbney commented on NUTCH-1200: - Hi Blaise, I would direct you to this tutorial [1]. It covers everything you should need to get Nutch working within your Eclipse IDE. It takes about half an hour to set up but definitely works, as I have been debugging some simple jobs from within Eclipse. Please get back to us on the user lists if you are having any problems. Thank you [1] http://wiki.apache.org/nutch/RunNutchInEclipse Resolving Ivy dependencies in several plugins -- Key: NUTCH-1200 URL: https://issues.apache.org/jira/browse/NUTCH-1200 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.5 Attachments: NUTCH-1200-trunk.patch, NUTCH-1200-v2-trunk.patch When configuring Nutch 1.5-SNAPSHOT in Eclipse, I noticed that any plugins requiring additional libraries OVER AND ABOVE the ones specified in NUTCH_HOME/ivy/ivy.xml cannot resolve the dependencies. Specifically, the classes are {code} - FeedParser <dependency org="net.java.dev.rome" name="rome" rev="1.0.0" conf="*->master"/> - URLAutomationFilter - <dependency org="dk.brics" name="automaton" rev="???"/> - SWFParser <dependency org="com.google.gwt" name="gwt-incubator" rev="2.0.1"/> - HTMLParser <dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.15"/> {code} Further to this, I cannot locate the dk.brics dependency! Finally, the plugin/ivy.xml files for the above plugins cannot be parsed correctly due to the ${nutch.root} variable.
[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13156805#comment-13156805 ] Lewis John McGibbney commented on NUTCH-1210: - Hi Markus, I think this is a great idea. It bears *some* similarity to the old issue NUTCH-208 and I think it would be an excellent contribution. DomainBlacklistFilter - Key: NUTCH-1210 URL: https://issues.apache.org/jira/browse/NUTCH-1210 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 The current DomainFilter acts as a white list. We also need a filter that acts as a black list so we can allow TLDs and/or domains with DomainFilter but blacklist specific subdomains. If we patched the current DomainFilter for this behaviour it would break current semantics such as its precedence. Therefore I would propose a new filter instead.
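A minimal sketch of the proposed blacklist semantics, for discussion only. It does not implement Nutch's URLFilter interface; it just borrows the convention, assumed here, that a filter returns the URL when it passes and null when it is rejected. The class name DomainBlacklistSketch and the choice to reject unparsable URLs are both assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch: rejects a URL when its host, or any parent
// domain of its host, appears in the blacklist.
class DomainBlacklistSketch {

    private final Set<String> blocked = new HashSet<String>();

    DomainBlacklistSketch(String... entries) {
        for (String entry : entries) {
            blocked.add(entry.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the URL unchanged if it is allowed, or null if it is blacklisted. */
    String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // assumption: unparsable URLs are rejected
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Check the full host first, then strip one leading label at a time:
        // a.spam.example.com, spam.example.com, example.com, com.
        while (true) {
            if (blocked.contains(host)) {
                return null;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return url;
            }
            host = host.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        DomainBlacklistSketch f = new DomainBlacklistSketch("spam.example.com");
        System.out.println(f.filter("http://www.example.com/"));
        System.out.println(f.filter("http://spam.example.com/page"));
    }
}
```

Run after a whitelist filter, this allows example.com as a whole while still dropping spam.example.com and anything beneath it, which is the combination the issue describes.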
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155898#comment-13155898 ] Lewis John McGibbney commented on NUTCH-1205: - This can't progress until we get the Gora 0.2-SNAPSHOTs loaded to http://repo1.maven.org/maven2/org/apache/gora/ I'll work on this Upgrade gora modules to 0.2-SNAPSHOT Key: NUTCH-1205 URL: https://issues.apache.org/jira/browse/NUTCH-1205 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora Although gora trunk is unstable, work is ongoing to get this fixed. For the time being, I think Nutchgora should use gora trunk as this will identify more vulnerabilities. I'll get the trivial patch submitted shortly.
[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files
[ https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150500#comment-13150500 ] Lewis John McGibbney commented on NUTCH-1189: - Yes it certainly should. I have a couple of things to sort out just now, then I'll come back to this and get the other properties added in commented-out form; then we should be good to fire this one off. Thanks for now. add commented out default settings to gora.properties files Key: NUTCH-1189 URL: https://issues.apache.org/jira/browse/NUTCH-1189 Project: Nutch Issue Type: Sub-task Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-1189-v2.patch, NUTCH-1189-v3.patch, NUTCH-1189.patch This issue should have been dealt with as part of its parent issue, however I think as it is a fairly large task in itself, it needs to be done independently. The gora.properties file should, amongst other settings, and beside the extremely basic defaults for sqlstore, include defaults for opening HBase, Cassandra, etc. servers on their default ports. Leaving this down to individual interpretation puts a huge onus on the user, hence constructing a barrier to entry for getting the configuration settings up and running.
[jira] [Commented] (NUTCH-1148) Nutchgora job jar functionalilty is broken: PluginManifestParser cannot load plugins from system classloader.
[ https://issues.apache.org/jira/browse/NUTCH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150502#comment-13150502 ] Lewis John McGibbney commented on NUTCH-1148: - I'm happy to see this one go, Ferdy. It's been sitting for a while and has slipped through the net until now. Any other comments please? Nutchgora job jar functionalilty is broken: PluginManifestParser cannot load plugins from system classloader. - Key: NUTCH-1148 URL: https://issues.apache.org/jira/browse/NUTCH-1148 Project: Nutch Issue Type: Bug Affects Versions: nutchgora Reporter: Ferdy Galema Priority: Critical Attachments: NUTCH-1148-v1.patch This affects running nutchgora using Hadoop's RunJar mechanism (hadoop jar ...). The mr tasks are perfectly able to load the plugins (please note NUTCH-937). But, when the plugins are loaded from the *job submitter* process itself, loading plugins might fail due to classloading issues. This is caused by the fact that PluginManifestParser does not use the contextClassLoader that is set by RunJar. This classloader contains the plugins folder. At least the FetcherJob is affected by this, because the job submitter uses getFields of Protocol implementations, therefore loading the plugins. The current 1.x is not affected because it does not load plugins at any point during the job submission. This might of course change so I propose to 'fix' the issue in the 1.x branch as well. The solution is fairly simple: PluginManifestParser should use the contextClassLoader of the current thread instead of using the system classloader. I will attach a patch right away. It currently works but it still needs some further testing.
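The proposed fix boils down to asking the current thread for its context classloader and only falling back to the system classloader when none is set. The snippet below is a sketch of that idea, not the attached NUTCH-1148-v1.patch: ContextLoaderDemo and pluginLoader are hypothetical names, and the empty URLClassLoader merely stands in for the loader that RunJar installs.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo: prefer the thread's context classloader over the system one.
class ContextLoaderDemo {

    /** Returns the context classloader if one is set, otherwise the system classloader. */
    static ClassLoader pluginLoader() {
        ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
        return contextLoader != null ? contextLoader : ClassLoader.getSystemClassLoader();
    }

    public static void main(String[] args) {
        // Stand-in for RunJar: a loader layered on top of the system classloader
        // that would, in a real job, also see the unpacked plugins folder.
        URLClassLoader jobLoader =
            new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        Thread.currentThread().setContextClassLoader(jobLoader);

        System.out.println("uses job loader: " + (pluginLoader() == jobLoader));
    }
}
```

Resources that exist only in the job loader, such as the plugins folder, are invisible to ClassLoader.getSystemClassLoader(), which is why the lookup fails in the job submitter process.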
[jira] [Commented] (NUTCH-1068) Automaton performance improvements based on Lucene code base
[ https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149214#comment-13149214 ] Lewis John McGibbney commented on NUTCH-1068: - Hi Kirby. I understand that this was a while ago now, but as no-one has commented I thought we may as well keep something moving after our conversation on the dev lists. Can you explain how you propose to integrate this into Nutch code? I am unsure where to start as it is a github patch. It's also a huge patch. The performance stuff you mention sounds appealing but I really don't know enough just now, especially as I can't use this patch with trunk code. Thank you Automaton performance improvements based on Lucene code base Key: NUTCH-1068 URL: https://issues.apache.org/jira/browse/NUTCH-1068 Project: Nutch Issue Type: Improvement Reporter: Kirby Bohling Attachments: automaton.diff The Lucene team maintains a modified Automaton library cut down to precisely what they need. It can have significant performance enhancements. I am attempting to backport and shepherd a patch for the original Automaton library. The original Lucene code is here: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/ The Lucene code is likely slightly faster, as it includes several micro optimizations I removed to avoid having to request re-license permission. I would definitely performance test using the Lucene RegEx vs. the patched code. The Lucene code also uses code points not characters, which might make a difference for UTF-16 vs. UTF-32 in obscure cases (I believe the Lucene code builds a UTF-32 clean DFA for accuracy, and then translates it to a UTF-8 DFA for performance but I'm not 100% sure. I don't need/use any of that code, and am currently really only worried about ASCII DFAs). When making heavy use of the NFA-to-DFA transformation, I see a 4x speed up. It likely has a 1.5-2x speed up for regular expression execution from what I can tell. The Nutch backend uses this code in a couple of places, and it likely would lead to performance benefits for those areas. I will attach my backported version for the Automaton 1.11-7 release. While I don't own any of the copyright, all of the code is copyrighted under the BSD license, or the ASF 2.0 license. It is pretty obviously approved for ASF usage. I am not checking that the patch is usable as I'm not the copyright holder. If that is an issue, I'll say yes, I just don't believe I have any legal standing to do so. I don't want to create licensing issues for the ASF.
[jira] [Commented] (NUTCH-1070) Run nutch under native windows (no cygwin)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143966#comment-13143966 ] Lewis John McGibbney commented on NUTCH-1070: - Is this issue to be closed off, Radim? Thanks Run nutch under native windows (no cygwin) -- Key: NUTCH-1070 URL: https://issues.apache.org/jira/browse/NUTCH-1070 Project: Nutch Issue Type: New Feature Affects Versions: 1.3 Environment: Windows XP Home Reporter: Radim Kolar Priority: Minor Labels: windows It's possible to run Nutch in windows without cygwin. 1. Startup script needs to be ported from SH to BAT 2. Because hadoop runs on unix only, we must emulate unix commands to make it work. Luckily only chmod, bash and df need to be emulated
[jira] [Commented] (NUTCH-1070) Run nutch under native windows (no cygwin)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144084#comment-13144084 ] Lewis John McGibbney commented on NUTCH-1070: - Thanks for your comments Radim. Any objections to closing this one off? Run nutch under native windows (no cygwin) -- Key: NUTCH-1070 URL: https://issues.apache.org/jira/browse/NUTCH-1070 Project: Nutch Issue Type: New Feature Affects Versions: 1.3 Environment: Windows XP Home Reporter: Radim Kolar Priority: Minor Labels: windows It's possible to run Nutch in windows without cygwin. 1. Startup script needs to be ported from SH to BAT 2. Because hadoop runs on unix only, we must emulate unix commands to make it work. Luckily only chmod, bash and df need to be emulated
[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files
[ https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144199#comment-13144199 ] Lewis John McGibbney commented on NUTCH-1189: - In addition, there is scope to provide a much richer info resource within this file but I will get round to that later. add commented out default settings to gora.properties files Key: NUTCH-1189 URL: https://issues.apache.org/jira/browse/NUTCH-1189 Project: Nutch Issue Type: Sub-task Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-1189-v2.patch, NUTCH-1189.patch This issue should have been dealt with as part of its parent issue, however I think as it is a fairly large task in itself, it needs to be done independently. The gora.properties file should, amongst other settings, and beside the extremely basic defaults for sqlstore, include defaults for opening HBase, Cassandra, etc. servers on their default ports. Leaving this down to individual interpretation puts a huge onus on the user, hence constructing a barrier to entry for getting the configuration settings up and running.
[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142048#comment-13142048 ] Lewis John McGibbney commented on NUTCH-1188: - Hi guys, I think it's critical that we get this one ironed out before we begin firing RC's. Can we confirm that in our trunk 1.4 development code (and in Nutchgora branch) that this has been sorted out previously and that it is only an issue in the now deprecated 1.4 branch. Thanks ERROR util.LogUtil - Cannot log with method [null] -- Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special enviroment Reporter: Zhang JinYan Attachments: LogUtil.patch LogUtil has static fields,which is initialized like this: FATAL = Logger.class.getMethod(error, new Class[] { Object.class }); but the Logger has no such method,the correct method is: void org.slf4j.Logger.error(String msg) So,LogUtil's static fields are not initialized correctly(they are null) --- Run crawl,you will find msg in hadoop.log: 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null] java.lang.NullPointerException at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103) at java.io.PrintStream.write(PrintStream.java:432) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272) at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85) at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168) at java.io.PrintStream.newLine(PrintStream.java:496) at java.io.PrintStream.println(PrintStream.java:757) at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492) at java.lang.Throwable.printStackTrace(Throwable.java:468) at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665) Patch: 
FATAL = Logger.class.getMethod("error", new Class[] { String.class });
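The lookup failure described in the NUTCH-1188 report can be reproduced in isolation. The following is a minimal sketch, not the actual LogUtil code: `MiniLogger` is a hypothetical stand-in for `org.slf4j.Logger` that declares only `error(String)`, and `lookup` is an illustrative helper that returns null when the lookup fails, on the assumption that LogUtil swallows the exception in a similar way. It shows that `Class.getMethod` matches parameter types exactly, so asking for `error(Object)` finds nothing even though a String is an Object.

```java
import java.lang.reflect.Method;

public class ReflectiveLookupSketch {

    /** Hypothetical stand-in for org.slf4j.Logger: only error(String) is declared. */
    public interface MiniLogger {
        void error(String msg);
    }

    /**
     * Illustrative helper: returns the matching public method, or null when
     * no method with exactly these parameter types exists.
     */
    public static Method lookup(String name, Class<?>... parameterTypes) {
        try {
            return MiniLogger.class.getMethod(name, parameterTypes);
        } catch (NoSuchMethodException e) {
            // Assumption: the failure is swallowed, leaving the caller with null.
            return null;
        }
    }

    public static void main(String[] args) {
        // Mirrors the original call: parameter type Object does not match error(String).
        Method wrong = lookup("error", Object.class);
        // Mirrors the patched call: parameter type String matches exactly.
        Method right = lookup("error", String.class);

        System.out.println("error(Object) found: " + (wrong != null));
        System.out.println("error(String) found: " + (right != null));
    }
}
```

A null `Method` stored in a static field and invoked later would be consistent with the NullPointerException in the stack trace above, although the exact code path inside LogUtil is not shown in this thread.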
[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files
[ https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142453#comment-13142453 ] Lewis John McGibbney commented on NUTCH-1189: - Hi Ferdy, does this have any knock-on effect on what we would wish to include within gora.properties? I understand that you can manually add properties to your HBASEHOME/conf/hbase-site.xml; however, if you think any additional properties would add value to this patch, please re-submit the patch. Your usage of HBase far exceeds my use case, so please feel free. add commented out default settings to gora.properties files Key: NUTCH-1189 URL: https://issues.apache.org/jira/browse/NUTCH-1189 Project: Nutch Issue Type: Sub-task Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora Attachments: NUTCH-1189.patch This issue should have been dealt with as part of its parent issue; however, as it is a fairly large task in itself, I think it needs to be done independently. The gora.properties file should, amongst other settings, and besides the extremely basic defaults for sqlstore, include defaults for opening HBase, Cassandra, etc. servers on their default ports. Leaving this down to individual interpretation puts a huge onus on the user, and so creates a barrier to entry for getting the configuration settings up and running.
[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files
[ https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141238#comment-13141238 ] Lewis John McGibbney commented on NUTCH-1189: - Ferdy, would it be possible for you to attach a patch for HBase (if required)? I will work on the Cassandra stuff, then hopefully we can knock our heads together with some others to get the remaining back ends included within the gora.properties file. add commented out default settings to gora.properties files Key: NUTCH-1189 URL: https://issues.apache.org/jira/browse/NUTCH-1189 Project: Nutch Issue Type: Sub-task Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora This issue should have been dealt with as part of its parent issue; however, as it is a fairly large task in itself, I think it needs to be done independently. The gora.properties file should, amongst other settings, and besides the extremely basic defaults for sqlstore, include defaults for opening HBase, Cassandra, etc. servers on their default ports. Leaving this down to individual interpretation puts a huge onus on the user, and so creates a barrier to entry for getting the configuration settings up and running.
[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141240#comment-13141240 ] Lewis John McGibbney commented on NUTCH-1188: - Thank you for this patch. In the short term, when we get one other +1, I would like to commit. Can I ask you to have a look at NUTCH-1138 and comment on whether the patch is of any use for your activities? It is our vision to remove LogUtil and use the SLF4J/Log4j framework for all logging. Thank you very much for this patch. ERROR util.LogUtil - Cannot log with method [null] -- Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan Attachments: LogUtil.patch LogUtil has static fields, which are initialized like this: FATAL = Logger.class.getMethod("error", new Class[] { Object.class }); but Logger has no such method; the correct method is: void org.slf4j.Logger.error(String msg) So LogUtil's static fields are not initialized correctly (they are null). --- Run a crawl and you will find this message in hadoop.log: 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null] java.lang.NullPointerException at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103) at java.io.PrintStream.write(PrintStream.java:432) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272) at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85) at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168) at java.io.PrintStream.newLine(PrintStream.java:496) at java.io.PrintStream.println(PrintStream.java:757) at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492) at java.lang.Throwable.printStackTrace(Throwable.java:468) at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197) at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665) Patch: FATAL = Logger.class.getMethod("error", new Class[] { String.class });
[jira] [Commented] (NUTCH-1188) ERROR util.LogUtil - Cannot log with method [null]
[ https://issues.apache.org/jira/browse/NUTCH-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141300#comment-13141300 ] Lewis John McGibbney commented on NUTCH-1188: - Is it just me, or has this already been committed along with NUTCH-1078 in trunk [1] when Julien fixed it in the Nutchgora branch [2]? [1] http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/util/LogUtil.java?r1=1175075&r2=1177290&diff_format=h [2] http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/util/LogUtil.java?r1=983885&r2=988544&diff_format=h ERROR util.LogUtil - Cannot log with method [null] -- Key: NUTCH-1188 URL: https://issues.apache.org/jira/browse/NUTCH-1188 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: no special environment Reporter: Zhang JinYan Attachments: LogUtil.patch LogUtil has static fields, which are initialized like this: FATAL = Logger.class.getMethod("error", new Class[] { Object.class }); but Logger has no such method; the correct method is: void org.slf4j.Logger.error(String msg) So LogUtil's static fields are not initialized correctly (they are null). --- Run a crawl and you will find this message in hadoop.log: 2011-11-01 22:38:14,442 ERROR util.LogUtil - Cannot log with method [null] java.lang.NullPointerException at org.apache.nutch.util.LogUtil$1.flush(LogUtil.java:103) at java.io.PrintStream.write(PrintStream.java:432) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272) at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:85) at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:168) at java.io.PrintStream.newLine(PrintStream.java:496) at java.io.PrintStream.println(PrintStream.java:757) at java.lang.Throwable.printStackTraceAsCause(Throwable.java:492) at java.lang.Throwable.printStackTrace(Throwable.java:468) at
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:197) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:665) Patch: FATAL = Logger.class.getMethod("error", new Class[] { String.class });
[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141376#comment-13141376 ] Lewis John McGibbney commented on NUTCH-1138: - Hi. Current 1.4 development is located in the trunk area of SVN. Is this possibly where the confusion is stemming from? When we make code commits, we are committing to the trunk 1.4 development, rather than the branch-1.4 development. The reasoning behind this can be seen in the latest announcement on the Nutch home page. remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078.
[jira] [Commented] (NUTCH-1156) building errors with gora-hbase as a backend; update ivy.xml to use correct dependancies
[ https://issues.apache.org/jira/browse/NUTCH-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140068#comment-13140068 ] Lewis John McGibbney commented on NUTCH-1156: - I forgot that we had reopened this one. Yes, +1 for committing the thrift exclusion. building errors with gora-hbase as a backend; update ivy.xml to use correct dependancies Key: NUTCH-1156 URL: https://issues.apache.org/jira/browse/NUTCH-1156 Project: Nutch Issue Type: Bug Components: build Affects Versions: nutchgora Reporter: Ferdy Fix For: nutchgora Attachments: NUTCH-1156-v1.patch, NUTCH-1156-v2.patch, NUTCH-1156-v3.patch, NUTCH-1156-v4.patch This patch makes sure nutchgora can actually be built when gora-hbase is uncommented in ivy.xml. Note that it is still commented out, though, so sql is still the default backend. However, whenever one wishes to use hbase (as we do), simply uncommenting the section in ivy.xml won't do the trick. This patch fixes this. Changes in ivy.xml: - Set the correct version for gora-hbase and config. - Add thrift to the excludes, as it is not available in the repos; it is not needed in most cases, but please correct me if I'm wrong. - Additionally, remove the comment that the hbase library itself should be manually added, as this is not needed anymore.
[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140669#comment-13140669 ] Lewis John McGibbney commented on NUTCH-1138: - OK, so this patch for trunk seems to pass all my tests so far. Could I ask someone to provisionally apply it and test it for a day or so? I expect some errors to slip through the net somewhere down the line. remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078.
[jira] [Commented] (NUTCH-842) AutoGenerate WebPage code
[ https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135899#comment-13135899 ] Lewis John McGibbney commented on NUTCH-842: It has been a good while since this issue was last looked at. Does anyone have an opinion on where we are with this one? The patch doesn't incorporate the later comments above; is that something which would be required? AutoGenerate WebPage code - Key: NUTCH-842 URL: https://issues.apache.org/jira/browse/NUTCH-842 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: nutchgora Attachments: NUTCH-842.patch This issue will track the addition of an ant task that will automatically generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from src/gora/webpage.avsc.