Re: question about ObjectCache
On 10/04/2012 05:00, Xiaolong Yang wrote:
> Hi all,
>
> I'm reading the Nutch source code and I am puzzled by ObjectCache.java in package org.apache.nutch.util. It seems to bring little benefit in urlnormalizers and urlfilters. I have also read the discussions about caching in NUTCH-169 and NUTCH-501, but I can't understand them.
>
> Can anyone tell me where ObjectCache is used in Nutch and where it brings a real benefit?

ObjectCache is designed to cache ready-to-use instances of Nutch plugins. The process of finding, instantiating and initializing plugins is inefficient, because it involves parsing plugin descriptors, initializing plugins, collecting the ones that implement the correct extension points, etc. It would kill performance if this process were invoked each time you want to run all plugins of a given type (e.g. URLNormalizer-s).

The facades URLNormalizers, URLFilters and others make sure that plugin instances of a given type are initialized once per lifetime of a JVM. They are then cached in ObjectCache, so that the next time you want to use them they can be retrieved from the cache instead of going through the process of parsing/instantiating/initializing again.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
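To make the caching idea in the reply above concrete, here is a minimal sketch in plain Java. It is not Nutch's ObjectCache: the class PluginCacheSketch, the method getOrLoad and the key "URLNormalizer" are hypothetical names chosen for illustration, and the "expensive" loader merely stands in for plugin discovery and initialization. It only shows that the costly step runs once and later lookups are served from the cache.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for a cache of initialized plugin instances (not Nutch code). */
final class PluginCacheSketch {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    /** Returns the cached value for key; the loader runs only on the first request. */
    @SuppressWarnings("unchecked")
    <T> T getOrLoad(String key, Supplier<T> loader) {
        return (T) cache.computeIfAbsent(key, k -> loader.get());
    }

    public static void main(String[] args) {
        PluginCacheSketch cache = new PluginCacheSketch();
        AtomicInteger loads = new AtomicInteger();

        // Stands in for parsing descriptors and instantiating plugins (assumed to be costly).
        Supplier<List<String>> expensiveLoader = () -> {
            loads.incrementAndGet();
            return List.of("normalizer-a", "normalizer-b");
        };

        // Simulate many calls that each need "all plugins of a given type".
        for (int i = 0; i < 1000; i++) {
            cache.getOrLoad("URLNormalizer", expensiveLoader);
        }
        System.out.println("loader ran " + loads.get() + " time(s)");
    }
}
```

With 1000 lookups the loader still runs once, which is the benefit described above: the per-call cost drops to a map lookup.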
[jira] [Updated] (NUTCH-1330) OutlinkDB to preserve back up
[ https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1330:

    Attachment: NUTCH-1330-1.6-2.patch

The previous patch was bad and came from an old checkout. This is the proper patch.

OutlinkDB to preserve back up

                Key: NUTCH-1330
                URL: https://issues.apache.org/jira/browse/NUTCH-1330
            Project: Nutch
         Issue Type: Improvement
   Affects Versions: 1.4
           Reporter: Markus Jelsma
           Assignee: Markus Jelsma
            Fix For: 1.6
        Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch

The webgraph's outlinkDB is the single source for all scoring jobs and GB's that eventually come out. In case of disaster, which hasn't happened yet, it should be able to preserve a backup just like the other DBs.

This means users with an existing outlinkdb must move it from crawl/webgraphdb/outlinks/ to crawl/webgraphdb/outlinks/current/.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
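The move described in NUTCH-1330 (outlinks/ to outlinks/current/) could be scripted roughly as below. This is only a sketch under stated assumptions: it works on a local filesystem with java.nio.file, whereas a webgraphdb stored on HDFS would need the Hadoop filesystem tools instead. The class name, the staging directory name "outlinks.migrating" and the demo layout with a "part-00000" directory are all hypothetical, and the check for an existing current/ directory is just one possible way to avoid migrating twice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; assumes a local filesystem, not HDFS. */
final class OutlinkDbMigrationSketch {

    /** Moves the contents of webgraphdb/outlinks/ under webgraphdb/outlinks/current/. */
    static void migrate(Path webGraphDb) throws IOException {
        Path outlinks = webGraphDb.resolve("outlinks");
        Path current = outlinks.resolve("current");
        if (Files.isDirectory(current)) {
            return; // assumed to be in the new layout already
        }
        // A directory cannot be moved into itself, so go through a sibling name first.
        Path staging = webGraphDb.resolve("outlinks.migrating"); // hypothetical temp name
        Files.move(outlinks, staging);
        Files.createDirectory(outlinks);
        Files.move(staging, current);
    }

    /** Builds a throwaway old-style layout in a temp directory and migrates it. */
    static boolean runDemo() {
        try {
            Path db = Files.createTempDirectory("webgraphdb");
            Files.createDirectories(db.resolve("outlinks").resolve("part-00000"));
            migrate(db);
            migrate(db); // second call is a no-op
            return Files.isDirectory(
                db.resolve("outlinks").resolve("current").resolve("part-00000"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("migrated: " + runDemo());
    }
}
```

Going through a staging name keeps each step a plain rename, so nothing is copied and the data is never left half-moved inside outlinks/.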
Nutch 1.x trunk release
Hi guys,

Chris - any idea if / when you'll have the time to do an RC for trunk?

Thanks

Julien

On 3 April 2012 15:30, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
> Thanks Lewis!
>
> Cheers,
> Chris
>
> P.S. Hopefully by this weekend...
>
> On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:
>> Hi,
>>
>> On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote:
>>> Seems fine. Only updating KEYS is no longer necessary.
>>
>> Now sorted. Thanks whenever you can get round to this Chris.
>>
>> Best
>> Lewis
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Nutch 1.x trunk release
Hey Julien,

Yeah my weekend flew by -- this and the SIS RC are the top items on my opensource TODO :) Hopefully this week...

Cheers,
Chris

On Apr 10, 2012, at 8:07 AM, Julien Nioche wrote:
> Hi guys,
>
> Chris - any idea of if / when you'll have the time to do a RC for trunk?
>
> Thanks
>
> Julien
>
> On 3 April 2012 15:30, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
>> Thanks Lewis!
>>
>> Cheers,
>> Chris
>>
>> P.S. Hopefully by this weekend...
>>
>> On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:
>>> Hi,
>>>
>>> On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote:
>>>> Seems fine. Only updating KEYS is no longer necessary.
>>>
>>> Now sorted. Thanks whenever you can get round to this Chris.
>>>
>>> Best
>>> Lewis
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: question about ObjectCache
Hi Andrzej,

Thank you for your detailed answer. I now understand how it is used.

2012/4/10 Andrzej Bialecki a...@getopt.org
> On 10/04/2012 05:00, Xiaolong Yang wrote:
>> Hi all,
>>
>> I'm reading the Nutch source code and I am puzzled by ObjectCache.java in package org.apache.nutch.util. It seems to bring little benefit in urlnormalizers and urlfilters. I have also read the discussions about caching in NUTCH-169 and NUTCH-501, but I can't understand them.
>>
>> Can anyone tell me where ObjectCache is used in Nutch and where it brings a real benefit?
>
> ObjectCache is designed to cache ready-to-use instances of Nutch plugins. The process of finding, instantiating and initializing plugins is inefficient, because it involves parsing plugin descriptors, initializing plugins, collecting the ones that implement the correct extension points, etc. It would kill performance if this process were invoked each time you want to run all plugins of a given type (e.g. URLNormalizer-s).
>
> The facades URLNormalizers, URLFilters and others make sure that plugin instances of a given type are initialized once per lifetime of a JVM. They are then cached in ObjectCache, so that the next time you want to use them they can be retrieved from the cache instead of going through the process of parsing/instantiating/initializing again.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
[jira] [Commented] (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251297#comment-13251297 ]

Manuel Antonio Novoa commented on NUTCH-422:

Can I use this plugin to index properties of HTML img tags? For example:

<img alt="this is the text I want to index" src="this is another text that I want to index">

index-extra plugin creates additional fields in the index, based on configurable logic

                Key: NUTCH-422
                URL: https://issues.apache.org/jira/browse/NUTCH-422
            Project: Nutch
         Issue Type: New Feature
         Components: indexer
   Affects Versions: 0.8.1
        Environment: All environments
           Reporter: Alan Tanaman
           Assignee: Sami Siren
            Fix For: 1.5
        Attachments: ExtraIndexingFilter.java, index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip

Extract from the Readme file:

A. Introduction

The index-extra plugin allows you to configure additional fields that you wish to be added to the index, based on one of the following sources:
- The parsed text
- Meta data fields
- Previously created document-to-be-indexed fields
- Plain constant string
- Java expression combining one or more of the above, and resolving to a string

A regex can also be applied to any of the above, allowing fields to be created based on patterns extracted from the source.

B. Installation

1) Binaries only:
- Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
- Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
- Enable the plugin by updating the nutch-site.xml file

2) Source code:
Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
- Copy the 'index-extra' folder within index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
- Update the build.xml in NUTCHDIR/src/plugin to include the plugin
- Update the NUTCHDIR/default.properties file to include the plugin
- Run ant to build
- Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
- Enable the plugin by updating the nutch-site.xml file

C. Known Issues

1) For this plugin to work correctly on any document field, it is necessary to run the other index filters first, so that all basic document fields are generated first. To do this, configure the indexingfilter.order property. (Please see patch NUTCH-421 to enable the indexingfilter.order property. If this patch is not applied, the plugin will still work, but will not be able to use document fields created by other index filter plugins.)

2) At this stage, field boost cannot be used, as Nutch scoring overrides the field boost with its own document-level boost calculation. This occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.
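The Readme's idea of deriving an extra field by applying a regex to a source such as the parsed text can be illustrated with a small sketch. This is not the plugin's code or configuration format: the class ExtraFieldSketch, the method extract, the field names "orderRef" and "source", and the sample text are all hypothetical, and a plain Map stands in for the document being indexed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of regex-derived index fields (not the index-extra plugin). */
final class ExtraFieldSketch {

    /** Returns capture group 1 of the first match of pattern in source, if any. */
    static Optional<String> extract(String source, Pattern pattern) {
        Matcher m = pattern.matcher(source);
        return m.find() ? Optional.ofNullable(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the parsed text of a crawled page.
        String parsedText = "Order ref: AB-1234 shipped to Pasadena";

        // Stand-in for the document-to-be-indexed: field name -> value.
        Map<String, String> doc = new LinkedHashMap<>();

        // Field built from a pattern extracted from the parsed text.
        extract(parsedText, Pattern.compile("ref: ([A-Z]{2}-\\d{4})"))
            .ifPresent(value -> doc.put("orderRef", value));

        // Field built from a plain constant string.
        doc.put("source", "crawl");

        System.out.println(doc); // prints {orderRef=AB-1234, source=crawl}
    }
}
```

If the pattern does not match, no field is added, which seems the least surprising behavior for an optional extra field; the actual plugin may handle that case differently.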