Re: question about ObjectCache

2012-04-10 Thread Andrzej Bialecki

On 10/04/2012 05:00, Xiaolong Yang wrote:

Hi all,

I'm reading the Nutch source code and I'm puzzled by ObjectCache.java in
the package org.apache.nutch.util. As far as I can tell, it brings little
benefit where it is used in urlnormalizers and urlfilters. I have also
read some of the discussion about caching in NUTCH-169 and NUTCH-501, but
I can't understand it.

Can anyone tell me where ObjectCache is used in Nutch and where it brings
a real benefit?


ObjectCache is designed to cache ready-to-use instances of Nutch 
plugins. The process of finding, instantiating and initializing plugins 
is inefficient, because it involves parsing plugin descriptors, 
initializing plugins, collecting the ones that implement correct 
extension points, etc.


It would kill performance if this process were invoked each time you 
want to run all plugins of a given type (e.g. URLNormalizer-s). The 
facade URLNormalizers/URLFilters and others make sure that plugin 
instances of a given type are initialized once per lifetime of a JVM, 
and then they are cached in ObjectCache, so that next time you want to 
use them they can be retrieved from a cache, instead of going again 
through the process of parsing/instantiating/initializing.
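
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual Nutch code: the class PluginCacheSketch, the Normalizer interface, the methods getOrLoad and loadNormalizers, and the cache key are all hypothetical, and the expensive plugin discovery is only simulated by a counter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not org.apache.nutch.util.ObjectCache.
class PluginCacheSketch {

    // Stand-in for a plugin extension point such as a URL normalizer.
    interface Normalizer {
        String normalize(String url);
    }

    // Cache of ready-to-use plugin instances, keyed by extension point name.
    private final Map<String, Object> cache = new HashMap<>();

    // Counts how often the (simulated) expensive plugin discovery runs.
    int expensiveLoads = 0;

    // Returns the cached value for the key, running the loader only on a miss.
    @SuppressWarnings("unchecked")
    synchronized <T> T getOrLoad(String key, Supplier<T> loader) {
        Object cached = cache.get(key);
        if (cached == null) {
            cached = loader.get();
            cache.put(key, cached);
        }
        return (T) cached;
    }

    // Simulates parsing descriptors, instantiating and initializing plugins.
    List<Normalizer> loadNormalizers() {
        expensiveLoads++;
        Normalizer trim = u -> u.trim();
        Normalizer lowerCase = u -> u.toLowerCase();
        return Arrays.asList(trim, lowerCase);
    }

    // Facade method: callers never trigger plugin discovery directly.
    String normalize(String url) {
        List<Normalizer> normalizers = getOrLoad("normalizers", this::loadNormalizers);
        for (Normalizer n : normalizers) {
            url = n.normalize(url);
        }
        return url;
    }

    public static void main(String[] args) {
        PluginCacheSketch facade = new PluginCacheSketch();
        System.out.println(facade.normalize("  HTTP://Example.COM/A "));
        System.out.println(facade.normalize("http://example.com/B"));
        // The discovery step ran once even though normalize() was called twice.
        System.out.println("expensive loads: " + facade.expensiveLoads);
    }
}
```

The facade owns the cache lookup, so every caller that wants to run the normalizers shares the same initialized instances for the lifetime of the process.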


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Updated] (NUTCH-1330) OutlinkDB to preserve back up

2012-04-10 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1330:
-

Attachment: NUTCH-1330-1.6-2.patch

Previous patch is bad and came from an old checkout. This is the proper patch.

 OutlinkDB to preserve back up
 -

 Key: NUTCH-1330
 URL: https://issues.apache.org/jira/browse/NUTCH-1330
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch


 The webgraph's outlinkDB is the single source for all scoring jobs and the 
 GBs of data that eventually come out of them. In case of disaster, which has 
 not happened yet, it should be able to preserve a backup just like the other 
 DBs. This means users with an existing outlinkdb must move it from 
 crawl/webgraphdb/outlinks/ to crawl/webgraphdb/outlinks/current/.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Nutch 1.x trunk release

2012-04-10 Thread Julien Nioche
Hi guys,

Chris - any idea if / when you'll have time to do an RC for trunk?

Thanks

Julien

On 3 April 2012 15:30, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Thanks Lewis!

 Cheers,
 Chris

 P.S. Hopefully by this weekend...

 On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:

  Hi,
 
  On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma 
 markus.jel...@openindex.io wrote:
 
 
  Seems fine. Only updating KEYS is no longer necessary.
 
  Now sorted.
 
  Thanks whenever you can get round to this Chris.
 
  Best
 
  Lewis


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Nutch 1.x trunk release

2012-04-10 Thread Mattmann, Chris A (388J)
Hey Julien,

Yeah my weekend flew by -- this and the SIS RC are the top items on my
opensource TODO :)

Hopefully this week...

Cheers,
Chris



++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: question about ObjectCache

2012-04-10 Thread Xiaolong Yang
Hi Andrzej,

 Thank you for your detailed answer. I understand its use now.



[jira] [Commented] (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2012-04-10 Thread Manuel Antonio Novoa (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251297#comment-13251297
 ] 

Manuel Antonio Novoa commented on NUTCH-422:


Can I use this plugin to index the properties of HTML img tags?

For example: <img alt="this is the text I want to index" src="this is another 
text that I want to index">

 index-extra plugin creates additional fields in the index, based on 
 configurable logic
 --

 Key: NUTCH-422
 URL: https://issues.apache.org/jira/browse/NUTCH-422
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.8.1
 Environment: All environments
Reporter: Alan Tanaman
Assignee: Sami Siren
 Fix For: 1.5

 Attachments: ExtraIndexingFilter.java, 
 index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip


 Extract from the Readme file:

 A.  Introduction
 The index-extra plugin allows you to configure additional fields that you 
 wish to be added to the index, based on one of the following sources:
   - The parsed text
   - Meta data fields
   - Previously created document-to-be-indexed fields
   - Plain constant string
   - Java expression combining one or more of the above, and resolving to 
     a string
 A regex can also be applied to any of the above, allowing fields to be 
 created based on patterns extracted from the source.

 B.  Installation
 1)  Binaries only:
     - Copy the 'index-extra' folder within 
       index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
     - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
     - Enable the plugin by updating the nutch-site.xml file
 2)  Source code: Always refer to the Nutch wiki for detailed instructions 
     on building Nutch.  In short:
     - Copy the 'index-extra' folder within index-extra-v1.0-source.zip to 
       NUTCHDIR/src/plugin
     - Update the build.xml in NUTCHDIR/src/plugin to include plugin
     - Update the NUTCHDIR/default.properties file to include plugin
     - run ant to build
     - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
     - Enable the plugin by updating the nutch-site.xml file

 C.  Known Issues
 1)  For this plugin to work correctly on any document field, it is 
     necessary to run the other index filters first, so that all basic 
     document fields are generated first.  To do this, configure the 
     indexingfilter.order property.  (Please see patch NUTCH-421 to enable 
     indexingfilter.order property. If this patch is not applied, the plugin 
     will still work, but will not be able to use document fields created by 
     other index filter plugins.)
 2)  At this stage, field boost can not be used as Nutch scoring overrides 
     the field boost with its own document-level boost calculation.  This 
     occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.
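
The regex option mentioned in the Introduction of the Readme extract can be illustrated roughly as follows. This is only a sketch of the general idea, not the plugin's code or its configuration format: the class ExtraFieldSketch, the method extractField, the field name and the pattern are all hypothetical, and the document is modelled as a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the index-extra plugin's actual code.
class ExtraFieldSketch {

    // Applies the regex to the source text and, on a match, stores the first
    // capturing group in the document under the given field name.
    // Assumes the pattern defines at least one capturing group.
    static void extractField(Map<String, String> doc, String fieldName,
                             String source, Pattern regex) {
        Matcher m = regex.matcher(source);
        if (m.find()) {
            doc.put(fieldName, m.group(1));
        }
    }

    public static void main(String[] args) {
        // The "document to be indexed" is modelled as a simple field map.
        Map<String, String> doc = new LinkedHashMap<>();
        String parsedText = "Order no. 12345 shipped on Monday";
        extractField(doc, "orderNumber", parsedText,
                     Pattern.compile("Order no\\. (\\d+)"));
        System.out.println(doc);
    }
}
```

When the pattern does not match, no field is added, so documents without the pattern are indexed without the extra field.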
