date:20120321

[jira] [Updated] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Julien Nioche (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809-trunk.patch

Patch for Nutch-809 against trunk. Delegates the indexing to index-metatags

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, 
 NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
 parameter a list of metatag names, with '*' as the default value. The names are 
 separated by ';'.
 To extract the values of the description and keywords metatags, you 
 must specify the following in nutch-site.xml:
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search e.g. in 
 nutch-site.xml
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com
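
The handling of metatags.names described above can be illustrated with a small, self-contained Java sketch. It is not the plugin's MetaTagsParser code: the class and method names are hypothetical, and lower-casing the names and falling back to '*' for an empty value are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the parse-metatags plugin. */
public class MetatagNamesSketch {

    /** Splits a metatags.names value such as "description;keywords" on ';'. */
    public static Set<String> parseNames(String configured) {
        Set<String> names = new LinkedHashSet<>();
        if (configured == null || configured.trim().isEmpty()) {
            names.add("*"); // assumption: an unset value behaves like the '*' default
            return names;
        }
        for (String raw : configured.split(";")) {
            String name = raw.trim().toLowerCase(Locale.ROOT); // assumption: case-insensitive names
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Keeps only the metatags whose name was requested; "*" keeps everything. */
    public static Map<String, String> select(Map<String, String> metatags, Set<String> wanted) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metatags.entrySet()) {
            String key = entry.getKey().toLowerCase(Locale.ROOT);
            if (wanted.contains("*") || wanted.contains(key)) {
                kept.put(key, entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("description", "A sample page");
        page.put("keywords", "nutch, crawler");
        page.put("author", "someone");
        // Only description and keywords survive the selection.
        System.out.println(select(page, parseNames("description;keywords")));
    }
}
```

With the configuration shown in the issue, only the description and keywords entries would be passed on; with '*' every metatag found on the page would be kept.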

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Julien Nioche (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234316#comment-13234316
 ] 

Julien Nioche commented on NUTCH-809:
-

Trunk : Committed revision 1303371.

Not activated by default. See nutch-default.xml for details. 

TODO: update the wiki, port to the gora branch, add fields to Solr, and activate 
it by default (any volunteers?)


[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Hudson (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234340#comment-13234340
 ] 

Hudson commented on NUTCH-809:
--

Integrated in nutch-trunk-maven #206 (See 
[https://builds.apache.org/job/nutch-trunk-maven/206/])
NUTCH-809 Parse-metatags plugin (jnioche) (Revision 1303371)

 Result = SUCCESS
jnioche : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/parse-metatags
* /nutch/trunk/src/plugin/parse-metatags/README.txt
* /nutch/trunk/src/plugin/parse-metatags/build.xml
* /nutch/trunk/src/plugin/parse-metatags/ivy.xml
* /nutch/trunk/src/plugin/parse-metatags/plugin.xml
* /nutch/trunk/src/plugin/parse-metatags/sample
* /nutch/trunk/src/plugin/parse-metatags/sample/testMetatags.html
* /nutch/trunk/src/plugin/parse-metatags/src
* /nutch/trunk/src/plugin/parse-metatags/src/java
* /nutch/trunk/src/plugin/parse-metatags/src/java/org
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
* /nutch/trunk/src/plugin/parse-metatags/src/test
* /nutch/trunk/src/plugin/parse-metatags/src/test/org
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java



[jira] [Commented] (NUTCH-366) Merge URLFilters and URLNormalizers

2012-03-21 Thread Lewis John McGibbney (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234349#comment-13234349
]

Lewis John McGibbney commented on NUTCH-366:

Hi Apurv, this is great news :)
I suggest that, if you have not already done so, you take a look at NUTCH-365 and
try to put the material Andrzej mentioned into context. In parallel, I would take a
look at the way the current URLFilters and URLNormalizers are constructed with
regard to option 1 in the description. It would be great to get this moving as a
GSoC project.

Merge URLFilters and URLNormalizers
---

Key: NUTCH-366
URL: https://issues.apache.org/jira/browse/NUTCH-366
Project: Nutch
Issue Type: Improvement
Reporter: Andrzej Bialecki
Labels: gsoc2012

Currently Nutch uses two subsystems related to url validation and
normalization:
* URLFilter: this interface checks if URLs are valid for further processing.
Input URL is not changed in any way. The output is a boolean value.
* URLNormalizer: this interface brings URLs to their base (normal) form, or
removes unneeded URL components, or performs any other URL mangling as
necessary. Input URLs are changed, and are returned as result.
However, various Nutch tools run filters and normalizers in pre-determined
order, i.e. normalizers first, and then filters. In some cases, where
normalizers are complex and running them is costly (e.g. numerous regex
rules, DNS lookups) it would make sense to run some of the filters first
(e.g. prefix-based filters that select only certain protocols, or
suffix-based filters that select only known extensions). This is currently
not possible - we always have to run normalizers, only to later throw away
urls because they failed to pass through filters.
I would like to solicit comments on the following two solutions, and work on
implementation of one of them:
1) we could make URLFilters and URLNormalizers implement the same interface,
and basically make them interchangeable. This way users could configure their
order arbitrarily, even mixing filters and normalizers out of order. This is
more complicated, but gives much more flexibility - and NUTCH-365 already
provides sufficient framework to implement this, including the ability to
define different sequences for different steps in the workflow.
2) we could use a property url.mangling.order ;) to define whether
normalizers or filters should run first. This is simple to implement, but
provides only limited improvement - because either all filters or all
normalizers would run, they couldn't be mixed in arbitrary order.
Any comments?
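
As a rough illustration of solution 1 above, the following Java sketch gives filters and normalizers one shared interface so they can be chained in any order. The UrlStep interface, the two sample steps, and the convention that null means "rejected" are all hypothetical choices for this example; none of them are existing Nutch APIs.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a merged filter/normalizer chain; not Nutch code. */
public class UrlPipelineSketch {

    /** Common interface: returns the (possibly rewritten) URL, or null to reject it. */
    public interface UrlStep {
        String apply(String url);
    }

    /** Cheap filter: accepts only http and https URLs and leaves them unchanged. */
    public static final UrlStep PROTOCOL_FILTER = url ->
            (url.startsWith("http://") || url.startsWith("https://")) ? url : null;

    /** Stand-in for a costly normalizer: here it merely strips the fragment. */
    public static final UrlStep STRIP_FRAGMENT = url -> {
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    };

    /** Runs the steps in the configured order and stops at the first rejection. */
    public static String run(List<UrlStep> steps, String url) {
        String current = url;
        for (UrlStep step : steps) {
            current = step.apply(current);
            if (current == null) {
                return null; // later (possibly expensive) steps are skipped
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // The cheap filter is placed first, so rejected URLs never reach the normalizer.
        List<UrlStep> steps = Arrays.asList(PROTOCOL_FILTER, STRIP_FRAGMENT);
        System.out.println(run(steps, "http://example.org/page#top"));
        System.out.println(run(steps, "ftp://example.org/file"));
    }
}
```

Because every step has the same shape, the order is purely a matter of configuration, which is the flexibility solution 2 cannot offer.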


Re: [jira] [Created] (NUTCH-1319) HostNormalizer

2012-03-21 Thread Markus Jelsma


Hi Mathijs,

We use this in the fetcher (parse=true), when updating the CrawlDB, 
and with the free generator. We use it in the fetcher because we follow 
outlinks and want to make sure we follow the desired host, and in the CrawlDB 
because there we update records for recently added host normalizer 
rules.


It is just a URL normalizer like the others, but it only changes the host 
part. This is not covered by the other standard normalizers. The 
BasicURLNormalizer cannot do this, and the RegexURLNormalizer is far too 
heavy to take 20MB of expressions and is harder to auto-generate. A simple 
map lookup is very fast.


Cheers,

On Wed, 21 Mar 2012 22:22:54 +0100, Mathijs Homminga 
mathijs.hommi...@kalooga.com wrote:

Hi Markus,

How (where in the process) would you like to use this normalizer? Isn't
this functionality already covered by the URL normalizer(s)?

Mathijs Homminga

On Mar 21, 2012, at 22:06, Markus Jelsma (Created) (JIRA)
j...@apache.org wrote:


HostNormalizer
--

Key: NUTCH-1319
URL: 
https://issues.apache.org/jira/browse/NUTCH-1319

Project: Nutch
 Issue Type: New Feature
   Reporter: Markus Jelsma
   Assignee: Markus Jelsma
Fix For: 1.5


Nutch would benefit from having a host normalizer. A host normalizer 
maps a given host to the desired host. A basic example is to map 
www.apache.org to apache.org. The Apache website is one of many on the 
internet that has a duplicate website on the same domain just because 
it allows both www and non-www to return HTTP 200 and proper content.


It is also able to handle wildcards such as *.example.org to 
example.org if there are multiple sub domains that actually point to 
the same website.


Large internet crawls tend to get polluted very quickly due to these 
problems. It also leads to skewed scores in the webgraph as different 
websites link to different versions of the same duplicate website.
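
The host mapping described above, including the wildcard case, can be sketched with plain map lookups. This is only an illustration under stated assumptions: the class name, the addRule and normalizeHost methods, and the rule syntax handling are hypothetical and are not taken from the NUTCH-1319 patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a map-based host normalizer; not the NUTCH-1319 code. */
public class HostNormalizerSketch {

    private final Map<String, String> exact = new HashMap<>();
    // Wildcard rules are keyed by the domain that follows "*.".
    private final Map<String, String> wildcard = new HashMap<>();

    /** Registers a rule such as "www.apache.org" or "*.example.org". */
    public void addRule(String from, String to) {
        if (from.startsWith("*.")) {
            wildcard.put(from.substring(2), to);
        } else {
            exact.put(from, to);
        }
    }

    /** Returns the desired host, or the input unchanged when no rule matches. */
    public String normalizeHost(String host) {
        String target = exact.get(host);
        if (target != null) {
            return target;
        }
        // Walk up the labels: a.b.example.org -> b.example.org -> example.org
        int dot = host.indexOf('.');
        while (dot >= 0) {
            target = wildcard.get(host.substring(dot + 1));
            if (target != null) {
                return target;
            }
            dot = host.indexOf('.', dot + 1);
        }
        return host;
    }

    public static void main(String[] args) {
        HostNormalizerSketch normalizer = new HostNormalizerSketch();
        normalizer.addRule("www.apache.org", "apache.org");
        normalizer.addRule("*.example.org", "example.org");
        System.out.println(normalizer.normalizeHost("www.apache.org"));
        System.out.println(normalizer.normalizeHost("a.b.example.org"));
    }
}
```

The number of lookups per host is bounded by the number of labels in the host name, which is one way to see why this is much cheaper than evaluating a large set of regular expressions.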



[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2012-03-21 Thread Lewis John McGibbney (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1104:


Description: 
Umbrella issue for tracking issues that should be ported from 1.x trunk to the 
NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:

* NUTCH-809 Parse-metatags plugin
* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN 
requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet


NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported



  was:
Umbrella issue for tracking issues that should be ported from 1.x trunk to the 
NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:

* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN 
requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet


NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported




 Port issues from trunk NutchGora branch
 ---

 Key: NUTCH-1104
 URL: https://issues.apache.org/jira/browse/NUTCH-1104
 Project: Nutch
  Issue Type: Task
Affects Versions: nutchgora
Reporter: Markus Jelsma
 Fix For: nutchgora


 Umbrella issue for tracking issues that should be ported from 1.x trunk to 
 the

[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235108#comment-13235108
 ] 

Lewis John McGibbney commented on NUTCH-809:


Hi Julien,

Can you confirm what you would like to see added to the wiki? I will try my 
best to get this added. Are you referring to [0]? Also, I thought the best 
thing to do regarding porting to Nutchgora was just to add it to the 
ever-growing NUTCH-1104 list, so I have done so. If and when this is required 
over there, someone can duly oblige :)
Regarding adding fields to Solr, I assume you mean the schema and solr-mapping.xml?
Finally, can you expand on 'activate by default'? What exactly is it that is not 
activated by default? I read your README.txt but I can't see any mention of it 
in there.
Thanks

Oh, and great patch. As we know, this one is very much appreciated by 
everyone. 
[0] http://wiki.apache.org/nutch/IndexStructure


[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-21 Thread Hudson (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235333#comment-13235333
 ] 

Hudson commented on NUTCH-809:
--

Integrated in Nutch-trunk #1794 (See 
[https://builds.apache.org/job/Nutch-trunk/1794/])
NUTCH-809 Parse-metatags plugin (jnioche) (Revision 1303371)

 Result = SUCCESS
jnioche : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1303371
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/parse-metatags
* /nutch/trunk/src/plugin/parse-metatags/README.txt
* /nutch/trunk/src/plugin/parse-metatags/build.xml
* /nutch/trunk/src/plugin/parse-metatags/ivy.xml
* /nutch/trunk/src/plugin/parse-metatags/plugin.xml
* /nutch/trunk/src/plugin/parse-metatags/sample
* /nutch/trunk/src/plugin/parse-metatags/sample/testMetatags.html
* /nutch/trunk/src/plugin/parse-metatags/src
* /nutch/trunk/src/plugin/parse-metatags/src/java
* /nutch/trunk/src/plugin/parse-metatags/src/java/org
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
* /nutch/trunk/src/plugin/parse-metatags/src/test
* /nutch/trunk/src/plugin/parse-metatags/src/test/org
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java


