[jira] [Updated] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
--------------------------------
Attachment: NUTCH-809-trunk.patch

Patch for NUTCH-809 against trunk. Delegates the indexing to index-metatags

Parse-metatags plugin
---------------------
Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.5
Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip

h2. Parse-metatags plugin

The parse-metatags plugin consists of a HTMLParserFilter which takes as parameter a list of metatag names with '*' as default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 'keywords' and 'description'. Note that keywords is multivalued. The query-basic plugin is used to include these fields in the search e.g. in nutch-site.xml

{code:xml}
<property>
  <name>query.basic.description.boost</name>
  <value>2.0</value>
</property>
<property>
  <name>query.basic.keywords.boost</name>
  <value>2.0</value>
</property>
{code}

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
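To illustrate the parameter format described in the issue above (a ';'-separated list of metatag names, with '*' as the default), here is a minimal Java sketch. It is not taken from the plugin source: the class `MetatagNames` and its methods are hypothetical, and treating '*' as "accept every name" and comparing names case-insensitively are assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the parse-metatags plugin.
final class MetatagNames {
    private final Set<String> names = new LinkedHashSet<>();
    private final boolean matchAll;

    // Parses a value such as "description;keywords".
    // A null or blank value falls back to "*" (assumed to mean: accept every name).
    MetatagNames(String configured) {
        String value = (configured == null || configured.trim().isEmpty()) ? "*" : configured;
        boolean all = false;
        for (String part : value.split(";")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue; // ignore empty entries such as a trailing ';'
            }
            if (name.equals("*")) {
                all = true;
            } else {
                names.add(name);
            }
        }
        this.matchAll = all;
    }

    // True if a metatag with this name would be extracted under the assumptions above.
    boolean accepts(String metatagName) {
        return matchAll || names.contains(metatagName.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        MetatagNames selected = new MetatagNames("description;keywords");
        System.out.println(selected.accepts("Keywords"));
        System.out.println(selected.accepts("author"));
        System.out.println(new MetatagNames(null).accepts("author"));
    }
}
```

With the value from the issue, only `description` and `keywords` are accepted; leaving the property unset behaves like '*'.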
[jira] [Commented] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234316#comment-13234316 ]

Julien Nioche commented on NUTCH-809:
-------------------------------------

Trunk : Committed revision 1303371. Not activated by default. See nutch-default.xml for details.

TODO
* update the WIKI
* port to the gora branch
* add fields to SOLR and activate it by default (any volunteers?)
[jira] [Commented] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234340#comment-13234340 ]

Hudson commented on NUTCH-809:
------------------------------

Integrated in nutch-trunk-maven #206 (See [https://builds.apache.org/job/nutch-trunk-maven/206/])
NUTCH-809 Parse-metatags plugin (jnioche) (Revision 1303371)

Result = SUCCESS
jnioche :
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/parse-metatags
* /nutch/trunk/src/plugin/parse-metatags/README.txt
* /nutch/trunk/src/plugin/parse-metatags/build.xml
* /nutch/trunk/src/plugin/parse-metatags/ivy.xml
* /nutch/trunk/src/plugin/parse-metatags/plugin.xml
* /nutch/trunk/src/plugin/parse-metatags/sample
* /nutch/trunk/src/plugin/parse-metatags/sample/testMetatags.html
* /nutch/trunk/src/plugin/parse-metatags/src
* /nutch/trunk/src/plugin/parse-metatags/src/java
* /nutch/trunk/src/plugin/parse-metatags/src/java/org
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
* /nutch/trunk/src/plugin/parse-metatags/src/test
* /nutch/trunk/src/plugin/parse-metatags/src/test/org
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java
[jira] [Commented] (NUTCH-366) Merge URLFilters and URLNormalizers
[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234349#comment-13234349 ]

Lewis John McGibbney commented on NUTCH-366:
--------------------------------------------

Hi Apurv, this is great news :) I suggest that, if you have not already done so, you take a look at NUTCH-365. Try to put the material Andrzej mentioned into context. In parallel I would take a look at the way the current URLFilters and URLNormalizers are constructed, with regard to option 1) in the issue description. It would be great to get this moving as a GSoC project.

Merge URLFilters and URLNormalizers
-----------------------------------
Key: NUTCH-366
URL: https://issues.apache.org/jira/browse/NUTCH-366
Project: Nutch
Issue Type: Improvement
Reporter: Andrzej Bialecki
Labels: gsoc2012

Currently Nutch uses two subsystems related to URL validation and normalization:

* URLFilter: this interface checks if URLs are valid for further processing. Input URL is not changed in any way. The output is a boolean value.
* URLNormalizer: this interface brings URLs to their base (normal) form, or removes unneeded URL components, or performs any other URL mangling as necessary. Input URLs are changed, and are returned as result.

However, various Nutch tools run filters and normalizers in pre-determined order, i.e. normalizers first, and then filters. In some cases, where normalizers are complex and running them is costly (e.g. numerous regex rules, DNS lookups), it would make sense to run some of the filters first (e.g. prefix-based filters that select only certain protocols, or suffix-based filters that select only known extensions). This is currently not possible: we always have to run normalizers, only to later throw away URLs because they failed to pass through filters.

I would like to solicit comments on the following two solutions, and work on implementation of one of them:

1) we could make URLFilters and URLNormalizers implement the same interface, and basically make them interchangeable. This way users could configure their order arbitrarily, even mixing filters and normalizers out of order. This is more complicated, but gives much more flexibility, and NUTCH-365 already provides sufficient framework to implement this, including the ability to define different sequences for different steps in the workflow.

2) we could use a property url.mangling.order ;) to define whether normalizers or filters should run first. This is simple to implement, but provides only limited improvement, because either all filters or all normalizers would run; they couldn't be mixed in arbitrary order.

Any comments?
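Option 1) from the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions, not Nutch code: the `UrlPipeline` class, its nested `Step` interface, and the two sample steps are hypothetical names, and using `null` to signal "rejected" is an assumed convention. The point is that a cheap filter can be placed ahead of a costly normalizer when both share one interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a shared interface for filters and normalizers; not Nutch API.
final class UrlPipeline {

    // One step in the chain: returns the (possibly rewritten) URL, or null to reject it.
    interface Step {
        String process(String url);
    }

    private final List<Step> steps;

    UrlPipeline(List<Step> steps) {
        this.steps = new ArrayList<>(steps);
    }

    // Runs the steps in the configured order and stops at the first rejection.
    String run(String url) {
        String current = url;
        for (Step step : steps) {
            current = step.process(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    // A cheap filter: accepts only URLs that start with the given prefix.
    static Step prefixFilter(String prefix) {
        return url -> url.startsWith(prefix) ? url : null;
    }

    // Stand-in for a (potentially costly) normalizer: strips the fragment part.
    static Step stripFragment() {
        return url -> {
            int hash = url.indexOf('#');
            return hash < 0 ? url : url.substring(0, hash);
        };
    }

    public static void main(String[] args) {
        // The filter runs first, so rejected URLs never reach the normalizer.
        UrlPipeline pipeline = new UrlPipeline(Arrays.asList(
                prefixFilter("http://"), stripFragment()));
        System.out.println(pipeline.run("http://example.org/page#top"));
        System.out.println(pipeline.run("ftp://example.org/file"));
    }
}
```

Option 2) would correspond to keeping two separate lists and choosing only which list runs first, which is why it cannot interleave individual filters and normalizers.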
Re: [jira] [Created] (NUTCH-1319) HostNormalizer
Hi Mathijs,

We use this in the fetcher (parse=true), when updating the CrawlDB, and with the free generator. We use it in the fetcher because we follow outlinks and want to make sure we follow the desired host, and in the CrawlDB because there we update records for recently added host normalizer rules.

It is just a URL normalizer like the others, but it only changes the host part. This is not covered by the other standard normalizers. The BasicURLNormalizer cannot do this, and the RegexURLNormalizer is far too heavy to handle 20MB of expressions, which are also harder to auto-generate. A simple map lookup is very fast.

Cheers,

On Wed, 21 Mar 2012 22:22:54 +0100, Mathijs Homminga mathijs.hommi...@kalooga.com wrote:

Hi Markus,

How (where in the process) would you like to use this normalizer? Isn't this functionality already covered by the URL normalizer(s)?

Mathijs Homminga

On Mar 21, 2012, at 22:06, Markus Jelsma (Created) (JIRA) j...@apache.org wrote:

HostNormalizer
--------------
Key: NUTCH-1319
URL: https://issues.apache.org/jira/browse/NUTCH-1319
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5

Nutch would benefit from having a host normalizer. A host normalizer maps a given host to the desired host. A basic example is to map www.apache.org to apache.org. The Apache website is one of many on the internet that has a duplicate website on the same domain just because it allows both www and non-www to return HTTP 200 and proper content. It is also able to handle wildcards such as *.example.org to example.org if there are multiple sub domains that actually point to the same website. Large internet crawls tend to get polluted very quickly due to these problems. It also leads to skewed scores in the webgraph as different websites link to different versions of the same duplicate website.
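The "simple map lookup" idea from this thread, including the wildcard case from the issue description, might look roughly like the Java sketch below. It is not the NUTCH-1319 patch: the `HostMapper` class, its rule syntax, and the choice to match wildcards by walking up the parent domains are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a host normalizer based on map lookups; not the NUTCH-1319 patch.
final class HostMapper {
    private final Map<String, String> exact = new HashMap<>();
    // Wildcard rules keyed by suffix, e.g. "example.org" for the pattern "*.example.org".
    private final Map<String, String> wildcard = new HashMap<>();

    // Registers a rule such as "www.apache.org" -> "apache.org"
    // or "*.example.org" -> "example.org".
    void addRule(String pattern, String target) {
        String p = pattern.toLowerCase(Locale.ROOT);
        if (p.startsWith("*.")) {
            wildcard.put(p.substring(2), target);
        } else {
            exact.put(p, target);
        }
    }

    // Returns the mapped host, or the lower-cased input host when no rule applies.
    String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        String mapped = exact.get(h);
        if (mapped != null) {
            return mapped;
        }
        // Walk up the parent domains: a.b.example.org -> b.example.org -> example.org.
        int dot = h.indexOf('.');
        while (dot >= 0) {
            mapped = wildcard.get(h.substring(dot + 1));
            if (mapped != null) {
                return mapped;
            }
            dot = h.indexOf('.', dot + 1);
        }
        return h;
    }

    public static void main(String[] args) {
        HostMapper mapper = new HostMapper();
        mapper.addRule("www.apache.org", "apache.org");
        mapper.addRule("*.example.org", "example.org");
        System.out.println(mapper.normalize("www.apache.org"));
        System.out.println(mapper.normalize("a.b.example.org"));
        System.out.println(mapper.normalize("nutch.apache.org"));
    }
}
```

Each lookup costs at most one hash probe per label in the host name, which is the reason a map scales better here than a long list of regular expressions.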
[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch
[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1104:
----------------------------------------

Description:
Umbrella issue for tracking issues that should be ported from 1.x trunk to the NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:
* NUTCH-809 Parse-metatags plugin
* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet

NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported

was:
Umbrella issue for tracking issues that should be ported from 1.x trunk to the NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:
* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet

NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported

Port issues from trunk NutchGora branch
---------------------------------------
Key: NUTCH-1104
URL: https://issues.apache.org/jira/browse/NUTCH-1104
Project: Nutch
Issue Type: Task
Affects Versions: nutchgora
Reporter: Markus Jelsma
Fix For: nutchgora

Umbrella issue for tracking issues that should be ported from 1.x trunk to the
[jira] [Commented] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235108#comment-13235108 ]

Lewis John McGibbney commented on NUTCH-809:
--------------------------------------------

Hi Julien,

Can you confirm what you would like to see added to the wiki? I will try my best to get this added. Are you referring to [0]?

Also, I thought the best thing to do regarding porting to Nutchgora is just to add it to the ever-growing NUTCH-1104 list, so I have done so. If and when this is required over there, someone can duly oblige :)

Regarding adding fields to Solr, I assume you mean the schema and solr-mapping.xml?

Finally, can you expand on 'activate by default': what exactly is it that is not activated by default? I read your README.txt but I can't see any mention of it in there.

Thanks

Oh, and great patch. This is one which, as we know, is very much appreciated by everyone.

[0] http://wiki.apache.org/nutch/IndexStructure
[jira] [Commented] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235333#comment-13235333 ]

Hudson commented on NUTCH-809:
------------------------------

Integrated in Nutch-trunk #1794 (See [https://builds.apache.org/job/Nutch-trunk/1794/])
NUTCH-809 Parse-metatags plugin (jnioche) (Revision 1303371)

Result = SUCCESS
jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1303371
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/parse-metatags
* /nutch/trunk/src/plugin/parse-metatags/README.txt
* /nutch/trunk/src/plugin/parse-metatags/build.xml
* /nutch/trunk/src/plugin/parse-metatags/ivy.xml
* /nutch/trunk/src/plugin/parse-metatags/plugin.xml
* /nutch/trunk/src/plugin/parse-metatags/sample
* /nutch/trunk/src/plugin/parse-metatags/sample/testMetatags.html
* /nutch/trunk/src/plugin/parse-metatags/src
* /nutch/trunk/src/plugin/parse-metatags/src/java
* /nutch/trunk/src/plugin/parse-metatags/src/java/org
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
* /nutch/trunk/src/plugin/parse-metatags/src/test
* /nutch/trunk/src/plugin/parse-metatags/src/test/org
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html
* /nutch/trunk/src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java