[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788059#comment-13788059 ]

Hudson commented on NUTCH-1562:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2380 (See [https://builds.apache.org/job/Nutch-trunk/2380/])
NUTCH-1562 (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1529813)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java
* /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java

> Order of execution for scoring filters
> --------------------------------------
>
>                 Key: NUTCH-1562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1562
>             Project: Nutch
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.6, 2.1
>            Reporter: Julien Nioche
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2,
>                      NUTCH-1562-trunk.patch.v3
>
>
> The documentation in nutch-default.xml states that :
> {quote}
> scoring.filter.order
>
> The order in which scoring filters are applied.
> This may be left empty (in which case all available scoring
> filters will be applied in the order defined in plugin-includes
> and plugin-excludes), or a space separated list of implementation
> classes.
> {quote}
> however if no order is specified the filters are ordered randomly and not in
> the order defined in plugin-includes.
> The other *order parameters (e.g. urlfilter.order) have a different
> documentation and "are loaded and applied in system defined order" which
> corresponds to what the code does.
> The patch attached is for 1.x and puts the code in accordance with the
> documentation by ordering the filters according to the order of the plugins,
> which gives users more control without having to specify the classes
> explicitly in scoring.filter.order.
> We could extend the same idea to the other *order params.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
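The behaviour described in the issue (use the explicit order if one is configured, otherwise fall back to the order of the plugins) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `FilterOrderSketch`, the method `resolveOrder`, and the `example.*` filter names are hypothetical and are not taken from the Nutch code base or from the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the code from NUTCH-1562-trunk.patch.
public class FilterOrderSketch {

  /**
   * Resolves the order in which filters are applied.
   *
   * @param loadedInPluginOrder filter class names in the order their plugins were loaded
   * @param orderProperty value of an order property such as scoring.filter.order;
   *                      may be null or blank
   */
  public static List<String> resolveOrder(List<String> loadedInPluginOrder, String orderProperty) {
    if (orderProperty == null || orderProperty.trim().isEmpty()) {
      // No explicit order configured: keep the plugin order unchanged.
      return new ArrayList<>(loadedInPluginOrder);
    }
    List<String> ordered = new ArrayList<>();
    // The property is a whitespace-separated list of implementation classes.
    for (String name : orderProperty.trim().split("\\s+")) {
      // Assumption of this sketch: unknown or repeated names are skipped.
      if (loadedInPluginOrder.contains(name) && !ordered.contains(name)) {
        ordered.add(name);
      }
    }
    return ordered;
  }

  public static void main(String[] args) {
    List<String> loaded = Arrays.asList("example.DepthFilter", "example.OpicFilter");
    System.out.println(resolveOrder(loaded, ""));
    System.out.println(resolveOrder(loaded, "example.OpicFilter example.DepthFilter"));
  }
}
```

With an empty property the result mirrors the order in which the filters were loaded; with an explicit list, only the listed classes are returned, in the listed order.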
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788039#comment-13788039 ]

Julien Nioche commented on NUTCH-1562:
--------------------------------------

Hi Seb

You are right about the order from plugin.includes, this had completely passed me by.
I really like your patch, it makes loads of sense to centralize that code and will make it simpler to address NUTCH-1606 for instance.
Will commit your patch shortly with a minor modification (getOrderedPlugins() is synchronized)

Thanks
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787361#comment-13787361 ]

Sebastian Nagel commented on NUTCH-1562:
----------------------------------------

Hi Julien,

originally, this issue was only about ordering of scoring filters in "order defined in plugin-includes and plugin-excludes". Is this ever possible? It seems that the order of filter plugins does not depend on how "plugin.includes" is written: the order is stable but "random". Property "plugin.includes" is a regular expression only used to filter plugins. Unrolling a regex to an ordered list is not simple, sometimes almost impossible, because both {{scoring-(depth|opic)}} and {{scoring-(d\[Ee]pth|.p.c)}} are valid and cause exactly the same plugins to be loaded (until you start implementing a {{scoring-apoc}} plugin).
Maybe we should simply fix the description in nutch-default.xml?

+1 to fix the NPE. But this could be done at one point for all filter plugins (scoring/url/parse/indexing). Attached a new patch which tries to "centralize" the code to load filter plugins in an order defined by a property.
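Sebastian's point that plugin.includes only filters plugins, and therefore carries no ordering information, can be checked with a small Java program. This is a sketch under assumptions: the class `PluginIncludesRegexDemo`, the method `select`, and the list of installed plugin ids are made up for illustration, and full-match semantics (`Matcher.matches()`) are assumed for how the expression is applied to a plugin id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo; the plugin id lists below are illustrative only.
public class PluginIncludesRegexDemo {

  /** Returns the plugin ids fully matched by the regex, in the order of the input list. */
  public static List<String> select(String regex, List<String> pluginIds) {
    Pattern pattern = Pattern.compile(regex);
    List<String> selected = new ArrayList<>();
    for (String id : pluginIds) {
      if (pattern.matcher(id).matches()) {
        selected.add(id);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    String simple = "scoring-(depth|opic)";
    String obscure = "scoring-(d[Ee]pth|.p.c)";

    List<String> installed = Arrays.asList("scoring-depth", "scoring-link", "scoring-opic");
    // Both expressions select the same plugins; the result order comes from
    // the installed list, not from the order of alternatives in the regex.
    System.out.println(select(simple, installed));
    System.out.println(select(obscure, installed));

    // Once a scoring-apoc plugin exists, the two expressions diverge.
    List<String> withApoc =
        Arrays.asList("scoring-apoc", "scoring-depth", "scoring-link", "scoring-opic");
    System.out.println(select(simple, withApoc));
    System.out.println(select(obscure, withApoc));
  }
}
```

Because the second expression contains wildcards, there is no fixed list of alternatives to read an order from, which is why deriving an ordering from the regular expression is impractical.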
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786239#comment-13786239 ]

Markus Jelsma commented on NUTCH-1562:
--------------------------------------

Looks fine! +1

> Order of execution for scoring filters
> --------------------------------------
>
>             Fix For: 2.3, 1.8
>         Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786026#comment-13786026 ]

Julien Nioche commented on NUTCH-1562:
--------------------------------------

Will commit early next week unless someone has any objections

> Order of execution for scoring filters
> --------------------------------------
>
>             Fix For: 2.3, 1.8
>         Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637532#comment-13637532 ]

Julien Nioche commented on NUTCH-1562:
--------------------------------------

Hi guys

Lewis : cat conf/nutch-default.xml | grep 'order
indexingfilter.order
urlnormalizer.order
htmlparsefilter.order
urlfilter.order
scoring.filter.order

Lufeng : it is a prerequisite that the corresponding plugins are listed in plugin.includes. We could indeed log a friendly error message instead of hitting an NPE

> Order of execution for scoring filters
> --------------------------------------
>
>             Fix For: 1.7, 2.2
>         Attachments: NUTCH-1562-trunk.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
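The friendly error suggested above could take roughly the following shape. This is a sketch only, not the fix that was committed: `OrderValidationSketch`, `missingFilters`, `checkOrder`, and the `example.*` class names are hypothetical, and the sketch throws an exception with a descriptive message where real code might prefer to log a warning and skip the missing filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of validating an order property against the loaded filters.
public class OrderValidationSketch {

  /** Returns the class names listed in the order property that have no loaded filter. */
  public static List<String> missingFilters(String orderProperty, Map<String, Object> loadedFilters) {
    List<String> missing = new ArrayList<>();
    if (orderProperty == null || orderProperty.trim().isEmpty()) {
      return missing; // nothing configured, nothing to validate
    }
    for (String name : orderProperty.trim().split("\\s+")) {
      if (!loadedFilters.containsKey(name)) {
        missing.add(name);
      }
    }
    return missing;
  }

  /** Fails early with a readable message instead of a later NullPointerException. */
  public static void checkOrder(String propertyName, String orderProperty,
      Map<String, Object> loadedFilters) {
    List<String> missing = missingFilters(orderProperty, loadedFilters);
    if (!missing.isEmpty()) {
      throw new IllegalStateException(propertyName
          + " lists filters that are not provided by any plugin in plugin.includes: " + missing);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> loaded = new LinkedHashMap<>();
    loaded.put("example.OpicFilter", new Object());
    try {
      checkOrder("scoring.filter.order", "example.OpicFilter example.DepthFilter", loaded);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Checking the configured names once, when the filters are loaded, surfaces the misconfiguration before any job starts calling the filters.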
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637247#comment-13637247 ]

lufeng commented on NUTCH-1562:
-------------------------------

Hi Julien,

If someone defines scoring.filter.order with, for example, the opic and depth filters, but these filters are not included in the plugin.includes property (perhaps they were forgotten), it will throw an exception like this:

{code:java}
java.lang.NullPointerException
    at org.apache.nutch.scoring.ScoringFilters.injectedScore(ScoringFilters.java:112)
    at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:164)
    at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:63)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-04-20 21:19:10,983 ERROR crawl.Injector - Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Injector.run(Injector.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:308)
{code}

Should we consider this situation or not?

> Order of execution for scoring filters
> --------------------------------------
>
>             Fix For: 1.7, 2.2
>         Attachments: NUTCH-1562-trunk.patch
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637058#comment-13637058 ]

Lewis John McGibbney commented on NUTCH-1562:
---------------------------------------------

Hi Julien. Good one.
Can you suggest the other *order params?
I think we should cook up a 2.x patch (I will try) as this is a good catch and most certainly an improvement.

> Order of execution for scoring filters
> --------------------------------------
>
>             Fix For: 1.7, 2.2
>         Attachments: NUTCH-1562-trunk.patch