[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787361#comment-13787361
 ] 

Sebastian Nagel commented on NUTCH-1562:
----------------------------------------

Hi Julien,
originally, this issue was only about applying scoring filters in the "order 
defined in plugin-includes and plugin-excludes". Is this even possible? It 
seems that the order of filter plugins does not depend on how "plugin.includes" 
is written: the order is stable but "random". The property "plugin.includes" is 
a regular expression that is only used to filter plugins. Unrolling a regex 
into an ordered list is not simple, and sometimes almost impossible, because 
both {{scoring-(depth|opic)}} and {{scoring-(d[Ee]pth|.p.c)}} are valid and 
cause exactly the same plugins to be loaded (until you start implementing a 
{{scoring-apoc}} plugin). Maybe we should simply fix the description in 
nutch-default.xml?
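
To make the regex point concrete, here is a minimal sketch. It assumes the 
pattern is matched against the whole plugin id; the class name, the helper 
{{select}} and the list of installed plugin ids are hypothetical and only 
serve the illustration, this is not the Nutch plugin loader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesDemo {

    /**
     * Returns the ids selected by a plugin.includes-style regex.
     * The result follows the order of the input list, not any order
     * that could be read off the regex itself.
     */
    static List<String> select(String regex, List<String> installedIds) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String id : installedIds) {
            // Assumption: the whole plugin id must match the pattern.
            if (pattern.matcher(id).matches()) {
                selected.add(id);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String plain = "scoring-(depth|opic)";
        String fuzzy = "scoring-(d[Ee]pth|.p.c)";

        // Hypothetical set of installed plugins.
        List<String> installed =
            Arrays.asList("scoring-opic", "scoring-link", "scoring-depth");
        System.out.println(select(plain, installed));
        System.out.println(select(fuzzy, installed));

        // Once a (hypothetical) scoring-apoc plugin exists, the second
        // regex also selects it, while the first one does not.
        List<String> withApoc = Arrays.asList(
            "scoring-opic", "scoring-link", "scoring-depth", "scoring-apoc");
        System.out.println(select(plain, withApoc));
        System.out.println(select(fuzzy, withApoc));
    }
}
```

Both patterns select the same two plugins from the first list, and neither 
says anything about which of the two should run first.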

+1 to fix the NPE. But this could be done in one place for all filter plugins 
(scoring/url/parse/indexing). I have attached a new patch which tries to 
"centralize" the code that loads filter plugins in an order defined by a 
property.
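
Roughly, the idea of a single place that orders any kind of filter by a 
property could look like the sketch below. This is not the attached patch: 
the class {{FilterOrder}}, the method {{order}} and the map of loaded filters 
keyed by class name are hypothetical. The sketch assumes that an empty 
property means "keep whatever order the filters were loaded in", and that 
names matching no loaded filter are skipped so that no null entry ends up in 
the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FilterOrder {

    /**
     * Orders already loaded filters by a space separated list of
     * implementation class names, as in a "*.order" property.
     *
     * @param loaded        filters keyed by implementation class name
     * @param orderProperty value of the order property, may be null or empty
     */
    static <T> List<T> order(Map<String, T> loaded, String orderProperty) {
        // Assumption: no explicit order means "use all loaded filters
        // in the iteration order of the map".
        if (orderProperty == null || orderProperty.trim().isEmpty()) {
            return new ArrayList<>(loaded.values());
        }
        List<T> ordered = new ArrayList<>();
        for (String className : orderProperty.trim().split("\\s+")) {
            T filter = loaded.get(className);
            // Skip unknown class names instead of adding null.
            if (filter != null) {
                ordered.add(filter);
            }
        }
        return ordered;
    }
}
```

Because the method is generic, the same code could serve scoring, URL, parse 
and indexing filters; only the property name and the map of loaded filters 
would differ.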

> Order of execution for scoring filters
> --------------------------------------
>
>                 Key: NUTCH-1562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1562
>             Project: Nutch
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.6, 2.1
>            Reporter: Julien Nioche
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2, 
> NUTCH-1562-trunk.patch.v3
>
>
> The documentation in nutch-default.xml states that:
> {quote}
> <property>
>   <name>scoring.filter.order</name>
>   <value></value>
>   <description>The order in which scoring filters are applied.
>   This may be left empty (in which case all available scoring
>   filters will be applied in the order defined in plugin-includes
>   and plugin-excludes), or a space separated list of implementation
>   classes.
>   </description>
> </property>
> {quote}
> However, if no order is specified, the filters are ordered randomly and not 
> in the order defined in plugin-includes.
> The other *order parameters (e.g. urlfilter.order) are documented 
> differently: they "are loaded and applied in system defined order", which 
> corresponds to what the code does.
> The patch attached is for 1.x and puts the code in accordance with the 
> documentation by ordering the filters according to the order of the plugins, 
> which gives users more control without having to specify the classes 
> explicitly in scoring.filter.order.
> We could extend the same idea to the other *order params.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
