[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788059#comment-13788059
 ] 

Hudson commented on NUTCH-1562:
---

SUCCESS: Integrated in Nutch-trunk #2380 (See 
[https://builds.apache.org/job/Nutch-trunk/2380/])
NUTCH-1562 (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1529813)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java
* /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java


> Order of execution for scoring filters
> --
>
> Key: NUTCH-1562
> URL: https://issues.apache.org/jira/browse/NUTCH-1562
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.6, 2.1
> Reporter: Julien Nioche
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2, 
> NUTCH-1562-trunk.patch.v3
>
>
> The documentation in nutch-default.xml states that :
> {quote}
> <property>
>   <name>scoring.filter.order</name>
>   <value></value>
>   <description>The order in which scoring filters are applied.
>   This may be left empty (in which case all available scoring
>   filters will be applied in the order defined in plugin-includes
>   and plugin-excludes), or a space separated list of implementation
>   classes.
>   </description>
> </property>
> {quote}
> However, if no order is specified, the filters are ordered randomly and not in 
> the order defined in plugin-includes.
> The other *order parameters (e.g. urlfilter.order) have different 
> documentation, which says they "are loaded and applied in system defined 
> order", and that corresponds to what the code does.
> The patch attached is for 1.x and puts the code in accordance with the 
> documentation by ordering the filters according to the order of the plugins, 
> which gives users more control without having to specify the classes 
> explicitly in scoring.filter.order.
> We could extend the same idea to the other *order params.
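
For illustration only, the behaviour described above can be sketched as
follows. This is not the attached patch: the class FilterOrderSketch, the
method orderFilters, the org.example class names, and the use of plain strings
as stand-ins for filter instances are all assumptions. The sketch applies a
whitespace-separated order property when one is given and otherwise falls back
to the order in which the filters were registered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterOrderSketch {

  /**
   * Returns the filters in the order given by orderProperty, a
   * whitespace-separated list of class names. If the property is null or
   * blank, the filters are returned in the iteration order of the map.
   */
  static <T> List<T> orderFilters(Map<String, T> loaded, String orderProperty) {
    if (orderProperty == null || orderProperty.trim().isEmpty()) {
      return new ArrayList<>(loaded.values());
    }
    List<T> ordered = new ArrayList<>();
    for (String className : orderProperty.trim().split("\\s+")) {
      T filter = loaded.get(className);
      if (filter != null) { // names that were not loaded are skipped in this sketch
        ordered.add(filter);
      }
    }
    return ordered;
  }

  public static void main(String[] args) {
    // Hypothetical filters, registered in a known order (LinkedHashMap keeps it).
    Map<String, String> loaded = new LinkedHashMap<>();
    loaded.put("org.example.DepthFilter", "depth");
    loaded.put("org.example.OpicFilter", "opic");

    // Empty property: registration order is used.
    System.out.println(orderFilters(loaded, ""));
    // Explicit property: the listed order wins.
    System.out.println(
        orderFilters(loaded, "org.example.OpicFilter org.example.DepthFilter"));
  }
}
```

A map with predictable iteration order is the key design choice here; with a
plain HashMap the fallback order would again be stable but effectively random.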



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788039#comment-13788039
 ] 

Julien Nioche commented on NUTCH-1562:
--

Hi Seb

You are right about the order from plugin.includes, this had completely passed 
me by. I really like your patch, it makes loads of sense to centralize that 
code and will make it simpler to address NUTCH-1606 for instance.

Will commit your patch shortly with a minor modification (getOrderedPlugins() 
is synchronized)

Thanks 



[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787361#comment-13787361
 ] 

Sebastian Nagel commented on NUTCH-1562:


Hi Julien,
originally, this issue was only about ordering of scoring filters in "order 
defined in plugin-includes and plugin-excludes". Is this ever possible? It 
seems that the order of filter plugins does not depend on how "plugin.includes" 
is written: the order is stable but "random". The property "plugin.includes" is 
a regular expression that is only used to filter plugins. Unrolling a regex 
into an ordered list is not simple, and sometimes almost impossible, because 
both {{scoring-(depth|opic)}} and {{scoring-(d[Ee]pth|.p.c)}} are valid and 
cause exactly the same plugins to be loaded (until you start implementing a 
{{scoring-apoc}} plugin). Maybe we should simply fix the description in 
nutch-default.xml?

+1 to fix the NPE. But this could be done at one point for all filter plugins 
(scoring/url/parse/indexing). Attached a new patch which tries to "centralize" 
the code to load filter plugins in an order defined by a property.
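
The point about the regular expression can be checked with a small sketch. It
assumes that a plugin id is selected when the whole id matches the expression;
the class PluginIncludesRegexSketch and the method select are hypothetical
names, and this is not how the plugin repository is actually implemented. Both
expressions select the same two plugins, and the result order comes from the
list of available ids, not from the way the expression is written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesRegexSketch {

  /** Returns the ids that fully match the regular expression, in input order. */
  static List<String> select(String regex, List<String> pluginIds) {
    Pattern pattern = Pattern.compile(regex);
    List<String> selected = new ArrayList<>();
    for (String id : pluginIds) {
      if (pattern.matcher(id).matches()) {
        selected.add(id);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> available = List.of("scoring-depth", "scoring-opic");

    // Two differently written expressions, same selection, same order.
    System.out.println(select("scoring-(depth|opic)", available));
    System.out.println(select("scoring-(d[Ee]pth|.p.c)", available));

    // With a hypothetical scoring-apoc plugin installed, the selections differ:
    // only the second expression also matches the new id.
    List<String> withApoc = List.of("scoring-depth", "scoring-opic", "scoring-apoc");
    System.out.println(select("scoring-(depth|opic)", withApoc));
    System.out.println(select("scoring-(d[Ee]pth|.p.c)", withApoc));
  }
}
```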



[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-04 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786239#comment-13786239
 ] 

Markus Jelsma commented on NUTCH-1562:
--

Looks fine! +1

> Order of execution for scoring filters
> --
>
> Key: NUTCH-1562
> URL: https://issues.apache.org/jira/browse/NUTCH-1562
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.6, 2.1
> Reporter: Julien Nioche
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2


[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-04 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786026#comment-13786026
 ] 

Julien Nioche commented on NUTCH-1562:
--

Will commit early next week unless someone has any objections

> Order of execution for scoring filters
> --
>
> Key: NUTCH-1562
> URL: https://issues.apache.org/jira/browse/NUTCH-1562
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.6, 2.1
> Reporter: Julien Nioche
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2


[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-04-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637532#comment-13637532
 ] 

Julien Nioche commented on NUTCH-1562:
--

Hi guys

Lewis : 

cat conf/nutch-default.xml | grep 'order<'
  <name>indexingfilter.order</name>
  <name>urlnormalizer.order</name>
  <name>htmlparsefilter.order</name>
  <name>urlfilter.order</name>
  <name>scoring.filter.order</name>

Lufeng : it is a prerequisite that the corresponding plugins are listed in 
plugin.includes. We could indeed log a friendly error message instead of 
hitting an NPE 
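
One way such a check could look is sketched below. It is an assumption for
illustration, not the committed fix: the class MissingFilterCheckSketch, the
method resolve, and the wording of the message are made up, and strings stand
in for filter instances. Instead of letting a null entry reach the caller, it
collects every class name from the order property that was not loaded and
reports them in a single exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MissingFilterCheckSketch {

  /**
   * Resolves the class names in orderProperty against the loaded filters.
   * Throws IllegalStateException naming every class that is listed in the
   * property but was not loaded, rather than returning null entries.
   */
  static <T> List<T> resolve(Map<String, T> loaded, String orderProperty) {
    List<T> ordered = new ArrayList<>();
    List<String> missing = new ArrayList<>();
    for (String className : orderProperty.trim().split("\\s+")) {
      if (className.isEmpty()) {
        continue; // blank property yields one empty token
      }
      T filter = loaded.get(className);
      if (filter == null) {
        missing.add(className);
      } else {
        ordered.add(filter);
      }
    }
    if (!missing.isEmpty()) {
      throw new IllegalStateException(
          "scoring.filter.order lists filters that are not loaded, "
              + "check plugin.includes: " + missing);
    }
    return ordered;
  }

  public static void main(String[] args) {
    // Only one hypothetical filter is loaded, but two are listed in the order.
    Map<String, String> loaded = Map.of("org.example.OpicFilter", "opic");
    try {
      resolve(loaded, "org.example.OpicFilter org.example.DepthFilter");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Whether to fail fast like this or only log a warning and skip the missing
filter is a design decision the sketch does not settle.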




> Order of execution for scoring filters
> --
>
> Key: NUTCH-1562
> URL: https://issues.apache.org/jira/browse/NUTCH-1562
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.6, 2.1
> Reporter: Julien Nioche
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1562-trunk.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-04-20 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637247#comment-13637247
 ] 

lufeng commented on NUTCH-1562:
---

Hi Julien, if someone defines scoring.filter.order with, say, the opic and 
depth filters, but these filters are not included in the plugin.includes 
property (perhaps they forgot), it will throw an exception like this: 

{code:java}
java.lang.NullPointerException
    at org.apache.nutch.scoring.ScoringFilters.injectedScore(ScoringFilters.java:112)
    at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:164)
    at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:63)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-04-20 21:19:10,983 ERROR crawl.Injector - Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Injector.run(Injector.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:308)
{code}

Should we consider this situation or not? 

> Order of execution for scoring filters
> --
>
> Key: NUTCH-1562
> URL: https://issues.apache.org/jira/browse/NUTCH-1562
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.6, 2.1
> Reporter: Julien Nioche
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1562-trunk.patch


[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-04-19 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637058#comment-13637058
 ] 

Lewis John McGibbney commented on NUTCH-1562:
-

Hi Julien. Good one. 
Can you suggest the other *order params?
I think we should cook up a 2.x patch (I will try) as this is a good catch and 
most certainly an improvement. 

> Order of execution for scoring filters
> --
>
> Key: NUTCH-1562
> URL: https://issues.apache.org/jira/browse/NUTCH-1562
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.6, 2.1
> Reporter: Julien Nioche
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1562-trunk.patch