[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-30 Thread Ferdy Galema (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723956#comment-13723956
 ] 

Ferdy Galema commented on NUTCH-1457:
-

Hi,

Thanks for submitting the patch. It seems that the Nutch-2.2.1 patch can be 
applied to the current 2.x branch (with the command patch -p0 < NUTCH-1457...).

The changes all look valid. However, I haven't tested them by running test 
crawls. I will try to get it up and running in a while. Otherwise, if anyone 
else is able to do some testing and/or committing, feel free to do so.

 Nutch2 Refactor the update process so that fetched items are only processed 
 once
 

 Key: NUTCH-1457
 URL: https://issues.apache.org/jira/browse/NUTCH-1457
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Fix For: 2.4

 Attachments: CrawlStatus.java, DbUpdateReducer.java, 
 GeneratorMapper.java, GeneratorReducer.java, NUTCH-1457(Nutch-2.1).patch, 
 NUTCH-1457(Nutch-2.1)-src.zip, NUTCH-1457(Nutch-2.2.1).patch, 
 NUTCH-1457(Nutch-2.2.1)-src.zip




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1618) Fetches some websites multiple times for long lasting queues

2013-07-30 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1618:


Fix Version/s: (was: 2.1)
   2.3

 Fetches some websites multiple times for long lasting queues
 

 Key: NUTCH-1618
 URL: https://issues.apache.org/jira/browse/NUTCH-1618
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.1, 2.2, 2.3, 2.4
Reporter: Talat UYARER
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1618.patch


 We are using Nutch for high-volume crawls. We noticed that the FetcherJob 
 ReduceTask fetches some websites multiple times for long-lasting queues. I 
 have discovered that the cause is the 
 mapred.reduce.tasks.speculative.execution setting in Hadoop. Nutch 1.x has 
 speculative execution turned off. I created a patch for 2.x.
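
 To illustrate why speculative execution can cause duplicate fetches, here is a 
 minimal toy model in Python. It is only a sketch and not Nutch or Hadoop code: 
 the names run_reduce_attempt and simulate are hypothetical, and it assumes 
 that a speculative backup attempt simply re-runs the same reduce task, 
 repeating its side effects (the fetches).

```python
from collections import Counter


def run_reduce_attempt(urls, fetch_log):
    """One attempt of a (hypothetical) fetching reduce task.

    Each URL handled here stands for a real HTTP fetch, i.e. a side
    effect that the framework cannot undo if the attempt is discarded.
    """
    for url in urls:
        fetch_log[url] += 1


def simulate(urls, speculative):
    """Count fetches per URL with or without a speculative backup attempt."""
    fetch_log = Counter()
    run_reduce_attempt(urls, fetch_log)      # original attempt
    if speculative:
        # Assumption: the framework starts a backup attempt for the slow
        # (long-lasting) task, which fetches the same URLs again.
        run_reduce_attempt(urls, fetch_log)
    return fetch_log


if __name__ == "__main__":
    queue = ["http://example.com/a", "http://example.com/b"]
    print("speculative on: ", dict(simulate(queue, speculative=True)))
    print("speculative off:", dict(simulate(queue, speculative=False)))
```

 In this model, setting mapred.reduce.tasks.speculative.execution to false 
 corresponds to calling simulate with speculative=False, so every URL is 
 fetched by a single attempt only.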



[jira] [Updated] (NUTCH-1611) Elastic Search Indexer Creates field in elastic search boost as a string value, so cannot be used in custom boost queries

2013-07-30 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1611:


Fix Version/s: (was: 2.2)
   2.3

 Elastic Search Indexer Creates field in elastic search boost as a string 
 value, so cannot be used in custom boost queries
 ---

 Key: NUTCH-1611
 URL: https://issues.apache.org/jira/browse/NUTCH-1611
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.2.1
 Environment: All
Reporter: Nicholas Waltham
 Fix For: 2.3, 1.8


 Ordinarily, one can use a boost field in a custom_score query in 
 Elasticsearch to affect the ranking, and Nutch creates such a field. However, 
 it is stored in Elasticsearch as a string, so it cannot be used. An attempt 
 to use the boost field in a query therefore produces the following error:
  PropertyAccessException[[Error: could not access: floatValue; in class: 
 org.elasticsearch.index.field.data.strings.StringDocFieldData]\n[Near : {... 
 _score + (1 * doc.boost.floatValue / 100) }]   
 Example test query:
 {
   "query": {
     "custom_score": {
       "query": {
         "query_string": {
           "query": "something"
         }
       },
       "script": "_score + (1 * doc.boost.doubleValue / 100)"
     }
   }
 }
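
 The difference between a string boost and a numeric boost can be sketched in 
 plain Python. This is an analogy only, not Elasticsearch or Nutch code: 
 build_index_doc and custom_score are hypothetical helpers, and the Python 
 TypeError merely stands in for the PropertyAccessException shown above. The 
 arithmetic mirrors the script in the example query.

```python
import json


def build_index_doc(url, boost, as_number=True):
    """Hypothetical helper: build the JSON body for one indexed page."""
    return {"url": url, "boost": float(boost) if as_number else str(boost)}


def custom_score(score, doc):
    """Same arithmetic as the script in the example query."""
    return score + (1 * doc["boost"] / 100)


string_doc = build_index_doc("http://example.com/", 1.5, as_number=False)
number_doc = build_index_doc("http://example.com/", 1.5)

if __name__ == "__main__":
    print(json.dumps(string_doc))   # boost serialized as a JSON string
    print(json.dumps(number_doc))   # boost serialized as a JSON number
    print(custom_score(2.0, number_doc))
    try:
        custom_score(2.0, string_doc)
    except TypeError as exc:
        # A string boost cannot take part in the arithmetic.
        print("string boost rejected:", exc)
```

 Under these assumptions, only the document whose boost is serialized as a 
 number can be used in the score arithmetic, which is the behavior the report 
 asks for.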
