[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723956#comment-13723956 ]

Ferdy Galema commented on NUTCH-1457:
-------------------------------------

Hi,

Thanks for submitting the patch. It seems that the Nutch-2.2.1 patch can be applied to the current 2.x branch (with the command patch -p0 NUTCH-1457...). The changes all look valid. However, I haven't tested it by running test crawls. I will try to get it up and running in a while. Otherwise, if anyone else is able to do some testing and/or committing, feel free to do so.

Nutch2 Refactor the update process so that fetched items are only processed once
---------------------------------------------------------------------------------

Key: NUTCH-1457
URL: https://issues.apache.org/jira/browse/NUTCH-1457
Project: Nutch
Issue Type: Improvement
Reporter: Ferdy Galema
Fix For: 2.4
Attachments: CrawlStatus.java, DbUpdateReducer.java, GeneratorMapper.java, GeneratorReducer.java, NUTCH-1457(Nutch-2.1).patch, NUTCH-1457(Nutch-2.1)-src.zip, NUTCH-1457(Nutch-2.2.1).patch, NUTCH-1457(Nutch-2.2.1)-src.zip

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1618) Fetches some websites multiple times for long lasting queues
[ https://issues.apache.org/jira/browse/NUTCH-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1618:
---------------------------------------

Fix Version/s: (was: 2.1) 2.3

Fetches some websites multiple times for long lasting queues
------------------------------------------------------------

Key: NUTCH-1618
URL: https://issues.apache.org/jira/browse/NUTCH-1618
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 2.1, 2.2, 2.3, 2.4
Reporter: Talat UYARER
Priority: Minor
Fix For: 2.3
Attachments: NUTCH-1618.patch

We are using Nutch for high-volume crawls. We noticed that the FetcherJob reduce task fetches some websites multiple times for long-lasting queues. I discovered that the cause is the mapred.reduce.tasks.speculative.execution setting in Hadoop. Nutch 1.x has speculative execution turned off. I have created a patch for 2.x.
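The cause named in NUTCH-1618 is a single job setting, so a short illustration may help. The following is a minimal sketch, not the attached patch: it only shows the property from the report being forced to false before a job is submitted. The class and method names are hypothetical, and `java.util.Properties` stands in for Hadoop's `Configuration` so the sketch runs without Hadoop on the classpath. In a real job the same key would be set on the job's `Configuration`, for example with `setBoolean`.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for Hadoop's Configuration.
final class SpeculativeExecutionSketch {

    // Property name quoted in the NUTCH-1618 report.
    static final String REDUCE_SPECULATIVE =
            "mapred.reduce.tasks.speculative.execution";

    // Returns a copy of the job settings with reduce-side speculative
    // execution switched off; the input is left unchanged.
    static Properties withoutReduceSpeculation(Properties jobSettings) {
        Properties copy = new Properties();
        copy.putAll(jobSettings);
        copy.setProperty(REDUCE_SPECULATIVE, "false");
        return copy;
    }

    // Reads the flag back. An unset key is treated as "true", which
    // matches Hadoop's default for this property.
    static boolean reduceSpeculationEnabled(Properties settings) {
        return Boolean.parseBoolean(
                settings.getProperty(REDUCE_SPECULATIVE, "true"));
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        Properties fetcherSettings = withoutReduceSpeculation(defaults);
        System.out.println("before: " + reduceSpeculationEnabled(defaults));
        System.out.println("after:  " + reduceSpeculationEnabled(fetcherSettings));
    }
}
```

With speculation enabled, Hadoop may launch a second attempt of a slow reduce task. Since the fetcher does its fetching in the reduce phase, a duplicate attempt would fetch the same queue again, which is consistent with the behaviour described in the report.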
[jira] [Updated] (NUTCH-1611) Elastic Search Indexer Creates field in elastic search boost as a string value, so cannot be used in custom boost queries
[ https://issues.apache.org/jira/browse/NUTCH-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1611:
---------------------------------------

Fix Version/s: (was: 2.2) 2.3

Elastic Search Indexer Creates field in elastic search boost as a string value, so cannot be used in custom boost queries
---------------------------------------------------------------------------------------------------------------------------

Key: NUTCH-1611
URL: https://issues.apache.org/jira/browse/NUTCH-1611
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.2.1
Environment: All
Reporter: Nicholas Waltham
Fix For: 2.3, 1.8

Ordinarily, one can use a boost field in a custom_score query in Elasticsearch to affect the ranking, and Nutch creates such a field. However, it is stored in Elasticsearch as a string, so it cannot be used. An attempt to use the boost field in a query therefore produces the following error:

PropertyAccessException[[Error: could not access: floatValue; in class: org.elasticsearch.index.field.data.strings.StringDocFieldData]\n[Near : {... _score + (1 * doc.boost.floatValue / 100) }]

Example test query:

    {
      "query" : {
        "custom_score" : {
          "query" : { "query_string" : { "query" : "something" } },
          "script" : "_score + (1 * doc.boost.doubleValue / 100)"
        }
      }
    }
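To illustrate the string-versus-number distinction reported in NUTCH-1611, here is a minimal sketch that converts the boost value to a Float before the document is handed to an indexer. This is not the project's fix, which the report does not describe. The class name, the method name, and the use of a plain map as the document are assumptions made for illustration, as is the premise that sending a numeric value lets Elasticsearch map the field as a number.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for the indexed document.
final class BoostFieldSketch {

    // Copies the raw string fields and replaces "boost" with a Float,
    // so a JSON serializer would emit 1.5 rather than "1.5".
    // Throws NumberFormatException if the boost is not a valid number.
    static Map<String, Object> withNumericBoost(Map<String, String> rawFields) {
        Map<String, Object> doc = new LinkedHashMap<>(rawFields);
        String boost = rawFields.get("boost");
        if (boost != null) {
            doc.put("boost", Float.parseFloat(boost.trim()));
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("title", "example page");
        raw.put("boost", "1.5");
        Map<String, Object> doc = withNumericBoost(raw);
        // Show the runtime type of each field value.
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            System.out.println(e.getKey() + " -> "
                    + e.getValue().getClass().getSimpleName());
        }
    }
}
```

The error in the report comes from a script asking a string-typed field for a numeric accessor, so any approach that keeps the value numeric up to the point of indexing should avoid it. An index that already maps boost as a string would presumably still need to be recreated or remapped.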