[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057546#comment-18057546
 ] 

ASF GitHub Bot commented on NUTCH-2455:
---------------------------------------

sebastian-nagel commented on PR #888:
URL: https://github.com/apache/nutch/pull/888#issuecomment-3876920546

   > leave this out of 1.22 to permit time for more peer review.
   
   Agreed.
   
   This is an important feature and makes running Generator with a HostDb 
scalable. But since Generator is one of the core parts, more testing is 
recommended.
   
   There's a performance regression when Generator is used without a HostDb: 
about 20% longer runtime when generating from a large CrawlDb. 
   1. The HostDatum in SelectorEntry is not optional and its serialization is 
not trivial. This is addressed in 7dcc2f4.
   2. Deserializing and comparing floats from FloatTextPair when sorting has an 
even heavier impact. The [FloatWritable](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/FloatWritable.html) 
class appears to be optimized in this respect. Eventually, we could use two 
Selector job definitions, depending on whether a HostDb is used or not.
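
To illustrate point 2, here is a minimal sketch of comparing sort keys directly on their serialized bytes instead of deserializing whole key objects first. It is not code from the PR: the class `RawFloatCompare`, its methods, and the record layout (a 4-byte big-endian float followed by UTF-8 URL bytes) are assumptions made for illustration, and the actual FloatTextPair layout may differ. Hadoop's `FloatWritable` ships a nested raw comparator that works along these lines.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RawFloatCompare {

    /** Assumed layout (hypothetical): 4-byte big-endian float, then UTF-8 URL bytes. */
    static byte[] serialize(float score, String url) {
        byte[] u = url.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + u.length).putFloat(score).put(u).array();
    }

    /** Decodes the float at offset off without creating any key object. */
    static float readFloat(byte[] b, int off) {
        int bits = ((b[off] & 0xff) << 24)
                 | ((b[off + 1] & 0xff) << 16)
                 | ((b[off + 2] & 0xff) << 8)
                 |  (b[off + 3] & 0xff);
        return Float.intBitsToFloat(bits);
    }

    /** Orders two serialized records by their leading score, highest first. */
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Float.compare(readFloat(b2, s2), readFloat(b1, s1));
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        records.add(serialize(0.2f, "http://a.example/low"));
        records.add(serialize(0.9f, "http://a.example/high"));
        records.add(serialize(0.5f, "http://b.example/mid"));
        // Sort on raw bytes only; the URL part is never decoded.
        records.sort((x, y) -> compareRaw(x, 0, y, 0));
        for (byte[] r : records) {
            System.out.println(readFloat(r, 0));
        }
    }
}
```

The comparison only touches the first four bytes of each record, so the cost per comparison does not depend on the length of the text part.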
   
   [Async-profiler](https://github.com/async-profiler/async-profiler) 
flamegraphs:
   - [generator.selector.nutch-2455.20260210102545.flamegraph.html](https://github.com/user-attachments/files/25208488/generator.selector.nutch-2455.20260210102545.flamegraph.html) (this PR)
   - [generator.selector.20260210103253.flamegraph.html](https://github.com/user-attachments/files/25208521/generator.selector.20260210103253.flamegraph.html) (recent master, for comparison)
   

> Use secondary sorting for memory-efficient HostDb integration in Generator
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-2455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2455
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.22
>
>         Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??
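
The approach quoted above can be sketched in plain Java without any Hadoop dependency. This is a simulation under stated assumptions, not the implementation in the PR: the class `SecondarySortSketch`, the `Key` fields, and the string payloads are all hypothetical. In a real MapReduce job the three pieces would correspond to the partitioner, the sort comparator, and the grouping comparator configured on the job.

```java
import java.util.ArrayList;
import java.util.List;

class SecondarySortSketch {

    /** Hypothetical composite key: host plus score, with a flag for the HostDb record. */
    static final class Key {
        final String host;
        final float score;
        final boolean hostDbEntry;
        final String payload;

        Key(String host, float score, boolean hostDbEntry, String payload) {
            this.host = host;
            this.score = score;
            this.hostDbEntry = hostDbEntry;
            this.payload = payload;
        }
    }

    /** Partition on the host only, so all keys of one host reach the same reducer. */
    static int partition(Key k, int numReducers) {
        return (k.host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sort order: by host, then the HostDb entry first, then score descending. */
    static int compare(Key a, Key b) {
        int c = a.host.compareTo(b.host);
        if (c != 0) return c;
        c = Boolean.compare(b.hostDbEntry, a.hostDbEntry);
        if (c != 0) return c;
        return Float.compare(b.score, a.score);
    }

    /** Simulates one reducer: groups by host and streams entries in sorted order. */
    static List<String> reduce(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SecondarySortSketch::compare);
        List<String> out = new ArrayList<>();
        String currentHost = null;
        String hostInfo = null;
        for (Key k : sorted) {
            if (!k.host.equals(currentHost)) {
                // New group: forget the previous host's HostDb record.
                currentHost = k.host;
                hostInfo = null;
            }
            if (k.hostDbEntry) {
                // Arrives before any CrawlDb entry of the same host.
                hostInfo = k.payload;
                continue;
            }
            out.add(k.payload + " [" + (hostInfo == null ? "no HostDb entry" : hostInfo) + "]");
        }
        return out;
    }
}
```

Because the HostDb record is guaranteed to arrive first within each host group, the reducer only keeps one host record in memory at a time and never has to buffer CrawlDb entries while waiting for it.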



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
