Re: [PR] NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator [nutch]

via GitHub Tue, 10 Feb 2026 03:03:53 -0800


sebastian-nagel commented on PR #888:
URL: https://github.com/apache/nutch/pull/888#issuecomment-3876920546


   > leave this out of 1.22 to permit time for more peer review.
   
   Agreed.
   
   This is an important feature and makes running Generator with a HostDb 
scalable. But since Generator is one of the core parts, more testing is 
recommended.
   
   There's a performance regression if Generator is used without HostDb: about 20% longer runtime when generating from a large CrawlDb.
   1. The HostDatum in SelectorEntry is not optional and its serialization is not trivial. This is addressed in 7dcc2f4.
   2. Deserializing and comparing floats from FloatTextPair when sorting has an even heavier impact. The [FloatWritable](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/FloatWritable.html) class appears to be optimized in this respect. Eventually, we could use two Selector job definitions, depending on whether there is a HostDb or not.
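
To illustrate point 2: a minimal sketch of what comparing serialized floats without deserializing a composite key could look like. It uses only the JDK. The class and method names are hypothetical, and it is neither the code in this PR nor Hadoop's implementation. It assumes the score is written as a 4-byte big-endian float (the layout of `DataOutput.writeFloat`) at a known offset of the key, and that higher scores should sort first.

```java
// Hypothetical sketch, JDK only: not Nutch or Hadoop code.
final class RawFloatCompareSketch {

    /** Writes a float as 4 big-endian bytes, the layout DataOutput.writeFloat produces. */
    static byte[] serialize(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
        };
    }

    /** Reads a float from 4 big-endian bytes at the given offset, without allocating objects. */
    static float readFloat(byte[] buf, int offset) {
        int bits = ((buf[offset] & 0xff) << 24)
                 | ((buf[offset + 1] & 0xff) << 16)
                 | ((buf[offset + 2] & 0xff) << 8)
                 | (buf[offset + 3] & 0xff);
        return Float.intBitsToFloat(bits);
    }

    /** Compares two serialized floats in ascending order, directly on the byte buffers. */
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Float.compare(readFloat(b1, s1), readFloat(b2, s2));
    }

    /** Same comparison with the sign flipped, so higher scores sort first (assumed ordering). */
    static int compareRawDescending(byte[] b1, int s1, byte[] b2, int s2) {
        return -compareRaw(b1, s1, b2, s2);
    }

    public static void main(String[] args) {
        byte[] low = serialize(0.5f);
        byte[] high = serialize(1.5f);
        System.out.println(compareRaw(low, 0, high, 0));
        System.out.println(compareRawDescending(low, 0, high, 0));
    }
}
```

The point of the sketch is that the sort only touches four bytes per key and never instantiates a key object. Whether a composite key such as FloatTextPair can be compared this way depends on its serialized layout, which is not shown here.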
   
   [Async-profiler](https://github.com/async-profiler/async-profiler) flamegraphs:
   - [generator.selector.nutch-2455.20260210102545.flamegraph.html](https://github.com/user-attachments/files/25208488/generator.selector.nutch-2455.20260210102545.flamegraph.html) (this PR)
   - [generator.selector.20260210103253.flamegraph.html](https://github.com/user-attachments/files/25208521/generator.selector.20260210103253.flamegraph.html) (recent master, for comparison)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
