sebastian-nagel commented on PR #888: URL: https://github.com/apache/nutch/pull/888#issuecomment-3876920546
> leave this out of 1.22 to permit time for more peer review.

Agreed. This is an important feature and makes running Generator with a HostDb scalable. But since Generator is one of the core parts, more testing is recommended.

There's a performance regression if Generator is used without a HostDb: runtime is about 20% longer when generating from a large CrawlDb.

1. The HostDatum in SelectorEntry is not optional and its serialization is not trivial. This is addressed in 7dcc2f4.
2. Deserializing and comparing floats from FloatTextPair when sorting has an even heavier impact. The [FloatWritable](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/FloatWritable.html) class appears to be optimized in this respect.

Eventually, we can use two Selector job definitions, depending on whether there is a HostDb or not.

[Async-profiler](https://github.com/async-profiler/async-profiler) flamegraphs:
- [generator.selector.nutch-2455.20260210102545.flamegraph.html](https://github.com/user-attachments/files/25208488/generator.selector.nutch-2455.20260210102545.flamegraph.html) (this PR)
- [generator.selector.20260210103253.flamegraph.html](https://github.com/user-attachments/files/25208521/generator.selector.20260210103253.flamegraph.html) (recent master, for comparison)
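
To illustrate the point about float comparison during sorting: Hadoop's FloatWritable registers a comparator that operates on the serialized bytes, so the sort does not need to deserialize a key object for every comparison. Below is a minimal, dependency-free sketch of that idea, not the code from this PR. The class `RawFloatCompareSketch` and its methods are hypothetical names chosen for illustration, and the sketch assumes the float score is written as the first 4 bytes of the key, in big-endian order as `DataOutput.writeFloat` does.

```java
import java.nio.ByteBuffer;

public class RawFloatCompareSketch {

    /** Serializes a float as 4 big-endian bytes (ByteBuffer's default order). */
    public static byte[] serialize(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Reads a big-endian float at the given offset without allocating any object. */
    public static float readFloat(byte[] bytes, int offset) {
        int bits = ((bytes[offset] & 0xff) << 24)
                 | ((bytes[offset + 1] & 0xff) << 16)
                 | ((bytes[offset + 2] & 0xff) << 8)
                 | (bytes[offset + 3] & 0xff);
        return Float.intBitsToFloat(bits);
    }

    /**
     * Compares two serialized keys by their leading float only.
     * s1 and s2 are the start offsets of each key in its buffer.
     */
    public static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Float.compare(readFloat(b1, s1), readFloat(b2, s2));
    }

    public static void main(String[] args) {
        byte[] low = serialize(0.25f);
        byte[] high = serialize(1.5f);
        // Negative result: the first key sorts before the second.
        System.out.println(compareRaw(low, 0, high, 0));
    }
}
```

A composite key such as a float plus a text field could apply the same approach by comparing the leading float on raw bytes first and falling back to the remaining bytes only when the floats are equal. Whether that recovers the 20% measured above would need to be confirmed by profiling.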

