lewismc commented on PR #888:
URL: https://github.com/apache/nutch/pull/888#issuecomment-3941515717
Hi @sebastian-nagel, this PR ended up massive!
I addressed the conflict and propose a solution for the performance
regression you identified (~20% longer runtime) when generating fetch lists
from large CrawlDbs without HostDb configured. The regression was caused by:
1. `HostDatum` serialization overhead: Every `SelectorEntry` was serializing
a full `HostDatum` object, even when HostDb was not used
2. `FloatTextPair` comparison overhead: Using `FloatTextPair` composite keys
requires object deserialization during sorting, whereas `FloatWritable` uses
optimized raw byte comparison
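
To illustrate point 2, here is a minimal sketch of what "raw byte comparison" means for a float key: the sort compares the serialized bytes in place instead of deserializing a key object for every comparison. This is not the Nutch or Hadoop source; `RawFloatCompareSketch` and its methods are hypothetical names, and the sketch assumes the key is a 4-byte big-endian float, which is what `DataOutput.writeFloat` produces.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration only; not the comparator used in the PR. */
class RawFloatCompareSketch {

  /** Reads a big-endian int from the buffer without allocating any object. */
  static int readInt(byte[] b, int off) {
    return ((b[off] & 0xff) << 24)
        | ((b[off + 1] & 0xff) << 16)
        | ((b[off + 2] & 0xff) << 8)
        | (b[off + 3] & 0xff);
  }

  /** Compares two serialized floats in decreasing order, straight from the bytes. */
  static int compareDecreasing(byte[] b1, int s1, byte[] b2, int s2) {
    float f1 = Float.intBitsToFloat(readInt(b1, s1));
    float f2 = Float.intBitsToFloat(readInt(b2, s2));
    // Arguments are swapped so that higher scores sort first.
    return Float.compare(f2, f1);
  }

  /** Helper for the example: serializes a float as 4 big-endian bytes. */
  static byte[] serialize(float f) {
    return ByteBuffer.allocate(4).putFloat(f).array();
  }
}
```

A composite key such as `FloatTextPair` cannot take this shortcut unless it also registers a raw comparator, so each comparison pays for reading both the float and the text part into objects.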
In summary, the changes are as follows:
**Conditional HostDatum Serialization**
* Added `hasHostDatum` flag to `SelectorEntry` to make `HostDatum`
serialization optional
* When HostDb is not used, only a single boolean byte is written instead of
the full `HostDatum` structure
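
The idea can be sketched roughly as below. This is a simplified illustration, not the code in the PR: `HostDatumSketch` and `SelectorEntrySketch` are hypothetical stand-ins, the two `HostDatumSketch` fields are invented for the example, and the real `SelectorEntry` carries other fields that are omitted here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for HostDatum; the fields are illustrative only. */
class HostDatumSketch {
  float score;
  long fetched;

  void write(DataOutput out) throws IOException {
    out.writeFloat(score);
    out.writeLong(fetched);
  }

  void readFields(DataInput in) throws IOException {
    score = in.readFloat();
    fetched = in.readLong();
  }
}

/** Hypothetical stand-in for SelectorEntry, showing only the optional part. */
class SelectorEntrySketch {
  boolean hasHostDatum;
  HostDatumSketch hostDatum;

  void write(DataOutput out) throws IOException {
    out.writeBoolean(hasHostDatum); // always exactly one byte
    if (hasHostDatum) {
      hostDatum.write(out); // payload only when HostDb is in use
    }
  }

  void readFields(DataInput in) throws IOException {
    hasHostDatum = in.readBoolean();
    if (hasHostDatum) {
      hostDatum = new HostDatumSketch();
      hostDatum.readFields(in);
    } else {
      hostDatum = null;
    }
  }

  /** Helper for the example: serializes an entry to a byte array. */
  static byte[] toBytes(SelectorEntrySketch e) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      e.write(out);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return buf.toByteArray();
  }

  /** Helper for the example: reads an entry back from a byte array. */
  static SelectorEntrySketch fromBytes(byte[] bytes) {
    SelectorEntrySketch e = new SelectorEntrySketch();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      e.readFields(in);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return e;
  }
}
```

In this sketch an entry without a host datum serializes to a single byte, while one with a host datum adds the payload on top of the flag.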
**Dual-Path Job Configuration**
* Created separate MapReduce components for each code path:
  * With HostDb: `SelectorMapperWithHostDb`, `SelectorReducerWithHostDb`,
    `SelectorWithHostDb` using `FloatTextPair` keys and `ScoreHostKeyComparator`
    for secondary sorting
  * Without HostDb: `SelectorMapper`, `SelectorReducer`, `Selector` using
    `FloatWritable` keys and `DecreasingFloatComparator` for optimized raw byte
    comparison
* Modified the `generate()` method to conditionally configure the job based
on whether HostDb is provided
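
As a rough picture of the branching, the sketch below selects one of the two component sets depending on whether a HostDb path was supplied. It is deliberately free of Hadoop dependencies: `SelectorJobPlan` and `choose()` are hypothetical, the component names are held as plain strings taken from the list above, and the real `generate()` method wires the actual classes into the job rather than returning a holder like this.

```java
/** Hypothetical holder for the components one code path would configure. */
class SelectorJobPlan {
  final String mapper;
  final String reducer;
  final String selector;
  final String keyClass;
  final String sortComparator;

  SelectorJobPlan(String mapper, String reducer, String selector,
      String keyClass, String sortComparator) {
    this.mapper = mapper;
    this.reducer = reducer;
    this.selector = selector;
    this.keyClass = keyClass;
    this.sortComparator = sortComparator;
  }

  /** Picks the HostDb path only when a non-empty HostDb location is given. */
  static SelectorJobPlan choose(String hostDbPath) {
    boolean useHostDb = hostDbPath != null && !hostDbPath.isEmpty();
    if (useHostDb) {
      // Composite keys plus a comparator that supports secondary sorting.
      return new SelectorJobPlan("SelectorMapperWithHostDb",
          "SelectorReducerWithHostDb", "SelectorWithHostDb",
          "FloatTextPair", "ScoreHostKeyComparator");
    }
    // Plain float keys, sorted by raw byte comparison.
    return new SelectorJobPlan("SelectorMapper", "SelectorReducer",
        "Selector", "FloatWritable", "DecreasingFloatComparator");
  }
}
```

Keeping the two sets fully separate means the path without HostDb never touches the composite key type at all.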
If you could please test and profile again it would be greatly appreciated.
When `generate.hostdb` is not configured, the Generator uses the original
optimized code path. There are no changes to configuration properties or the
command-line interface, so existing Generator workflows remain unaffected,
which is very handy.