lewismc commented on PR #888:
URL: https://github.com/apache/nutch/pull/888#issuecomment-3941515717

   Hi @sebastian-nagel, this PR ended up massive!
   
   I resolved the merge conflict and am proposing a fix for the performance 
regression you identified (~20% longer runtime) when generating fetch lists 
from large CrawlDbs without HostDb configured. The regression had two causes:
   
   1. `HostDatum` serialization overhead: every `SelectorEntry` serialized 
a full `HostDatum` object, even when HostDb was not used
   2. `FloatTextPair` comparison overhead: `FloatTextPair` composite keys 
require object deserialization during sorting, whereas `FloatWritable` uses 
optimized raw byte comparison
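   
   The difference behind cause 2 can be illustrated with a minimal, 
Hadoop-free sketch. This is not the Nutch code: the byte layout, all class and 
method names, and the host tie-break order are assumptions made purely for 
illustration. The first comparator reads the score directly from the 
serialized bytes, while the second has to rebuild a (score, host) object for 
each side before it can compare anything.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: every name and the byte layout are assumptions, not Nutch code.
final class RawCompareSketch {

    // Hypothetical stand-in for a composite (score, host) key.
    static final class ScoreHost {
        final float score;
        final String host;

        ScoreHost(float score, String host) {
            this.score = score;
            this.host = host;
        }
    }

    // A bare score serialized as 4 big-endian bytes.
    static byte[] encodeScore(float score) {
        return ByteBuffer.allocate(4).putFloat(score).array();
    }

    // A composite key: 4-byte score, 4-byte host length, UTF-8 host bytes.
    static byte[] encodePair(float score, String host) {
        byte[] h = host.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(8 + h.length)
                .putFloat(score).putInt(h.length).put(h).array();
    }

    // Decode a big-endian float in place, without allocating anything.
    static float readFloat(byte[] buf, int off) {
        int bits = ((buf[off] & 0xff) << 24) | ((buf[off + 1] & 0xff) << 16)
                 | ((buf[off + 2] & 0xff) << 8) | (buf[off + 3] & 0xff);
        return Float.intBitsToFloat(bits);
    }

    // Raw path: compare scores straight from the buffers, highest score first.
    static int compareScoresRaw(byte[] a, byte[] b) {
        return Float.compare(readFloat(b, 0), readFloat(a, 0));
    }

    // Object path, step 1: rebuild a key object from its serialized form.
    static ScoreHost decodePair(byte[] buf) {
        ByteBuffer bb = ByteBuffer.wrap(buf);
        float score = bb.getFloat();
        byte[] h = new byte[bb.getInt()];
        bb.get(h);
        return new ScoreHost(score, new String(h, StandardCharsets.UTF_8));
    }

    // Object path, step 2: compare the rebuilt objects.
    static int comparePairsDeserialized(byte[] a, byte[] b) {
        ScoreHost x = decodePair(a);
        ScoreHost y = decodePair(b);
        int byScore = Float.compare(y.score, x.score); // highest score first
        // Assumed tie-break: ascending host name.
        return byScore != 0 ? byScore : x.host.compareTo(y.host);
    }

    public static void main(String[] args) {
        int raw = compareScoresRaw(encodeScore(0.9f), encodeScore(0.1f));
        int obj = comparePairsDeserialized(
                encodePair(0.5f, "a.org"), encodePair(0.5f, "b.org"));
        System.out.println("raw comparison sorts 0.9 before 0.1: " + (raw < 0));
        System.out.println("pair comparison sorts a.org before b.org: " + (obj < 0));
    }
}
```

   In a sort over a large CrawlDb the comparator runs many times per record, 
so the two allocations per call in the object path add up.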
   
   In summary, the changes are as follows:
   
   **Conditional HostDatum Serialization**
   * Added `hasHostDatum` flag to `SelectorEntry` to make `HostDatum` 
serialization optional
   * When HostDb is not used, only a single boolean byte is written instead of 
the full `HostDatum` structure
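   
   A minimal sketch of the flag-guarded serialization pattern described above, 
using only `java.io`. `HostDatumStub`, `EntrySketch`, and their fields are 
hypothetical stand-ins, not the real `HostDatum` or `SelectorEntry` classes, 
so the byte counts it prints say nothing about the actual savings in Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: stand-in classes with made-up fields, not Nutch code.
final class ConditionalHostDatumSketch {

    // Hypothetical stand-in for HostDatum (4-byte float + 8-byte long).
    static final class HostDatumStub {
        float score;
        long lastCheck;

        void write(DataOutput out) throws IOException {
            out.writeFloat(score);
            out.writeLong(lastCheck);
        }

        void readFields(DataInput in) throws IOException {
            score = in.readFloat();
            lastCheck = in.readLong();
        }
    }

    // Hypothetical stand-in for SelectorEntry.
    static final class EntrySketch {
        String url = "";
        boolean hasHostDatum;
        HostDatumStub hostDatum = new HostDatumStub();

        void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeBoolean(hasHostDatum);   // always one byte
            if (hasHostDatum) {
                hostDatum.write(out);         // payload only when present
            }
        }

        void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            hasHostDatum = in.readBoolean();
            if (hasHostDatum) {
                hostDatum.readFields(in);
            } else {
                hostDatum = new HostDatumStub(); // clear state left from a reused instance
            }
        }
    }

    static EntrySketch entry(String url, boolean withHostDatum) {
        EntrySketch e = new EntrySketch();
        e.url = url;
        e.hasHostDatum = withHostDatum;
        if (withHostDatum) {
            e.hostDatum.score = 1.5f;
            e.hostDatum.lastCheck = 42L;
        }
        return e;
    }

    static byte[] serialize(EntrySketch e) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            e.write(out);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    static EntrySketch deserialize(byte[] data) {
        EntrySketch e = new EntrySketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            e.readFields(in);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return e;
    }

    public static void main(String[] args) {
        byte[] without = serialize(entry("http://example.org/", false));
        byte[] withDatum = serialize(entry("http://example.org/", true));
        System.out.println("without HostDatum: " + without.length + " bytes");
        System.out.println("with HostDatum:    " + withDatum.length + " bytes");
    }
}
```

   The reader must consume the flag before deciding whether to read the 
payload, so both sides stay in step regardless of which path wrote the record.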
   
   **Dual-Path Job Configuration**
   * Created separate MapReduce components for each code path:
     * With HostDb: `SelectorMapperWithHostDb`, `SelectorReducerWithHostDb`, 
`SelectorWithHostDb` using `FloatTextPair` keys and `ScoreHostKeyComparator` 
for secondary sorting
     * Without HostDb: `SelectorMapper`, `SelectorReducer`, `Selector` using 
`FloatWritable` keys and `DecreasingFloatComparator` for optimized raw byte 
comparison
   * Modified the `generate()` method to conditionally configure the job based 
on whether HostDb is provided
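   
   The branching can be pictured with the following sketch, which only 
models the choice of components as plain strings. `Plan` and `planFor` are 
hypothetical names, and treating a null or empty path as "HostDb not 
provided" is an assumption. In the real job these choices would be applied 
through Hadoop's `Job` setters such as `setMapperClass`, `setReducerClass`, 
`setMapOutputKeyClass`, and `setSortComparatorClass`.

```java
// Illustrative only: models the two code paths as data, not the actual generate() logic.
final class GeneratorJobPlanSketch {

    // Hypothetical holder for the components chosen for one job.
    static final class Plan {
        final String mapper;
        final String reducer;
        final String mapOutputKey;
        final String sortComparator;

        Plan(String mapper, String reducer, String mapOutputKey, String sortComparator) {
            this.mapper = mapper;
            this.reducer = reducer;
            this.mapOutputKey = mapOutputKey;
            this.sortComparator = sortComparator;
        }

        @Override
        public String toString() {
            return mapper + " / " + reducer + " / " + mapOutputKey + " / " + sortComparator;
        }
    }

    // Assumption: a null or empty path means HostDb was not provided.
    static Plan planFor(String hostDbPath) {
        boolean useHostDb = hostDbPath != null && !hostDbPath.isEmpty();
        if (useHostDb) {
            // Composite keys with secondary sorting.
            return new Plan("SelectorMapperWithHostDb", "SelectorReducerWithHostDb",
                    "FloatTextPair", "ScoreHostKeyComparator");
        }
        // Plain float keys with raw byte comparison.
        return new Plan("SelectorMapper", "SelectorReducer",
                "FloatWritable", "DecreasingFloatComparator");
    }

    public static void main(String[] args) {
        System.out.println("no HostDb:   " + planFor(null));
        System.out.println("with HostDb: " + planFor("crawl/hostdb"));
    }
}
```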
   
   If you could please test and profile again it would be greatly appreciated. 
   
   When `generate.hostdb` is not configured, the Generator uses the original 
optimized code path. There are no changes to configuration properties or the 
command-line interface, so existing Generator workflows remain unaffected, 
which is very handy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
