lewismc opened a new pull request, #888: URL: https://github.com/apache/nutch/pull/888
This PR is proposed as a fix for [NUTCH-2455](https://issues.apache.org/jira/browse/NUTCH-2455) and also to supersede https://github.com/apache/nutch/pull/254/.

In essence, this PR implements scalable HostDb integration in the Generator using MapReduce secondary sorting, eliminating the need to load the entire HostDb into memory.

## Problem

The previous implementation loaded the entire HostDb into memory at reducer startup. For crawls with millions of hosts, this caused:

- High memory consumption (O(HostDb size) per reducer)
- OutOfMemoryError for large HostDbs
- Startup latency while loading data

## Solution

Use MapReduce secondary sorting to stream HostDb entries through the pipeline:

1. **Composite Key (`FloatTextPair`)**: Combines score and hostname to enable sorting
2. **Custom Comparator (`ScoreHostKeyComparator`)**: Ensures HostDb entries arrive before CrawlDb entries
3. **MultipleInputs**: Reads both HostDb and CrawlDb in a single MapReduce job
4. **Streaming Reducer**: Processes HostDb entries as they arrive, no preloading required

## Key Components

### FloatTextPair

```java
public static class FloatTextPair implements WritableComparable<FloatTextPair> {
  public FloatWritable first;  // score (negative for HostDb)
  public Text second;          // hostname (empty for CrawlDb)
}
```

### ScoreHostKeyComparator

Sorting order:

1. HostDb entries first (non-empty hostname), sorted by hostname
2. CrawlDb entries second (empty hostname), sorted by score descending

### HostDbReaderMapper

Reads HostDb and emits with a special key to ensure sorting before CrawlDb entries:

```java
context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);
```

## Configuration

| Property | Description |
|----------|-------------|
| `generate.hostdb` | Path to HostDb (enables feature) |
| `generate.max.count.expr` | JEXL expression for per-host URL limit |
| `generate.fetch.delay.expr` | JEXL expression for per-host fetch delay |

### Example JEXL Expressions

```xml
<!-- Limit hosts with many failures to 10 URLs -->
<property>
  <name>generate.max.count.expr</name>
  <value>connectionFailures > 100 ? 10 : 1000</value>
</property>

<!-- Increase delay for unreliable hosts -->
<property>
  <name>generate.fetch.delay.expr</name>
  <value>connectionFailures > 50 ? 5000 : 1000</value>
</property>
```

## Performance

| Aspect | Before | After |
|--------|--------|-------|
| Memory per reducer | O(H) where H = total hosts | O(P) where P = hosts in partition |
| Startup time | Load entire HostDb | None (streaming) |
| Scalability | Limited by JVM heap | Scales with cluster size |

## Backward Compatibility

- When `generate.hostdb` is not set, behavior is unchanged
- Existing configurations continue to work
- JEXL expressions are only evaluated when HostDb is provided

## Testing

- **Unit tests (9):** FloatTextPair serialization, equality, comparison; ScoreHostKeyComparator ordering
- **Integration tests (3):** Variable max count, variable fetch delay, backward compatibility
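
### Sketch: ScoreHostKeyComparator ordering

The sorting order described under ScoreHostKeyComparator can be illustrated without any Hadoop dependencies. The following is a minimal sketch, not the code from this PR: `SecondarySortSketch`, `Key`, `ORDER`, and `sorted` are hypothetical names, and `Key` is a plain-Java stand-in for `FloatTextPair` (a `float` and a `String` instead of `FloatWritable` and `Text`). It assumes only the two rules stated above: keys with a non-empty hostname sort first by hostname, and keys with an empty hostname follow, ordered by score descending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

  /** Hypothetical stand-in for FloatTextPair: a score plus a hostname. */
  public static final class Key {
    public final float score;  // negative sentinel for HostDb entries
    public final String host;  // empty for CrawlDb entries

    public Key(float score, String host) {
      this.score = score;
      this.host = host;
    }

    @Override
    public String toString() {
      return "(" + score + ", \"" + host + "\")";
    }
  }

  /** HostDb keys first, by hostname; then CrawlDb keys, by score descending. */
  public static final Comparator<Key> ORDER = (a, b) -> {
    boolean aIsHostDb = !a.host.isEmpty();
    boolean bIsHostDb = !b.host.isEmpty();
    if (aIsHostDb != bIsHostDb) {
      return aIsHostDb ? -1 : 1;           // HostDb entries always come first
    }
    if (aIsHostDb) {
      return a.host.compareTo(b.host);     // both HostDb: order by hostname
    }
    return Float.compare(b.score, a.score); // both CrawlDb: highest score first
  };

  /** Returns a sorted copy, leaving the input list untouched. */
  public static List<Key> sorted(List<Key> keys) {
    List<Key> copy = new ArrayList<>(keys);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<Key> keys = Arrays.asList(
        new Key(0.5f, ""),
        new Key(-Float.MAX_VALUE, "b.example.org"),
        new Key(2.0f, ""),
        new Key(-Float.MAX_VALUE, "a.example.org"));
    for (Key k : sorted(keys)) {
      System.out.println(k);
    }
  }
}
```

Running `main` lists the two HostDb keys (`a.example.org`, then `b.example.org`) ahead of the CrawlDb keys with scores 2.0 and 0.5. Because host metadata reaches the reducer before any URL entry, the reducer can consume it in a single pass instead of preloading it.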

