lewismc opened a new pull request, #888:
URL: https://github.com/apache/nutch/pull/888

   This PR proposes a fix for [NUTCH-2455](https://issues.apache.org/jira/browse/NUTCH-2455) and supersedes https://github.com/apache/nutch/pull/254/
   
   This PR implements scalable HostDb integration in the Generator using MapReduce secondary sorting, which eliminates the need to load the entire HostDb into memory.
   
   ## Problem
   
   The previous implementation loaded the entire HostDb into memory at reducer 
startup. For crawls with millions of hosts, this caused:
   - High memory consumption (O(HostDb size) per reducer)
   - OutOfMemoryError for large HostDbs
   - Startup latency while loading data
   
   ## Solution
   
   Use MapReduce secondary sorting to stream HostDb entries through the 
pipeline:
   
   1. **Composite Key (`FloatTextPair`)**: Combines score and hostname so that a single sort can order both HostDb and CrawlDb records
   2. **Custom Comparator (`ScoreHostKeyComparator`)**: Ensures HostDb entries arrive before CrawlDb entries
   3. **MultipleInputs**: Reads both HostDb and CrawlDb in a single MapReduce job
   4. **Streaming Reducer**: Processes HostDb entries as they arrive, with no preloading required
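
The reducer side of the steps above can be sketched in plain Java. This is an illustration only, not the code in this PR: `StreamingReducerSketch`, `Entry`, and `select` are hypothetical names, and the sketch assumes each HostDb entry already carries its evaluated per-host limit and that the host of a CrawlDb entry is taken from its URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StreamingReducerSketch {

    // Hypothetical record type: hostname is non-empty only for HostDb entries.
    static final class Entry {
        final String hostname;
        final String url;
        final int maxCount;

        private Entry(String hostname, String url, int maxCount) {
            this.hostname = hostname;
            this.url = url;
            this.maxCount = maxCount;
        }

        static Entry hostDb(String hostname, int maxCount) {
            return new Entry(hostname, "", maxCount);
        }

        static Entry crawlDb(String url) {
            return new Entry("", url, 0);
        }

        boolean isHostDb() {
            return !hostname.isEmpty();
        }
    }

    // Expects input already sorted so that HostDb entries precede CrawlDb entries.
    static List<String> select(List<Entry> sorted, int defaultMaxCount) {
        Map<String, Integer> limits = new HashMap<>(); // only hosts seen by this reducer
        Map<String, Integer> counts = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Entry e : sorted) {
            if (e.isHostDb()) {
                // HostDb records arrive first: remember the per-host limit.
                limits.put(e.hostname, e.maxCount);
                continue;
            }
            String host = URI.create(e.url).getHost();
            int limit = limits.getOrDefault(host, defaultMaxCount);
            int seen = counts.merge(host, 1, Integer::sum);
            if (seen <= limit) {
                selected.add(e.url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> sorted = new ArrayList<>();
        sorted.add(Entry.hostDb("a.example", 1));
        sorted.add(Entry.crawlDb("http://a.example/1"));
        sorted.add(Entry.crawlDb("http://a.example/2"));
        sorted.add(Entry.crawlDb("http://b.example/1"));
        System.out.println(select(sorted, 2));
    }
}
```

Because the sort order guarantees that limits are known before the first CrawlDb entry is seen, the sketch never reads HostDb data outside the record stream it is handed.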
   
   ## Key Components
   
   ### FloatTextPair
   
   ```java
   public static class FloatTextPair implements WritableComparable<FloatTextPair> {
       public FloatWritable first;  // score (negative for HostDb)
       public Text second;          // hostname (empty for CrawlDb)
   }
   ```
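
For readers without a Hadoop classpath at hand, the serialization contract of such a pair can be sketched with `java.io` alone. `FloatStringPair` below is a hypothetical stand-in, not the class from this PR: it replaces `FloatWritable` and `Text` with `float` and `String`, mirrors the `write(DataOutput)` / `readFields(DataInput)` method shapes of Hadoop's `Writable`, and its natural ordering (score, then hostname) is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Stand-in for FloatTextPair that depends only on the JDK.
class FloatStringPair implements Comparable<FloatStringPair> {
    float first;    // score
    String second;  // hostname (empty for CrawlDb)

    FloatStringPair() {
        this(0f, "");
    }

    FloatStringPair(float first, String second) {
        this.first = first;
        this.second = second;
    }

    // Same method shape as Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeFloat(first);
        out.writeUTF(second);
    }

    // Same method shape as Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
        first = in.readFloat();
        second = in.readUTF();
    }

    // Assumed natural order: ascending score, then hostname.
    @Override
    public int compareTo(FloatStringPair o) {
        int c = Float.compare(first, o.first);
        return c != 0 ? c : second.compareTo(o.second);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FloatStringPair)) {
            return false;
        }
        FloatStringPair p = (FloatStringPair) o;
        return Float.compare(first, p.first) == 0 && second.equals(p.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }

    // Serializes the pair to bytes and reads it back into a fresh instance.
    static FloatStringPair roundTrip(FloatStringPair in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            in.write(new DataOutputStream(bytes));
            FloatStringPair out = new FloatStringPair();
            out.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round trip such as `FloatStringPair.roundTrip(new FloatStringPair(-Float.MAX_VALUE, "example.org"))` should yield a pair equal to the original, which is the property the serialization unit tests listed under Testing are described as covering.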
   
   ### ScoreHostKeyComparator
   
   Sorting order:
   1. HostDb entries first (non-empty hostname), sorted by hostname
   2. CrawlDb entries second (empty hostname), sorted by score descending
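
The ordering above can be illustrated with a plain `java.util.Comparator`. This is a sketch of the rule as described here, not the `ScoreHostKeyComparator` source; `ComparatorOrderSketch`, `Key`, and `sortedLabels` are hypothetical names, and a Hadoop sort comparator would typically be registered on the job rather than passed to `List.sort`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ComparatorOrderSketch {

    // Hypothetical key: hostname is empty for CrawlDb entries.
    static final class Key {
        final float score;
        final String hostname;

        Key(float score, String hostname) {
            this.score = score;
            this.hostname = hostname;
        }

        @Override
        public String toString() {
            return hostname.isEmpty() ? "crawl:" + score : "host:" + hostname;
        }
    }

    static final Comparator<Key> ORDER = (a, b) -> {
        boolean aHost = !a.hostname.isEmpty();
        boolean bHost = !b.hostname.isEmpty();
        if (aHost != bHost) {
            return aHost ? -1 : 1;                  // HostDb entries sort first
        }
        if (aHost) {
            return a.hostname.compareTo(b.hostname); // HostDb: by hostname
        }
        return Float.compare(b.score, a.score);      // CrawlDb: score descending
    };

    // Sorts a copy of the keys and returns their string labels in order.
    static List<String> sortedLabels(Key... keys) {
        List<Key> list = new ArrayList<>(Arrays.asList(keys));
        list.sort(ORDER);
        List<String> labels = new ArrayList<>();
        for (Key k : list) {
            labels.add(k.toString());
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(sortedLabels(
            new Key(0.5f, ""),
            new Key(-Float.MAX_VALUE, "b.example"),
            new Key(0.9f, ""),
            new Key(-Float.MAX_VALUE, "a.example")));
        // prints [host:a.example, host:b.example, crawl:0.9, crawl:0.5]
    }
}
```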
   
   ### HostDbReaderMapper
   
   Reads HostDb and emits each entry under a sentinel key (score `-Float.MAX_VALUE`, non-empty hostname) so that it sorts before CrawlDb entries:
   
   ```java
   context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);
   ```
   
   ## Configuration
   
   | Property | Description |
   |----------|-------------|
   | `generate.hostdb` | Path to HostDb (enables feature) |
   | `generate.max.count.expr` | JEXL expression for per-host URL limit |
   | `generate.fetch.delay.expr` | JEXL expression for per-host fetch delay |
   
   ### Example JEXL Expressions
   
   ```xml
   <!-- Limit hosts with many failures to 10 URLs -->
   <property>
     <name>generate.max.count.expr</name>
     <value>connectionFailures > 100 ? 10 : 1000</value>
   </property>
   
   <!-- Increase delay for unreliable hosts -->
   <property>
     <name>generate.fetch.delay.expr</name>
     <value>connectionFailures > 50 ? 5000 : 1000</value>
   </property>
   ```
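
To make the thresholds concrete, the two expressions above behave like the following Java ternaries. This is only a plain-Java restatement for illustration: it does not use the JEXL engine, `JexlExpressionSketch` and its method names are hypothetical, and it assumes `connectionFailures` is an integer count.

```java
class JexlExpressionSketch {

    // Mirrors: connectionFailures > 100 ? 10 : 1000
    static int maxCount(long connectionFailures) {
        return connectionFailures > 100 ? 10 : 1000;
    }

    // Mirrors: connectionFailures > 50 ? 5000 : 1000
    static long fetchDelay(long connectionFailures) {
        return connectionFailures > 50 ? 5000 : 1000;
    }

    public static void main(String[] args) {
        for (long failures : new long[] {0, 51, 101}) {
            System.out.println(failures + " failures: maxCount=" + maxCount(failures)
                + ", fetchDelay=" + fetchDelay(failures));
        }
    }
}
```

Note that both comparisons are strict, so a host with exactly 100 failures keeps the 1000-URL limit and a host with exactly 50 failures keeps the 1000 delay value.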
   
   ## Performance
   
   | Aspect | Before | After |
   |--------|--------|-------|
   | Memory per reducer | O(H) where H = total hosts | O(P) where P = hosts in partition |
   | Startup time | Load entire HostDb | None (streaming) |
   | Scalability | Limited by JVM heap | Scales with cluster size |
   
   ## Backward Compatibility
   
   - When `generate.hostdb` is not set, behavior is unchanged
   - Existing configurations continue to work
   - JEXL expressions are evaluated only when a HostDb is provided
   
   ## Testing
   
   - **Unit tests (9):** FloatTextPair serialization, equality, comparison; 
ScoreHostKeyComparator ordering
   - **Integration tests (3):** Variable max count, variable fetch delay, 
backward compatibility
   

