[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051427#comment-18051427
 ] 

ASF GitHub Bot commented on NUTCH-2455:
---------------------------------------

lewismc opened a new pull request, #888:
URL: https://github.com/apache/nutch/pull/888

   This PR is proposed as a fix for 
[NUTCH-2455](https://issues.apache.org/jira/browse/NUTCH-2455) and also to 
supersede https://github.com/apache/nutch/pull/254/
   
   In essence, this PR implements scalable HostDb integration in the Generator 
using MapReduce secondary sorting, eliminating the need to load the entire 
HostDb into memory.
   
   ## Problem
   
   The previous implementation loaded the entire HostDb into memory at reducer 
startup. For crawls with millions of hosts, this caused:
   - High memory consumption (O(HostDb size) per reducer)
   - OutOfMemoryError for large HostDbs
   - Startup latency while loading data
   
   ## Solution
   
   Use MapReduce secondary sorting to stream HostDb entries through the 
pipeline:
   
   1. **Composite Key (`FloatTextPair`)**: Combines score and hostname to 
enable sorting
   2. **Custom Comparator (`ScoreHostKeyComparator`)**: Ensures HostDb entries 
arrive before CrawlDb entries
   3. **MultipleInputs**: Reads both HostDb and CrawlDb in a single MapReduce 
job
   4. **Streaming Reducer**: Processes HostDb entries as they arrive, with no 
preloading required
   
   ## Key Components
   
   ### FloatTextPair
   
   ```java
   public static class FloatTextPair implements 
WritableComparable<FloatTextPair> {
       public FloatWritable first;  // score (negative for HostDb)
       public Text second;          // hostname (empty for CrawlDb)
   }
   ```
   
   ### ScoreHostKeyComparator
   
   Sorting order:
   1. HostDb entries first (non-empty hostname), sorted by hostname
   2. CrawlDb entries second (empty hostname), sorted by score descending
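
The ordering above can be sketched in plain Java, without Hadoop types. This is only an illustration of the two rules as listed, not the comparator from the PR: `SketchKey`, `SketchKeyOrder`, and `SketchKeyDemo` are hypothetical names, and the sketch assumes a HostDb key is recognised solely by its non-empty hostname.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical plain-Java stand-in for the composite key (no Hadoop Writable types).
final class SketchKey {
    final float score;   // sentinel value for HostDb entries, real score for CrawlDb
    final String host;   // non-empty for HostDb entries, empty for CrawlDb entries

    SketchKey(float score, String host) {
        this.score = score;
        this.host = host;
    }
}

// Assumed ordering: HostDb keys first (by hostname), then CrawlDb keys (score descending).
final class SketchKeyOrder implements Comparator<SketchKey> {
    @Override
    public int compare(SketchKey a, SketchKey b) {
        boolean aHostDb = !a.host.isEmpty();
        boolean bHostDb = !b.host.isEmpty();
        if (aHostDb != bHostDb) {
            return aHostDb ? -1 : 1;             // HostDb sorts before CrawlDb
        }
        if (aHostDb) {
            return a.host.compareTo(b.host);     // both HostDb: order by hostname
        }
        return Float.compare(b.score, a.score);  // both CrawlDb: higher score first
    }
}

final class SketchKeyDemo {
    // Builds a small mixed list and returns it in the sketched order.
    static List<SketchKey> sorted() {
        List<SketchKey> keys = new ArrayList<>();
        keys.add(new SketchKey(0.5f, ""));
        keys.add(new SketchKey(-Float.MAX_VALUE, "b.example"));
        keys.add(new SketchKey(2.0f, ""));
        keys.add(new SketchKey(-Float.MAX_VALUE, "a.example"));
        keys.sort(new SketchKeyOrder());
        return keys;
    }
}
```

In a real job the same logic would presumably live in a Hadoop sort comparator operating on `FloatTextPair`; the plain `Comparator` is used here only so the ordering can be tried in isolation.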
   
   ### HostDbReaderMapper
   
   Reads the HostDb and emits each entry under a special key so that it sorts 
before all CrawlDb entries:
   
   ```java
   context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);
   ```
   
   ## Configuration
   
   | Property | Description |
   |----------|-------------|
   | `generate.hostdb` | Path to HostDb (enables feature) |
   | `generate.max.count.expr` | JEXL expression for per-host URL limit |
   | `generate.fetch.delay.expr` | JEXL expression for per-host fetch delay |
   
   ### Example JEXL Expressions
   
   ```xml
   <!

> Use secondary sorting for memory-efficient HostDb integration in Generator
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-2455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2455
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.22
>
>         Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
