[ 
https://issues.apache.org/jira/browse/NUTCH-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088323#comment-18088323
 ] 

Sebastian Nagel commented on NUTCH-3170:
----------------------------------------

[~tanmay08], if you want to provide a PR for this feature, please, move forward.

> Improve in-memory outlink deduplication: Replace HashSet/hashCode with 
> BloomFilter
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-3170
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3170
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>         Environment: strong text
>            Reporter: tanmay
>            Priority: Major
>              Labels: feather, optimization, performance
>
> Currently, when the property fetcher.follow.outlinks.depth > 0 is used, the 
> Fetcher uses an in-memory Set<Integer> named alreadyFetched to track which 
> outlinks have already been followed during the current execution cycle. It 
> stores url.toString().hashCode() (a 32-bit Java hash) inside a HashSet.
> As mentioned in the comments in FetchItemQueue.java (lines 56-58), the core 
> perspective behind using hashCode() instead of the raw String was strictly to 
> save memory and avoid OutOfMemory exceptions at scale:
> // keep track of duplicates if fetcher.follow.outlinks.depth > 0. Some urls 
> may 
> // not get followed due to hash collisions. Hashing is used to reduce memory 
> usage.
> While we know that Nutch's map-reduce CrawlDB (updatedb) handles these 
> collision omissions reliably in the next round, the 32-bit HashCode 
> implementation presents an unnecessary bottleneck during the immediate fetch 
> cycle due to inevitable hash collisions, which cause valid URLs to be skipped 
> arbitrarily.
> *Proposed Solution:* We can vastly improve both the memory efficiency and the 
> collision rate by replacing the HashSet<Integer> with a Bloom Filter.
> Why Bloom Filter is the perfect fit for this project:
> Unmatched Space Efficiency: A HashSet<Integer> incurs standard Java object 
> overheads (~32 bytes per entry). A Bloom Filter operates strictly on bits. 
> Tracking 1,000,000 URLs translates to barely ~1-2 MBs of RAM usage. It 
> strictly aligns with the original developer's goal of "reducing memory usage".
> Controlled Collision Rates: Instead of blindly accepting the high collision 
> rate of a single 32-bit hash, a Bloom Filter allows us to explicitly define a 
> False Positive Probability (e.g., 0.001% or 0.01%). This makes the immediate 
> fetch optimization highly deterministic.
> Speed: K-hashes in a Bloom Filter are computed efficiently with O(k) 
> complexity.
> I am aware that a Bloom Filter comes with a specific trade-off: elements 
> cannot be deleted. However, since the FetcherQueue short-term cache only 
> requires contains() and add() operations (without any removals during a 
> fetcher cycle), a Bloom Filter is the ideal data structure for this use case.
> I would like to contribute a patch for this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to