[
https://issues.apache.org/jira/browse/NUTCH-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088323#comment-18088323
]
Sebastian Nagel commented on NUTCH-3170:
----------------------------------------
[~tanmay08], if you want to provide a PR for this feature, please, move forward.
> Improve in-memory outlink deduplication: Replace HashSet/hashCode with
> BloomFilter
> ----------------------------------------------------------------------------------
>
> Key: NUTCH-3170
> URL: https://issues.apache.org/jira/browse/NUTCH-3170
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Environment: strong text
> Reporter: tanmay
> Priority: Major
> Labels: feather, optimization, performance
>
> Currently, when the property fetcher.follow.outlinks.depth > 0 is used, the
> Fetcher uses an in-memory Set<Integer> named alreadyFetched to track which
> outlinks have already been followed during the current execution cycle. It
> stores url.toString().hashCode() (a 32-bit Java hash) inside a HashSet.
> As mentioned in the comments in FetchItemQueue.java (lines 56-58), the core
> perspective behind using hashCode() instead of the raw String was strictly to
> save memory and avoid OutOfMemory exceptions at scale:
> // keep track of duplicates if fetcher.follow.outlinks.depth > 0. Some urls
> may
> // not get followed due to hash collisions. Hashing is used to reduce memory
> usage.
> While we know that Nutch's map-reduce CrawlDB (updatedb) handles these
> collision omissions reliably in the next round, the 32-bit HashCode
> implementation presents an unnecessary bottleneck during the immediate fetch
> cycle due to inevitable hash collisions, which cause valid URLs to be skipped
> arbitrarily.
> *Proposed Solution:* We can vastly improve both the memory efficiency and the
> collision rate by replacing the HashSet<Integer> with a Bloom Filter.
> Why Bloom Filter is the perfect fit for this project:
> Unmatched Space Efficiency: A HashSet<Integer> incurs standard Java object
> overheads (~32 bytes per entry). A Bloom Filter operates strictly on bits.
> Tracking 1,000,000 URLs translates to barely ~1-2 MBs of RAM usage. It
> strictly aligns with the original developer's goal of "reducing memory usage".
> Controlled Collision Rates: Instead of blindly accepting the high collision
> rate of a single 32-bit hash, a Bloom Filter allows us to explicitly define a
> False Positive Probability (e.g., 0.001% or 0.01%). This makes the immediate
> fetch optimization highly deterministic.
> Speed: K-hashes in a Bloom Filter are computed efficiently with O(k)
> complexity.
> I am aware that a Bloom Filter comes with a specific trade-off: elements
> cannot be deleted. However, since the FetcherQueue short-term cache only
> requires contains() and add() operations (without any removals during a
> fetcher cycle), a Bloom Filter is the ideal data structure for this use case.
> I would like to contribute a patch for this.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)