Simon Elliston Ball created METRON-1534:
-------------------------------------------
Summary: Typosquat Detection via Bloom filters overlaps
Key: METRON-1534
URL: https://issues.apache.org/jira/browse/METRON-1534
Project: Metron
Issue Type: Bug
Affects Versions: 0.4.3
Reporter: Simon Elliston Ball
The typosquat detection use case overpopulates the Bloom filter.
For example, using the Alexa 10k set, cnn.com and bbc.co.uk are both detected
as typosquats. Although legitimate themselves, they appear in the DNS twists of
other legitimate domains (e.g. bbc is a typosquat of rbc).
The problem gets worse with a longer list of legitimate domains, such
as the Alexa 1m.
Values that are included in the raw 'good' source need to be kept out of the
Bloom filter additions. This is hard to do in a space- and compute-efficient
way with the current implementation, given the need to effectively
join the full input set (smallish) with the generated set (very large).
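
A minimal sketch of the exclusion described above, in Python rather than
Metron's own tooling. It assumes the 'good' set is small enough to hold in
memory as a hash set, so each generated variant can be checked as it is
produced instead of joining the two sets afterwards. The BloomFilter class,
single_char_variants and build_filter are hypothetical names for illustration
only; the variant generator covers one-letter substitutions and is a stand-in
for a real typosquat generator.

```python
import hashlib
import string


class BloomFilter:
    """Tiny illustrative Bloom filter; not Metron's implementation."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def single_char_variants(label):
    """Hypothetical typosquat generator: one-letter substitutions only."""
    for i, original in enumerate(label):
        for ch in string.ascii_lowercase:
            if ch != original:
                yield label[:i] + ch + label[i + 1:]


def build_filter(good_labels):
    good = set(good_labels)  # the smallish input set, held in memory
    bloom = BloomFilter()
    for label in good:
        for variant in single_char_variants(label):
            # Skip variants that are themselves legitimate domains.
            if variant not in good:
                bloom.add(variant)
    return bloom


bloom = build_filter(["bbc", "rbc", "cnn"])
```

With this filter, "bbd" (a variant of bbc) is reported as a typosquat, while
bbc itself is never added even though it is a one-letter variant of rbc.
Whether the same approach is practical inside the current implementation is
exactly the open question in this issue.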
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)