Simon Elliston Ball created METRON-1534:
-------------------------------------------

             Summary: Typosquat Detection via Bloom filters overlaps
                 Key: METRON-1534
                 URL: https://issues.apache.org/jira/browse/METRON-1534
             Project: Metron
          Issue Type: Bug
    Affects Versions: 0.4.3
            Reporter: Simon Elliston Ball


The typosquat detection use case overpopulates the Bloom filter: generated 
typosquat candidates are added even when they are legitimate domains.

For example, using the Alexa 10k set, cnn.com and bbc.co.uk are both detected 
as typosquats. While legitimate in themselves, they appear in the DNS twists 
of other legitimate domains (e.g. bbc is a typosquat of rbc).

This problem is further accentuated by a larger set of legitimate domains, 
such as the Alexa 1m.

Bloom filter additions need to be prevented for values which are included 
in the raw 'good' source. This is hard to do in a space- and compute-efficient 
way with the current implementation, given the need to effectively join the 
full input set (smallish) with the generated set (very large).
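
As a rough illustration of the intended behaviour (not the Metron 
implementation), the Python sketch below keeps the small 'good' set in memory 
and streams the generated candidates past it, so a candidate is only added to 
the filter if it is not itself a known good name. The BloomFilter class, the 
variants() generator and build_filter() are all hypothetical stand-ins: the 
real use case relies on a DNS twist generator and Metron's own Bloom filter, 
and the sizing parameters here are arbitrary.

```python
import hashlib
import string


class BloomFilter:
    """Tiny Bloom filter for illustration only; not Metron's implementation."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def variants(name):
    """Assumed stand-in for a DNS twist generator: single-character
    omissions, substitutions and adjacent transpositions."""
    for i in range(len(name)):
        yield name[:i] + name[i + 1:]  # omission
        for c in string.ascii_lowercase:
            if c != name[i]:
                yield name[:i] + c + name[i + 1:]  # substitution
        if i + 1 < len(name):
            # transposition of characters i and i + 1
            yield name[:i] + name[i + 1] + name[i] + name[i + 2:]


def build_filter(legit_names):
    # The 'good' source is smallish, so hold it in memory as a set.
    legit = set(legit_names)
    bloom = BloomFilter()
    for name in legit:
        # The generated set is very large, so stream it rather than store it.
        for candidate in variants(name):
            # Skip empty strings and twists that are themselves legitimate.
            if candidate and candidate not in legit:
                bloom.add(candidate)
    return bloom
```

With build_filter(["rbc", "bbc", "cnn"]), the candidate "bbc" generated from 
"rbc" is skipped because it is in the good set, while a candidate such as 
"rbd" is still added. The sketch assumes the good set fits in memory on every 
worker; whether that holds for the Alexa 1m in the current pipeline is exactly 
the open question raised above.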



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
