[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...

2018-01-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/metron/pull/882


---


[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...

2018-01-08 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/metron/pull/882#discussion_r160269501
  
--- Diff: use-cases/typosquat_detection/README.md ---
@@ -0,0 +1,448 @@
+
+# Problem Statement
+
+[Typosquatting](https://en.wikipedia.org/wiki/Typosquatting) is a form of cybersquatting which relies on
+likely typos to trick unsuspecting users into visiting possibly malicious URLs.  In the best case, this is a
+mischievous joke as in the following RickRoll: [http://www.latlmes.com/breaking/apache-metron-named-best-software-by-asf-1](http://www.latlmes.com/breaking/apache-metron-named-best-software-by-asf-1).
+In the worst case, however, it can be overtly malicious, as Bitcoin users found out in [2014](https://nakedsecurity.sophos.com/2014/03/24/bitcoin-user-loses-10k-to-typosquatters/)
+when thousands of dollars' worth of Bitcoin was stolen as part of a phishing attack which used typosquatting.
+
+It is therefore useful to detect so-called typosquatting attacks as they appear over the network.  We
+have had for some time, through the flatfile loader and open source typosquatting generation tools such
+as [DNS Twist](https://github.com/elceef/dnstwist), the ability to generate potential typosquatted domains,
+import them into HBase and look them up via `ENRICHMENT_EXISTS`.
+
+This approach is entirely viable, but it has some challenges:
+* Even for modest numbers of domains, the number of records can grow quite large.  The Alexa top 10k domains have on the order of 3 million potential typosquatted domains.
+* A lookup still requires a network hop when the entry is not in the cache.
+
+# The Tools Metron Provides
+
+## Bloom Filters
+
+It would be nice to have a local solution for these types of problems that may trade off accuracy for better
+locality and space.  Those who have been following the general theme of Metron's analytics philosophy will see
+that we are likely in the domain where a probabilistic sketching data structure is in order.  In this case, we
+are asking simple existence queries, so a [Bloom Filter](https://en.wikipedia.org/wiki/Bloom_filter) fits
+well here.
+
+In Metron, we have the ability to create, add to and merge bloom filters via:
+* `BLOOM_INIT( size, fpp)` - Creates a bloom filter to handle `size` number of elements with `fpp` probability of false positives (`0 < fpp < 1`).
+* `BLOOM_ADD( filter, object)` - Adds an item to an existing bloom filter.
+* `BLOOM_MERGE( filters )` - Merges `filters`, a list of bloom filters.
+
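To make the three operations above concrete, here is a minimal Python sketch of a bloom filter with the same shape: initialize from an expected size and false-positive probability, add items, and merge a list of filters. This is only an illustration of the data structure, not Metron's implementation. The class name `ToyBloomFilter`, the SHA-256 double-hashing scheme and the textbook sizing formulas are all assumptions made for this example.

```python
import hashlib
import math


class ToyBloomFilter:
    """Illustrative bloom filter; NOT Metron's implementation."""

    def __init__(self, size, fpp):
        if size <= 0 or not 0 < fpp < 1:
            raise ValueError("need size > 0 and 0 < fpp < 1")
        # Textbook sizing: bit count m and hash count k for `size` items at rate `fpp`.
        self.num_bits = max(1, math.ceil(-size * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / size * math.log(2)))
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, item):
        # Double hashing (an assumption of this sketch): derive k positions from one digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
        return self

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    @staticmethod
    def merge(filters):
        # Filters built with identical parameters merge by OR-ing their bit sets.
        first = filters[0]
        merged = ToyBloomFilter.__new__(ToyBloomFilter)
        merged.num_bits, merged.num_hashes, merged.bits = first.num_bits, first.num_hashes, 0
        for f in filters:
            if (f.num_bits, f.num_hashes) != (first.num_bits, first.num_hashes):
                raise ValueError("can only merge filters with identical parameters")
            merged.bits |= f.bits
        return merged
```

Because merging is a bitwise OR, filters built independently (for example one per thread) can be combined without revisiting the original items, as long as they share the same parameters.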
+## Typosquatting Domain Generation
+
+Now that we have a suitable data structure, we need a way to generate potential typosquatted domains for a
+given domain.  Following the good work of [DNS Twist](https://github.com/elceef/dnstwist), we have ported
+their set of typosquatting strategies to Metron:
+* Bitsquatting - See [here](http://dinaburg.org/bitsquatting.html)
+* Homoglyphs - Substituting characters with ASCII or Unicode analogues which are visually similar (e.g. `latlmes.com` for `latimes.com` as above)
+* Subdomain - Making part of the domain a subdomain (e.g. `am.azon.com`)
+* Hyphenation 
+* Insertion 
+* Addition 
+* Omission 
+* Repetition 
+* Replacement
+* Transposition
+* Vowel swapping
+
+The Stellar function in Metron is `DOMAIN_TYPOSQUAT( domain )`.  It is recommended to remove the TLD from the
+domain.  You can see it in action here with our RickRoll example above:
+```
+[Stellar]>>> 'latlmes' in DOMAIN_TYPOSQUAT( 'latimes')
+true
+```
+
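As a rough picture of what a few of the listed strategies produce, the following Python sketch generates candidates by omission, repetition, transposition, hyphenation, subdomain splitting and a very small homoglyph table. It is a simplified illustration, not the algorithm ported from DNS Twist. The function name `toy_typosquat` and the `lookalikes` table are hypothetical, and the real function covers more strategies and far more character substitutions.

```python
def toy_typosquat(domain):
    """Return candidate typosquats for `domain` (TLD already removed)."""
    # Tiny hypothetical homoglyph table, for illustration only.
    lookalikes = {"i": "l1", "l": "i1", "o": "0", "m": "n"}
    candidates = set()
    for i, ch in enumerate(domain):
        candidates.add(domain[:i] + domain[i + 1:])        # omission
        candidates.add(domain[:i] + ch + domain[i:])       # repetition
        for sub in lookalikes.get(ch, ""):
            candidates.add(domain[:i] + sub + domain[i + 1:])  # homoglyph
        if i > 0:
            candidates.add(domain[:i] + "." + domain[i:])  # subdomain
            candidates.add(domain[:i] + "-" + domain[i:])  # hyphenation
        if i + 1 < len(domain):
            # transposition of adjacent characters
            candidates.add(domain[:i] + domain[i + 1] + ch + domain[i + 2:])
    # The original domain is not a typosquat of itself, and an empty string is not a domain.
    candidates.discard(domain)
    candidates.discard("")
    return candidates
```

Calling `toy_typosquat('latimes')` yields a set that includes `latlmes`, mirroring the membership check shown in the Stellar session above.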
+## Generating Summaries
+
+We need a way to generate the summary sketches from flat data for this to work.  This is similar to, but
+somewhat different from, loading flat data into HBase.  Instead of each row in the file being loaded
+generating a record in HBase, what we want is for each record to contribute to the summary sketch and, at the
+end, to write out the summary sketch.
+
+For this purpose, we have a new utility `$METRON_HOME/bin/flatfile_summarizer.sh` to accompany
+`$METRON_HOME/bin/flatfile_loader.sh`.  The same extractor config is used, but we have 3 new configuration
+options:
+* `state_init` - Allows a state object to be initialized.  This is a string, so a single expression is created.  The output of this expression will be available as the `state` variable.
+* `state_update` - Allows a state object to be updated.  This is a map, so you can have temporary variables here.  Note that you can reference the `state` variable from this.
+* `state_merge` - Allows a list of states to be merged. This is a string, so a single expression.  There is a special field called `states` available, which is a list of the states (one per thread).  If this is not in existence, 

[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...

2018-01-08 Thread justinleet
Github user justinleet commented on a diff in the pull request:

https://github.com/apache/metron/pull/882#discussion_r160245549
  
--- Diff: use-cases/typosquat_detection/README.md ---
@@ -0,0 +1,448 @@

[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...

2018-01-08 Thread justinleet
Github user justinleet commented on a diff in the pull request:

https://github.com/apache/metron/pull/882#discussion_r160241987
  
--- Diff: use-cases/typosquat_detection/README.md ---
@@ -0,0 +1,448 @@

[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...

2017-12-30 Thread justinleet
Github user justinleet commented on a diff in the pull request:

https://github.com/apache/metron/pull/882#discussion_r159122512
  
--- Diff: use-cases/typosquat_detection/README.md ---
@@ -0,0 +1,431 @@
+# Problem Statement
--- End diff --

Can you please add the license header to this? 
https://github.com/apache/metron/pull/884 is close to going in and enforcing 
this, so I'm hoping to avoid impact to master.

```

```


---


[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...

2017-12-22 Thread cestella
GitHub user cestella reopened a pull request:

https://github.com/apache/metron/pull/882

METRON-1380: Create a typosquatting use-case (commit after METRON-1379, 
METRON-1377, METRON-1378)

## Contributor Comments
This is a documented use-case on how to use the following JIRAs (PRs) to 
detect typosquatting in-stream using bloom filters:
* METRON-1379 (#880)
* METRON-1377 (#878 )
* METRON-1378 (#879 )

The code here is a merger of the PRs above to allow reviewers to test the 
entire feature together.  The manual testing plan is to execute the 
typosquatting use-case 
[instructions](https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection).

## Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.  
Please refer to our [Development 
Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235)
 for the complete guide to follow for contributions.  
Please refer also to our [Build Verification 
Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview)
 for complete smoke testing guides.  


In order to streamline the review of the contribution we ask you follow 
these guidelines and ask you to double check the following:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to 
be created at [Metron 
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
- [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?


### For code changes:
- [x] Have you included steps to reproduce the behavior or problem that is 
being changed or addressed?
- [x] Have you included steps or a guide to how the change may be verified 
and tested manually?
- [x] Have you ensured that the full suite of tests and checks have been 
executed in the root metron folder via:
  ```
  mvn -q clean integration-test install && build_utils/verify_licenses.sh 
  ```

- [x] Have you written or updated unit tests and/or integration tests to 
verify your changes?
- [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
- [x] Have you verified the basic functionality of the build by building 
and running locally with Vagrant full-dev environment or the equivalent?

### For documentation related changes:
- [x] Have you ensured that the format looks appropriate for the output in 
which it is rendered by building and verifying the site-book? If not, then run 
the following commands and then verify the changes via 
`site-book/target/site/index.html`:

  ```
  cd site-book
  mvn site
  ```

 Note:
Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up 
for your personal repository such that your branches are built there before 
submitting a pull request.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cestella/incubator-metron typosquat_merge

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/metron/pull/882.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #882


commit a95014ed1e145f9133dd95dcbfbf7e9212401fef
Author: cstella 
Date:   2017-12-19T22:26:03Z

METRON-1377: Stellar function to generate typosquatted domains (similar to 
dnstwist)

commit 9c492c4540534fa72550aff330ce6c588f640965
Author: cstella 
Date:   2017-12-21T15:17:18Z

flatfile summarizer initial commit.

commit 71e63b2604ad94c51423762582e547184169d8a2
Author: cstella 
Date:   2017-12-21T15:20:48Z

Don't want to generate original domain as it's not a typosquatted domain

commit 42af879d5fc1623fd9b24dd24af687292d9bcc73
Author: cstella 
Date:   2017-12-21T16:20:10Z

Fixed homoglyph bug with ACE domains

commit 7ee3ab14b81b0cb3fd899cf082050b7e3fade63e
Author: cstella 
Date:   2017-12-21T17:04:58Z

Persistent bug..

commit 15681143e86913a69270d0a89e1c877e3d99
Author: cstella 
Date:   2017-12-21T18:50:58Z

typo

commit 0d1e7b304b926bae65a2d6b4c63dec565542ad7e
Author: cstella 
Date:   2017-12-21T18:51:50Z

Weirdness with international domains.

commit 

[GitHub] metron pull request #882: METRON-1380: Create a typosquatting use-case (comm...

2017-12-22 Thread cestella
Github user cestella closed the pull request at:

https://github.com/apache/metron/pull/882


---