Oh, one more thing: while the example here is about typosquatting, this is useful well beyond that. Pretty much any large existence-style query ("is this value in some big reference set?") can be served by the same construction: build a summary bloom filter once and check values against it. There are other use-cases involving other summary data structures too.
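To make the construction concrete, here is a minimal sketch of a summary object with the initialize / update / merge steps from the proposal quoted below. It is plain Python for illustration, not the code in the PR: the `BloomFilter` class, the `summarize` helper, the filter sizes, and the sample domains are all made up for this example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter used as a summary 'state object' (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # num_bits is assumed to be a multiple of 8 so it maps onto whole bytes.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

    def merge(self, other):
        # The union of two filters with identical parameters is a bitwise OR.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("cannot merge filters with different parameters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return merged


def summarize(rows):
    """Initialize a state object, then update it once per row."""
    state = BloomFilter()
    for row in rows:
        state.add(row)
    return state


# Two partial summaries, as two loader threads might produce, merged into one.
part_a = summarize(["gogle.com", "googel.com"])
part_b = summarize(["amazn.com"])
summary = part_a.merge(part_b)
```

A lookup is then just `summary.might_contain(domain)` against an in-memory object, with no network hop. The accuracy trade-off is that a bloom filter can return false positives but never false negatives, so a hit may still be worth confirming against the authoritative set.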
On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:

> Hi all,
>
> I wanted to get some feedback on a sensible plan for something. It
> occurred to me the other day, when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare
> domains against that set as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase. I thought, however, that another
> approach would be to construct a bloom filter that includes the set of
> typosquatted domains, which trades off some accuracy to remove the
> network hop and potential disk seek.
>
> The challenge was that we don't have a way to do this currently. We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see
> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> which handles:
>
> - parsing flat files
> - transforming the rows
> - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
> - initialize a state object
> - update the object for every row
> - merge the state objects (needed only with multiple threads; with a
>   single thread there is nothing to merge)
>
> I think this is a sensible decision because:
>
> - It's a minimal movement from the flat file loader
> - It uses the same configs
> - It abstracts and reuses the existing infrastructure
> - Having one extractor config means that it should be easier to
>   generate a UI around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an existing difficult config.
> I propose that this is an initial step forward to support the use-case and
> we can enable something more composable going forward. My concern in
> considering the composable route as the first step was that composable
> units for data transformation and manipulation suddenly take us into a
> place where Stellar starts to look like Pig or the Spark RDD API. I wasn't
> ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community, after reviewing
> the entire use-case at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection
> is:
>
> - Is this so confusing that it does not belong in Metron even as a
>   first step?
> - Is there a way to extend the extractor config in a less confusing
>   way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.