Oh, gotcha. That makes sense. Thanks for clarifying.

On Wed, Jan 3, 2018 at 12:15 PM, Casey Stella <ceste...@gmail.com> wrote:

> It's actually many more than 1M. There are 1M domains, each domain could
> have upwards of 300 - 1000 possible typosquatted domains.
>
> You will notice from
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter
> that we are not adding the domain to the bloom filter, we're adding each
> domain generated from DOMAIN_TYPOSQUAT to the bloom filter. In fact, we
> would very specifically NOT want the base domain as that would not be an
> indication of typosquatting (going to google.com would be legit, going to
> goggle.com would not).
>
> On Wed, Jan 3, 2018 at 10:48 AM, Nick Allen <n...@nickallen.org> wrote:
>
> > Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
> > this will impact performance
> >
> > What exactly takes an hour? Adding 1M entries to a bloom filter? That
> > seems really high, unless I am not understanding something.
> >
> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <ceste...@gmail.com> wrote:
> >
> > > Thanks for the feedback, Nick.
> > >
> > > Regarding "IMHO, I'd rather not reinvent the wheel for text
> > > manipulation."
> > >
> > > I would argue that we are not reinventing the wheel for text
> > > manipulation as the extractor config exists already and we are doing
> > > a similar thing in the flatfile loader (in fact, the code is reused
> > > and merely extended). Transformation operations are already supported
> > > in our codebase in the extractor config, this PR has just added some
> > > hooks for stateful operations.
> > >
> > > Furthermore, we will need a configuration object to pass to the REST
> > > call if we are ever to create a UI around importing data into hbase
> > > or creating these summary objects.
> > >
> > > Regarding your example:
> > > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > >
> > > I'm very sympathetic to this type of extension, but it has some issues:
> > >
> > >    1. This implies a single-threaded addition to the bloom filter.
> > >       1. Even with 5 threads, it takes an hour for the full alexa 1m,
> > >          so I think this will impact performance
> > >       2. There's not a way to specify how to merge across threads if
> > >          we do make a multithread command line option
> > >    2. This restricts these kinds of operations to roles with heavy
> > >       unix CLI knowledge, which isn't often the types of people who
> > >       would be doing this type of operation
> > >    3. What if we need two variables passed to stellar?
> > >    4. This approach will be harder to move to Hadoop. Eventually we
> > >       will want to support data on HDFS being processed by Hadoop
> > >       (similar to flatfile loader), so instead of -m LOCAL being
> > >       passed for the flatfile summarizer you'd pass -m SPARK and the
> > >       processing would happen on the cluster
> > >       1. This is particularly relevant in this case as it's an
> > >          embarrassingly parallel problem in general
> > >
> > > In summary, while a CLI approach is attractive, I prefer the
> > > extractor config solution because it is the solution with the
> > > smallest iteration that:
> > >
> > >    1. Reuses existing metron extraction infrastructure
> > >    2. Provides the most solid base for the extensions that will be
> > >       sorely needed soon (and will keep it in parity with the
> > >       flatfile loader)
> > >    3. Provides the most solid base for a future UI extension in the
> > >       management UI to support both summarization and loading
> > >
> > > On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <n...@nickallen.org> wrote:
> > >
> > > > First off, I really do like the typosquatting use case and a lot
> > > > of what you have described.
> > > >
> > > > > We need a way to generate the summary sketches from flat data
> > > > > for this to work.
> > > > > ..
> > > >
> > > > I took this quote directly from your use case. Above is the point
> > > > that I'd like to discuss and what your proposed solutions center
> > > > on. This is what I think you are trying to do, at least with
> > > > PR #879 <https://github.com/apache/metron/pull/879>...
> > > >
> > > > (Q) Can we repurpose Stellar functions so that they can operate on
> > > > text stored in a file system?
> > > >
> > > > Whether we use the (1) Configuration or the (2) Function-based
> > > > approach that you described, fundamentally we are introducing new
> > > > ways to perform text manipulation inside of Stellar.
> > > >
> > > > IMHO, I'd rather not reinvent the wheel for text manipulation. It
> > > > would be painful to implement and maintain a bunch of Stellar
> > > > functions for text manipulation. People already have a large
> > > > number of tools available to do this and everyone has their
> > > > favorites. People are resistant to learning something new when
> > > > they already are familiar with another way to do the same thing.
> > > >
> > > > So then the question is, how else can we do this? My suggestion is
> > > > that rather than introducing text manipulation tools inside of
> > > > Stellar, we allow people to use the text manipulation tools they
> > > > already know, but with the Stellar functions that we already have.
> > > > And the obvious way to tie those two things together is the Unix
> > > > pipeline.
> > > >
> > > > A quick, albeit horribly incomplete, example to flesh this out a
> > > > bit more based on the example you have in PR #879
> > > > <https://github.com/apache/metron/pull/879>. This would allow me
> > > > to integrate Stellar with whatever external tools that I want.
> > > >
> > > > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >
> > > > On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > >
> > > > > I'll start this discussion off with my idea around a 2nd step
> > > > > that is more adaptable. I propose the following set of stellar
> > > > > functions backed by Spark in the metron-management project:
> > > > >
> > > > >    - CSV_PARSE(location, separator?, columns?) : Constructs a
> > > > >      Spark Dataframe for reading the flatfile
> > > > >    - SQL_TRANSFORM(dataframe, spark sql statement): Transforms
> > > > >      the dataframe
> > > > >    - SUMMARIZE(state_init, state_update, state_merge): Summarize
> > > > >      the dataframe using the lambda functions:
> > > > >       - state_init - executed once per worker to initialize the
> > > > >         state
> > > > >       - state_update - executed once per row
> > > > >       - state_merge - Merge the worker states into one worker
> > > > >         state
> > > > >    - OBJECT_SAVE(obj, output_path) : Save the object obj to the
> > > > >      path output_path on HDFS.
> > > > >
> > > > > This would enable more flexibility and composability than the
> > > > > configuration-based approach that we have in the flatfile loader.
> > > > >
> > > > > My concern with this approach, and the reason I didn't do it
> > > > > initially, was that I think that users will want at least 2 ways
> > > > > to summarize data (or load data):
> > > > >
> > > > >    - A configuration based approach, which enables a UI
> > > > >    - A set of stellar functions via the scriptable REPL
> > > > >
> > > > > I would argue that both have a place and I started with the
> > > > > configuration based approach as it was a more natural extension
> > > > > of what we already had. I'd love to hear thoughts about this
> > > > > idea too.
> > > > >
> > > > > On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I wanted to get some feedback on a sensible plan for
> > > > > > something. It occurred to me the other day when considering
> > > > > > the use-case of detecting typosquatted domains, that one
> > > > > > approach was to generate the set of typosquatted domains for
> > > > > > some set of reference domains and compare domains as they
> > > > > > flow through.
> > > > > >
> > > > > > One way we could do this would be to generate this data and
> > > > > > import the typosquatted domains into HBase. I thought,
> > > > > > however, that another approach, which may trade off accuracy
> > > > > > to remove the network hop and potential disk seek, would be
> > > > > > to construct a bloom filter that includes the set of
> > > > > > typosquatted domains.
> > > > > >
> > > > > > The challenge was that we don't have a way to do this
> > > > > > currently. We do, however, have a loading infrastructure
> > > > > > (e.g. the flatfile_loader) and configuration (see
> > > > > > https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> > > > > > which handles:
> > > > > >
> > > > > >    - parsing flat files
> > > > > >    - transforming the rows
> > > > > >    - filtering the rows
> > > > > >
> > > > > > To enable the new use-case of generating a summary object
> > > > > > (e.g. a bloom filter), in METRON-1378
> > > > > > (https://github.com/apache/metron/pull/879) I propose that we
> > > > > > create a new utility that uses the same extractor config and
> > > > > > adds the ability to:
> > > > > >
> > > > > >    - initialize a state object
> > > > > >    - update the object for every row
> > > > > >    - merge the state objects (in the case of multiple
> > > > > >      threads, in the case of one thread it's not needed).
> > > > > >
> > > > > > I think this is a sensible decision because:
> > > > > >
> > > > > >    - It's a minimal movement from the flat file loader
> > > > > >    - Uses the same configs
> > > > > >    - Abstracts and reuses the existing infrastructure
> > > > > >    - Having one extractor config means that it should be
> > > > > >      easier to generate a UI around this to simplify the
> > > > > >      experience
> > > > > >
> > > > > > All that being said, our extractor config is...shall we
> > > > > > say...daunting :). I am sensitive to the fact that this adds
> > > > > > to an existing difficult config. I propose that this is an
> > > > > > initial step forward to support the use-case and we can
> > > > > > enable something more composable going forward. My concern in
> > > > > > considering this as the first step was that it felt that the
> > > > > > composable units for data transformation and manipulation
> > > > > > suddenly take us into a place where Stellar starts to look
> > > > > > like Pig or Spark RDD API. I wasn't ready for that without a
> > > > > > lot more discussion.
> > > > > >
> > > > > > To summarize, what I'd like to get from the community is,
> > > > > > after reviewing the entire use-case at
> > > > > > https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> > > > > >
> > > > > >    - Is this so confusing that it does not belong in Metron
> > > > > >      even as a first-step?
> > > > > >    - Is there a way to extend the extractor config in a less
> > > > > >      confusing way to enable this?
> > > > > >
> > > > > > I apologize for making the discuss thread *after* the JIRAs,
> > > > > > but I felt this one might bear having some working code to
> > > > > > consider.
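
The initialize / update / merge steps discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Metron's code: the BloomFilter class, typosquat_variants (a stand-in for DOMAIN_TYPOSQUAT that only drops single characters), summarize_chunk and summarize are all hypothetical names, and a thread pool stands in for the multi-threaded local mode purely to show where the merge step fits.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class BloomFilter:
    """Tiny illustrative bloom filter (hypothetical, not Metron's implementation)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # The union of two filters with identical parameters is a bitwise OR.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("cannot merge filters with different parameters")
        self.bits |= other.bits
        return self


def typosquat_variants(domain):
    """Assumed stand-in for DOMAIN_TYPOSQUAT: single-character deletions only."""
    return {domain[:i] + domain[i + 1:] for i in range(len(domain))}


def summarize_chunk(domains):
    state = BloomFilter()                 # initialize: once per worker
    for domain in domains:                # update: once per row
        for variant in typosquat_variants(domain):
            state.add(variant)            # variants only, never the base domain
    return state


def summarize(domains, workers=2):
    # Split the rows across workers, build one partial state per worker,
    # then merge the partial states into a single summary object.
    chunks = [domains[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize_chunk, chunks))
    merged = BloomFilter()
    for partial in partials:              # merge: combine worker states
        merged.merge(partial)
    return merged
```

With this sketch, summarize(["google", "metron"]).might_contain("gogle") reports a hit, while the base name "google" is never added to the filter, which mirrors the point made earlier in the thread that the legitimate domain must not be in the summary. As with any bloom filter, a lookup can still return a rare false positive.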