I'll start this discussion off with my idea for a second step that is more adaptable. I propose the following set of Stellar functions backed by Spark in the metron-management project:
- CSV_PARSE(location, separator?, columns?): Constructs a Spark DataFrame
  for reading the flat file
- SQL_TRANSFORM(dataframe, spark sql statement): Transforms the DataFrame
- SUMMARIZE(state_init, state_update, state_merge): Summarizes the
  DataFrame using the lambda functions:
   - state_init - executed once per worker to initialize the state
   - state_update - executed once per row
   - state_merge - merges the worker states into one state
- OBJECT_SAVE(obj, output_path): Saves the object obj to the path
  output_path on HDFS.

This would enable more flexibility and composability than the
configuration-based approach that we have in the flatfile loader.

My concern with this approach, and the reason I didn't do it initially, is
that I think users will want at least 2 ways to summarize data (or load
data):

- A configuration-based approach, which enables a UI
- A set of Stellar functions via the scriptable REPL

I would argue that both have a place, and I started with the
configuration-based approach as it was a more natural extension of what we
already had.

I'd love to hear thoughts about this idea too.

On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:

> Hi all,
>
> I wanted to get some feedback on a sensible plan for something. It
> occurred to me the other day, when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare
> domains as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase. I thought, however, that another
> approach, which may trade off accuracy to remove the network hop and
> potential disk seek, is to construct a bloom filter that includes the
> set of typosquatted domains.
>
> The challenge was that we don't have a way to do this currently. We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see
> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> which handles:
>
>    - parsing flat files
>    - transforming the rows
>    - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
>    - initialize a state object
>    - update the object for every row
>    - merge the state objects (in the case of multiple threads; in the
>    case of one thread it's not needed).
>
> I think this is a sensible decision because:
>
>    - It's a minimal movement from the flat file loader
>       - Uses the same configs
>       - Abstracts and reuses the existing infrastructure
>    - Having one extractor config means that it should be easier to
>    generate a UI around this to simplify the experience
>
> All that being said, our extractor config is..shall we say...daunting :).
> I am sensitive to the fact that this adds to an existing difficult
> config. I propose that this is an initial step forward to support the
> use-case and we can enable something more composable going forward. My
> concern in considering this as the first step was that it felt that the
> composable units for data transformation and manipulation suddenly take
> us into a place where Stellar starts to look like Pig or the Spark RDD
> API. I wasn't ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community is, after reviewing
> the entire use-case at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>
>    - Is this so confusing that it does not belong in Metron even as a
>    first-step?
>    - Is there a way to extend the extractor config in a less confusing
>    way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
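
For anyone following the thread, the init/update/merge contract discussed above (state_init once per worker, state_update once per row, state_merge to combine worker states) can be sketched roughly as follows. This is a minimal illustration in plain Python, not Stellar and not the METRON-1378 code: the function bodies, NUM_BITS, NUM_HASHES, the int-as-bitset bloom filter, the summarize driver, and the sample domains are all assumptions made up for the example.

```python
import hashlib
from functools import reduce

# Hypothetical sizing, chosen only for illustration.
NUM_BITS = 1 << 16
NUM_HASHES = 4


def _positions(item):
    # Derive NUM_HASHES bit positions from salted SHA-256 digests.
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def state_init():
    # Executed once per worker: an empty bloom filter (an int used as a bitset).
    return 0


def state_update(state, row):
    # Executed once per row: set the bits for this row.
    for pos in _positions(row):
        state |= 1 << pos
    return state


def state_merge(left, right):
    # Combine two worker states: the union of two bloom filters is a bitwise OR.
    return left | right


def might_contain(state, item):
    # True if item may have been added; false positives are possible.
    return all((state >> pos) & 1 for pos in _positions(item))


def summarize(partitions):
    # Each partition stands in for the rows handled by one worker.
    partials = []
    for rows in partitions:
        state = state_init()
        for row in rows:
            state = state_update(state, row)
        partials.append(state)
    return reduce(state_merge, partials, state_init())


# Two "workers", each seeing a slice of made-up typosquatted domains.
partitions = [["gooogle.com", "g00gle.com"], ["gogle.com"]]
summary = summarize(partitions)
```

Because the merge here is a bitwise OR, the result does not depend on how the rows are split across workers, which is the property that lets the merge step be skipped in the single-threaded case.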