I'm not sure I follow what you're saying as it pertains to summary objects. "Repository" is a loaded term, and I'm very apprehensive about pushing for something potentially very complex where a simpler solution would suffice in the short term. To wit, the items I'm seeing in this use-case doc - https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection - don't preclude the 4 capabilities you've enumerated. Am I missing something, or can you provide more context?

My best guess is that rather than referring to a specific HDFS path for a serialized object, you're suggesting we provide a more abstract method for serializing/deserializing objects to/from a variety of sources. Am I in the ballpark? I'd be in favor of expanding functionality for such a thing provided a sensible default (i.e. HDFS) is provided in the short term.
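To make sure we're talking about the same thing, here is a minimal sketch of how that abstraction might look from the Stellar side. It assumes the OBJECT_GET function from the use-case doc above plus a hypothetical scheme-based resolver behind it; only the plain-HDFS-path behavior exists in the current work, the hdfs:// and file:// variants are illustrative assumptions, and all paths are made up:

    # Today: load a serialized summary object from an HDFS path (path illustrative)
    filter := OBJECT_GET('/apps/metron/objects/alexa_10k_filter.ser')

    # Hypothetical extension: pick the backing store from a URI scheme,
    # defaulting to HDFS when no scheme is given
    filter := OBJECT_GET('hdfs:///apps/metron/objects/alexa_10k_filter.ser')
    filter := OBJECT_GET('file:///opt/reference/alexa_10k_filter.ser')

    # Consumers are unchanged regardless of where the object came from
    BLOOM_EXISTS(filter, DOMAIN_REMOVE_TLD(domain))

If that's the shape of it, the abstraction is just a resolver behind OBJECT_GET, and HDFS stays the sensible default.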
On Fri, Jan 5, 2018 at 8:26 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:

> If we separate the concerns as I have stated previously:
>
> 1. Stellar can load objects into ‘caches’ from some repository and refer to them.
> 2. The repositories
> 3. Some number of strategies to populate and possibly update the repository, from Spark, to MR jobs, to whatever you would classify the flat file stuff as.
> 4. Let the Stellar API for everything but LOAD() follow after we get usage
>
> Then the particulars of ‘3’ are less important.
>
> On January 5, 2018 at 09:02:41, Justin Leet (justinjl...@gmail.com) wrote:
>
> I agree with the general sentiment that we can tailor specific use cases via UI, and I'm worried that the use-case-specific solution (particularly in light of the note that it's not even general to the class of bloom filter problems, let alone an actually general problem) becomes more work than this as soon as about 2 more use cases actually get realized. Pushing that to the UI lets people solve a variety of problems if they really want to dig in, while still giving flexibility to provide a more tailored experience for what we discover the 80% cases are in practice.
>
> Keeping in mind I am mostly unfamiliar with the extractor config itself, I am wondering if it makes sense to split up the config a bit. While a lot of implementation details are shared, maybe the extractor config itself should be refactored into a couple parts analogous to ETL (as a follow-on task; I think if this is true, it predates Casey's proposed change). It doesn't necessarily make it less complex, but it might make it more easily digestible if it's split up by idea (parsing, transformation, etc.).
>
> Re: Mike's point, I don't think we want the actual processing broken up as ETL, but the representation to the user in terms of configuration could be similar (since we're already doing parsing and transformation). We don't have to implement it as an ETL pipeline, but it does potentially offer the user a way to quickly grasp what the JSON blob is actually specifying. Making it easy to understand, even if it's not the ideal way to interact, is potentially still a win.
>
> On Thu, Jan 4, 2018 at 1:28 PM, Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>
> > I mentioned this earlier, but I'll reiterate that I think this approach gives us the ability to make specific use cases via a UI, or other interface should we choose to add one, while keeping the core adaptable and flexible. This is ideal for the middle tier, as I think this effectively gives us the ability to pivot to other use cases very easily while not being so generic as to be useless. The fact that you were able to create this as quickly as you did seems to me directly related to the fact we made the decision to keep the loader somewhat flexible rather than very specific. The operation ordering and state carry from one phase of processing to the next would simply have been inscrutable, if not impossible, with a CLI option-only approach. Sure, it's not as simple as "put infile.txt outfile.txt", but the alternatives are not that clear either. One might argue we could split up the processing pieces as in traditional Hadoop, e.g. ETL: Sqoop ingest -> HDFS -> MapReduce, Pig, Hive, or Spark transform. But quite frankly that's going in the *opposite* direction I think we want here. That's more complex in terms of moving parts.
> > The config approach with pluggable Stellar insulates users from specific implementations, but also gives you the ability to pass lower-level constructs, e.g. Spark SQL or HiveQL, should the need arise.
> >
> > In summary, my impressions are that at this point the features and level of abstraction feel appropriate to me. I think it buys us 1) learning from a starting typosquatting use case, 2) flexibility to change and adapt it without affecting users, and 3) enough concrete capability to make more specific use cases easy to deliver with a UI.
> >
> > Cheers,
> > Mike
> >
> > On Jan 4, 2018 9:59 AM, "Casey Stella" <ceste...@gmail.com> wrote:
> >
> > > It also occurs to me that even in this situation, it's not a sufficient generalization for just Bloom, but this is a bloom filter of the output of all the typosquatted domains for the domain in each row. If we wanted to hard-code, we'd have to hard-code specifically the bloom filter *for* the typosquatting use-case. Hard-coding this would prevent things like bloom filters containing malicious IPs from a reference source, for instance.
> > >
> > > On Thu, Jan 4, 2018 at 10:46 AM, Casey Stella <ceste...@gmail.com> wrote:
> > >
> > > > So, there is value outside of just bloom usage. The most specific example of this would be that in order to configure a bloom filter, we need to know at least an upper bound on the number of items that are going to be added to the bloom filter. In order to do that, we need to count the number of typosquatted domains. Specifically, at https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#configure-the-bloom-filter you can see how we use the CONSOLE writer with an extractor config to count the number of typosquatted domains in the Alexa top 10k dataset so we can size the filter appropriately.
> > > >
> > > > I'd argue that other types of probabilistic data structures could also make sense here as well, like statistical sketches. Consider, for instance, a cheap-and-dirty DGA indicator where we take the Alexa top 1M and look at the distribution of Shannon entropy in the domains. If the Shannon entropy of a domain going across Metron is more than 5 std devs from the mean, that could be circumstantial evidence of a malicious attack. This would yield a lot of false positives, but used in conjunction with other indicators it could be valuable.
> > > >
> > > > Computing that would be as follows:
> > > >
> > > > {
> > > >   "config" : {
> > > >     "columns" : {
> > > >       "rank" : 0,
> > > >       "domain" : 1
> > > >     },
> > > >     "value_transform" : {
> > > >       "domain" : "DOMAIN_REMOVE_TLD(domain)"
> > > >     },
> > > >     "value_filter" : "LENGTH(domain) > 0",
> > > >     "state_init" : "STATS_INIT()",
> > > >     "state_update" : {
> > > >       "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
> > > >     },
> > > >     "state_merge" : "STATS_MERGE(states)",
> > > >     "separator" : ","
> > > >   },
> > > >   "extractor" : "CSV"
> > > > }
> > > >
> > > > Also, for another example, imagine a situation where we have a SPARK_SQL engine rather than just LOCAL for summarizing.
> > > > We could create a general summary of URL lengths in Bro data which could be used for determining if someone is trying to send in very large URLs maliciously (see Jon Zeolla's concerns in https://issues.apache.org/jira/browse/METRON-517 for a discussion of this). In order to do that, we could simply execute:
> > > >
> > > > $METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" -o /tmp/reference/bro_uri_distribution.ser -e ~/uri_length_extractor.json -p 5 -om HDFS -m SPARK_SQL
> > > >
> > > > with uri_length_extractor.json containing:
> > > >
> > > > {
> > > >   "config" : {
> > > >     "value_filter" : "LENGTH(uri) > 0",
> > > >     "state_init" : "STATS_INIT()",
> > > >     "state_update" : {
> > > >       "state" : "STATS_ADD(state, LENGTH(uri))"
> > > >     },
> > > >     "state_merge" : "STATS_MERGE(states)",
> > > >     "separator" : ","
> > > >   },
> > > >   "extractor" : "SQL_ROW"
> > > > }
> > > >
> > > > Regarding value_filter: that's already around in the extractor config because of the need to transform data in the flatfile loader. While I definitely see the desire to use Unix tools to prep data, there are some things that aren't as easy to do. For instance, here, removing the TLD of a domain is not a trivial task in a shell script, and we have existing functions for that in Stellar. I would see people using both.
> > > >
> > > > To address the issue of a more targeted experience for bloom, I think that sort of specialization should best exist in the UI layer. Having a more complete and expressive backend reused across specific UIs seems to be the best of all worlds. It allows power users to drop down and do more complex things and still provides a (mostly) code-free and targeted experience for users. It seems to me that limiting the expressibility in the backend isn't the right way to go, since this work just fits in with our existing engine.
> > > >
> > > > On Thu, Jan 4, 2018 at 1:40 AM, James Sirota <jsir...@apache.org> wrote:
> > > >
> > > >> I just went through these pull requests as well and also agree this is good work. I think it's a good first pass. I would be careful with trying to boil the ocean here. I think for the initial use case I would only support loading the bloom filters from HDFS. If people want to pre-process the CSV file of domains using awk or sed, this should be out of scope of this work. It's easy enough to do out of band, and I would not include any of these functions at all. I also think that the config could be considerably simplified. I think value_filter should be removed (since I believe that preprocessing should be done by the user outside of this process). I also have a question about the init, update, and merge configurations. Would I ever initialize to anything but an empty bloom filter? For the state update, would I ever do anything other than add to the bloom filter? For the state merge, would I ever do anything other than merge the states? If the answer to these is 'no', then this should simply be hard-coded and not externalized into config values.
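For reference, the bloom-filter case James is asking about uses the same three hooks as the STATS example above. A sketch of that config, patterned after the use-case doc; the BLOOM_* and DOMAIN_TYPOSQUAT function names and the sizing arguments are assumptions here, with expected_insertions and false_positive_rate standing in for concrete values:

    {
      "config" : {
        "columns" : {
          "rank" : 0,
          "domain" : 1
        },
        "value_transform" : {
          "domain" : "DOMAIN_REMOVE_TLD(domain)"
        },
        "value_filter" : "LENGTH(domain) > 0",
        "state_init" : "BLOOM_INIT(expected_insertions, false_positive_rate)",
        "state_update" : {
          "state" : "BLOOM_ADD(state, DOMAIN_TYPOSQUAT(domain))"
        },
        "state_merge" : "BLOOM_MERGE(states)",
        "separator" : ","
      },
      "extractor" : "CSV"
    }

Swapping only the three state expressions turns the entropy summary above into a bloom-filter build, which is the crux of the argument for keeping them configurable rather than hard-coded.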
> > > >> 03.01.2018, 14:20, "Michael Miklavcic" <michael.miklav...@gmail.com>:
> > > >>
> > > >> > I just finished stepping through the typosquatting use case README in your merge branch. This is really, really good work, Casey. I see most of our previous documentation issues addressed up front, e.g. special variables are cited, all new fields explained, side effects documented. The use-case doc brings it all together soup-to-nuts, and I think all the pieces make sense in a mostly self-contained way. I can't think of anything I had to sit and think about for more than a few seconds. I'll be making my way through your individual PRs in more detail, but my first impressions are that this is excellent.
> > > >> >
> > > >> > On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> > > >> >
> > > >> >> I'm liking this design and growth strategy, Casey. I also think Nick and Otto have some valid points. I always find there's a natural tension between too little, just enough, and boiling the ocean, and these discuss threads really help drive what the short- and long-term visions should look like.
> > > >> >>
> > > >> >> On the subject of repositories and strategies, I agree that pluggable repos and strategies for modifying them would be useful. For the first pass, I'd really like to see HDFS with the proposed set of Stellar functions. This gives us a lot of bang for our buck - we can capitalize on a set of powerful features around existence checking earlier without having to worry about later interface changes impacting users. With the primary interface coming through the JSON config, we are building a nice facade that protects users from later implementation abstractions and improvements, all while providing a stable enough interface on which we can develop UI features as desired. I'd be interested to hear more about what features could be provided by a repository as time goes by. Federation, permissions, governance, metadata management, perhaps?
> > > >> >>
> > > >> >> I also had some concern over duplicating existing Unix features. I think where I'm at has been largely addressed by Casey's comments on 1) scaling, 2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches - 1 which is config-based and the other a composable set of functions - gives us the ability to provide a core set of features that can later be easily expanded by users as the need arises. Here again I think the prescribed approach provides a strong first pass that we can then expand on without concern of future improvements becoming a hassle for end users.
> > > >> >>
> > > >> >> Best,
> > > >> >> Mike
> > > >> >>
> > > >> >> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > >> >>
> > > >> >>> There is some really cool stuff happening here, if only I’d been allowed to see the lists over Christmas... :)
> > > >> >>>
> > > >> >>> A few thoughts...
> > > >> >>>
> > > >> >>> I like Otto’s generalisation of the problem to include specific local Stellar objects in a cache loaded from a store (HDFS seems a natural, but not the only, place; maybe even a web service / local microservicey object provider!?). That said, I suspect that’s a good platform optimisation approach. Should we look at this as a separate piece of work, given it extends beyond the scope of the summarisation concept, and ultimately use it as a back-end to feed the summarising engine proposed here for the enrichment loader?
> > > >> >>>
> > > >> >>> On the more specific use case, one thing I would comment on is the configuration approach. The iteration loop (state_{init|update|merge}) should be consistent with the way we handle things like the profiler config, since it’s the same approach to data handling.
> > > >> >>>
> > > >> >>> The other thing that seems to have crept in here is the interface to something like Spark, which again, I am really very, very keen on seeing happen. That said, I'm not sure how that would happen in this context, unless you’re talking about pushing to something like Livy, for example (eminently sensible for things like cross-instance caching and faster RPC-ish access to an existing Spark context, which seems to be what Casey is driving at with the Spark piece).
> > > >> >>>
> > > >> >>> To address the question of text manipulation in Stellar / Metron enrichment ingest etc., we already have this outside of the context of the issues here. I would argue that yes, we don’t want too many paths for this, and that maybe our parser approach might be heavily related to text-based ingest. I would say the scope worth dealing with here, though, is not really text manipulation but summarisation, which is not well served by existing CLI tools like awk / sed and friends.
> > > >> >>>
> > > >> >>> Simon
> > > >> >>>
> > > >> >>> > On 3 Jan 2018, at 15:48, Nick Allen <n...@nickallen.org> wrote:
> > > >> >>> >
> > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think this will impact performance
> > > >> >>> >
> > > >> >>> > What exactly takes an hour? Adding 1M entries to a bloom filter? That seems really high, unless I am not understanding something.
> > > >> >>> >
> > > >> >>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > >> >>> >
> > > >> >>> >> Thanks for the feedback, Nick.
> > > >> >>> >>
> > > >> >>> >> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
> > > >> >>> >>
> > > >> >>> >> I would argue that we are not reinventing the wheel for text manipulation, as the extractor config exists already and we are doing a similar thing in the flatfile loader (in fact, the code is reused and merely extended). Transformation operations are already supported in our codebase in the extractor config; this PR has just added some hooks for stateful operations.
> > > >> >>> >>
> > > >> >>> >> Furthermore, we will need a configuration object to pass to the REST call if we are ever to create a UI around importing data into HBase or creating these summary objects.
> > > >> >>> >>
> > > >> >>> >> Regarding your example:
> > > >> >>> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>
> > > >> >>> >> I'm very sympathetic to this type of extension, but it has some issues:
> > > >> >>> >>
> > > >> >>> >> 1. This implies a single-threaded addition to the bloom filter.
> > > >> >>> >>    1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I think this will impact performance.
> > > >> >>> >>    2. There's not a way to specify how to merge across threads if we do make a multithreaded command line option.
> > > >> >>> >> 2. This restricts these kinds of operations to roles with heavy Unix CLI knowledge, which isn't often the type of person who would be doing this type of operation.
> > > >> >>> >> 3. What if we need two variables passed to Stellar?
> > > >> >>> >> 4. This approach will be harder to move to Hadoop. Eventually we will want to support data on HDFS being processed by Hadoop (similar to the flatfile loader), so instead of -m LOCAL being passed for the flatfile summarizer you'd pass -m SPARK and the processing would happen on the cluster.
> > > >> >>> >>    1. This is particularly relevant in this case, as it's an embarrassingly parallel problem in general.
> > > >> >>> >>
> > > >> >>> >> In summary, while a CLI approach is attractive, I prefer the extractor config solution because it is the solution with the smallest iteration that:
> > > >> >>> >>
> > > >> >>> >> 1. Reuses existing Metron extraction infrastructure
> > > >> >>> >> 2. Provides the most solid base for the extensions that will be sorely needed soon (and will keep it in parity with the flatfile loader)
> > > >> >>> >> 3. Provides the most solid base for a future UI extension in the management UI to support both summarization and loading
> > > >> >>> >>
> > > >> >>> >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <n...@nickallen.org> wrote:
> > > >> >>> >>
> > > >> >>> >>> First off, I really do like the typosquatting use case and a lot of what you have described.
> > > >> >>> >>>
> > > >> >>> >>>> We need a way to generate the summary sketches from flat data for this to work.
> > > >> >>> >>>> ..
> > > >> >>> >>>
> > > >> >>> >>> I took this quote directly from your use case. Above is the point that I'd like to discuss and what your proposed solutions center on. This is what I think you are trying to do, at least with PR #879 <https://github.com/apache/metron/pull/879>...
> > > >> >>> >>>
> > > >> >>> >>> (Q) Can we repurpose Stellar functions so that they can operate on text stored in a file system?
> > > >> >>> >>>
> > > >> >>> >>> Whether we use the (1) Configuration or the (2) Function-based approach that you described, fundamentally we are introducing new ways to perform text manipulation inside of Stellar.
> > > >> >>> >>>
> > > >> >>> >>> IMHO, I'd rather not reinvent the wheel for text manipulation. It would be painful to implement and maintain a bunch of Stellar functions for text manipulation. People already have a large number of tools available to do this, and everyone has their favorites. People are resistant to learning something new when they are already familiar with another way to do the same thing.
> > > >> >>> >>>
> > > >> >>> >>> So then the question is: how else can we do this? My suggestion is that rather than introducing text manipulation tools inside of Stellar, we allow people to use the text manipulation tools they already know, but with the Stellar functions that we already have. And the obvious way to tie those two things together is the Unix pipeline.
> > > >> >>> >>>
> > > >> >>> >>> A quick, albeit horribly incomplete, example to flesh this out a bit more, based on the example you have in PR #879 <https://github.com/apache/metron/pull/879>. This would allow me to integrate Stellar with whatever external tools I want.
> > > >> >>> >>>
> > > >> >>> >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>>
> > > >> >>> >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > >> >>> >>>
> > > >> >>> >>>> I'll start this discussion off with my idea around a 2nd step that is more adaptable. I propose the following set of Stellar functions backed by Spark in the metron-management project:
> > > >> >>> >>>>
> > > >> >>> >>>> - CSV_PARSE(location, separator?, columns?): Constructs a Spark DataFrame for reading the flat file
> > > >> >>> >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the dataframe
> > > >> >>> >>>> - SUMMARIZE(state_init, state_update, state_merge): Summarizes the dataframe using the lambda functions:
> > > >> >>> >>>>   - state_init - executed once per worker to initialize the state
> > > >> >>> >>>>   - state_update - executed once per row
> > > >> >>> >>>>   - state_merge - merges the worker states into one worker state
> > > >> >>> >>>> - OBJECT_SAVE(obj, output_path): Saves the object obj to the path output_path on HDFS.
> > > >> >>> >>>>
> > > >> >>> >>>> This would enable more flexibility and composability than the configuration-based approach that we have in the flatfile loader. My concern with this approach, and the reason I didn't do it initially, was that I think that users will want at least 2 ways to summarize data (or load data):
> > > >> >>> >>>>
> > > >> >>> >>>> - A configuration-based approach, which enables a UI
> > > >> >>> >>>> - A set of Stellar functions via the scriptable REPL
> > > >> >>> >>>>
> > > >> >>> >>>> I would argue that both have a place, and I started with the configuration-based approach as it was a more natural extension of what we already had. I'd love to hear thoughts about this idea too.
> > > >> >>> >>>>
> > > >> >>> >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > >> >>> >>>>
> > > >> >>> >>>>> Hi all,
> > > >> >>> >>>>>
> > > >> >>> >>>>> I wanted to get some feedback on a sensible plan for something. It occurred to me the other day, when considering the use-case of detecting typosquatted domains, that one approach was to generate the set of typosquatted domains for some set of reference domains and compare domains as they flow through.
> > > >> >>> >>>>>
> > > >> >>> >>>>> One way we could do this would be to generate this data and import the typosquatted domains into HBase. I thought, however, that another approach, which may trade off some accuracy, would be to remove the network hop and potential disk seek by constructing a bloom filter that includes the set of typosquatted domains.
> > > >> >>> >>>>>
> > > >> >>> >>>>> The challenge was that we don't have a way to do this currently.
> > > >> >>> >>>>> We do, however, have a loading infrastructure (e.g. the flatfile_loader) and configuration (see https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties) which handles:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - parsing flat files
> > > >> >>> >>>>> - transforming the rows
> > > >> >>> >>>>> - filtering the rows
> > > >> >>> >>>>>
> > > >> >>> >>>>> To enable the new use-case of generating a summary object (e.g. a bloom filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I propose that we create a new utility that uses the same extractor config and adds the ability to:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - initialize a state object
> > > >> >>> >>>>> - update the object for every row
> > > >> >>> >>>>> - merge the state objects (in the case of multiple threads; in the case of one thread it's not needed)
> > > >> >>> >>>>>
> > > >> >>> >>>>> I think this is a sensible decision because:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - It's a minimal movement from the flat file loader
> > > >> >>> >>>>> - It uses the same configs
> > > >> >>> >>>>> - It abstracts and reuses the existing infrastructure
> > > >> >>> >>>>> - Having one extractor config means that it should be easier to generate a UI around this to simplify the experience
> > > >> >>> >>>>>
> > > >> >>> >>>>> All that being said, our extractor config is... shall we say... daunting :). I am sensitive to the fact that this adds to an existing difficult config. I propose that this is an initial step forward to support the use-case and we can enable something more composable going forward. My concern in considering this as the first step was that it felt that the composable units for data transformation and manipulation suddenly take us into a place where Stellar starts to look like Pig or the Spark RDD API. I wasn't ready for that without a lot more discussion.
> > > >> >>> >>>>>
> > > >> >>> >>>>> To summarize, what I'd like to get from the community is, after reviewing the entire use-case at https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - Is this so confusing that it does not belong in Metron even as a first step?
> > > >> >>> >>>>> - Is there a way to extend the extractor config in a less confusing way to enable this?
> > > >> >>> >>>>>
> > > >> >>> >>>>> I apologize for making the discuss thread *after* the JIRAs, but I felt this one might bear having some working code to consider.
> > > >>
> > > >> -------------------
> > > >> Thank you,
> > > >>
> > > >> James Sirota
> > > >> PMC - Apache Metron
> > > >> jsirota AT apache DOT org
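A footnote on Casey's Dec 24 proposal above: none of the four Spark-backed functions exist yet, but a hypothetical REPL session composing them, with signatures as sketched in that email (the argument order and the leading dataframe argument to SUMMARIZE are assumptions, and the paths are made up), might look like:

    # Read the flat file into a Spark DataFrame (proposed CSV_PARSE)
    df := CSV_PARSE('/data/top-1m.csv', ',', ['rank', 'domain'])
    # Filter and transform with Spark SQL (proposed SQL_TRANSFORM)
    df := SQL_TRANSFORM(df, "select DOMAIN_REMOVE_TLD(domain) as domain from df where length(domain) > 0")
    # Summarize with the same init/update/merge lambdas the extractor config uses
    summary := SUMMARIZE(df, 'STATS_INIT()', 'STATS_ADD(state, STRING_ENTROPY(domain))', 'STATS_MERGE(states)')
    # Persist the summary object to HDFS for enrichment-time lookup
    OBJECT_SAVE(summary, '/tmp/reference/domain_entropy.ser')

This is the function-based path to the same entropy summary that the extractor config computes earlier in the thread, which is why both approaches can share the state_init/state_update/state_merge vocabulary.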