Any volunteers for creating a set of jiras and feature branch for an object store repository? This sounds like a massive feature.
On Jan 5, 2018 2:06 PM, "Otto Fowler" <ottobackwa...@gmail.com> wrote: > I would say that at the stellar author level, you would just get objects > from the store and the ‘override’ case would be a follow on for edge cases. > > > On January 5, 2018 at 14:29:16, Casey Stella (ceste...@gmail.com) wrote: > > Well, you can pull the default configs from global configs, but you might > want to override them (similar to the profiler). For instance, you might > want to interact with another hbase table than the one globally configured. > > On Fri, Jan 5, 2018 at 12:04 PM, Otto Fowler <ottobackwa...@gmail.com> > wrote: > > > I would imagine the ‘stellar-object-repo’ would be part of the global > > configuration or configuration passed to the command. > > why specify in the function itself? > > > > > > > > > > On January 5, 2018 at 11:22:32, Casey Stella (ceste...@gmail.com) wrote: > > > > I like that, specifically the repositories abstraction. Perhaps we can > > construct some longer term JIRAs for extensions. > > For the current state of affairs (wrt to the OBJECT_GET call) I was > > imagining the simple default HDFS solution as a first cut and > > following on adding a repository name (e.g. OBJECT_GET(path, repo_name) > > with repo_name being optional and defaulting to HDFS > > for backwards compatibility. > > > > In effect, this would be the next step that I'm proposing > OBJECT_GET(paths, > > repo_name, repo_config) which would be backwards compatible > > > > - paths - a single path or a list of paths (if a list, then a list of > > objects returned) > > - repo_name - optional name for repo, defaulted to HDFS if we don't > > specify > > - repo_config - optional config map > > > > > > This would open things like: > > > > - OBJECT_GET('key', 'HBASE', { 'hbase.table' : 'table', 'hbase.cf' : > > 'cf'} ) -- pulling from HBase > > > > Eventually we might also be able to fold ENRICHMENT_GET as just a special > > repo instance. 
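The repository-dispatch shape behind the proposed `OBJECT_GET(paths, repo_name, repo_config)` can be sketched as follows. This is an editorial illustration only, not Metron code; the loader functions and their return values are hypothetical stand-ins, and the only behavior taken from the thread is the signature: a single path or a list of paths, an optional repository name defaulting to HDFS for backwards compatibility, and an optional config map.

```python
from typing import Any, Callable, Dict, List, Optional, Union

# Hypothetical repository loaders; real implementations would deserialize
# an object from HDFS or fetch a row from an HBase table/column family.
REPOSITORIES: Dict[str, Callable[[str, dict], Any]] = {
    "HDFS": lambda path, cfg: f"<object deserialized from hdfs://{path}>",
    "HBASE": lambda key, cfg: (
        f"<object from table {cfg.get('hbase.table')}, "
        f"cf {cfg.get('hbase.cf')}, row {key}>"
    ),
}

def object_get(paths: Union[str, List[str]],
               repo_name: str = "HDFS",
               repo_config: Optional[dict] = None) -> Any:
    """Resolve a repository by name (default HDFS) and load one or many
    objects; a list of paths yields a list of objects."""
    cfg = repo_config or {}
    loader = REPOSITORIES[repo_name]
    if isinstance(paths, list):
        return [loader(p, cfg) for p in paths]
    return loader(paths, cfg)
```

Under this shape, `OBJECT_GET('key', 'HBASE', {'hbase.table': 'table', 'hbase.cf': 'cf'})` is just dispatch plus config, which is why ENRICHMENT_GET could eventually fold in as one more named repository.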
> > > > On Fri, Jan 5, 2018 at 10:26 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:
> > >
> > > If we separate the concerns as I have stated previously:
> > >
> > > 1. Stellar can load objects into ‘caches’ from some repository and refer to them.
> > > 2. The repositories
> > > 3. Some number of strategies to populate and possibly update the repository, from Spark to MR jobs to whatever you would classify the flat file stuff as.
> > > 4. Let the Stellar API for everything but LOAD() follow after we get usage
> > >
> > > Then the particulars of ‘3’ are less important.
> > >
> > > On January 5, 2018 at 09:02:41, Justin Leet (justinjl...@gmail.com) wrote:
> > >
> > > I agree with the general sentiment that we can tailor specific use cases via UI, and I'm worried that the use-case-specific solution (particularly in light of the note that it's not even general to the class of bloom filter problems, let alone an actually general problem) becomes more work than this as soon as about 2 more use cases actually get realized. Pushing that to the UI lets people solve a variety of problems if they really want to dig in, while still giving flexibility to provide a more tailored experience for what we discover the 80% cases are in practice.
> > >
> > > Keeping in mind I am mostly unfamiliar with the extractor config itself, I am wondering if it makes sense to split up the config a bit. While a lot of implementation details are shared, maybe the extractor config itself should be refactored into a couple parts analogous to ETL (as a follow-on task; I think if this is true, it predates Casey's proposed change). It doesn't necessarily make it less complex, but it might make it more easily digestible if it's split up by idea (parsing, transformation, etc.).
> > > > > > Re: Mike's point, I don't think we want the actual processing broken up > > as > > > ETL, but the representation to the user in terms of configuration could > > be > > > similar (Since we're already doing parsing and transformation). We > don't > > > have to implement it as an ETL pipeline, but it does potentially offer > > the > > > user a way to quickly grasp what the JSON blob is actually specifying. > > > Making it easy to understand, even if it's not the ideal way to > interact > > is > > > potentially still a win. > > > > > > On Thu, Jan 4, 2018 at 1:28 PM, Michael Miklavcic < > > > michael.miklav...@gmail.com> wrote: > > > > > > > I mentioned this earlier, but I'll reiterate that I think this > approach > > > > gives us the ability to make specific use cases via a UI, or other > > > > interface should we choose to add one, while keeping the core > adaptable > > > and > > > > flexible. This is ideal for middle tier as I think this effectively > > gives > > > > us the ability to pivot to other use cases very easily while not > being > > so > > > > generic as to be useless. The fact that you were able to create this > as > > > > quickly as you did seems to me directly related to the fact we made > the > > > > decision to keep the loader somewhat flexible rather than very > > specific. > > > > The operation ordering and state carry from one phase of processing > to > > > the > > > > next would simply have been inscrutable, if not impossible, with a > CLI > > > > option-only approach. Sure, it's not as simple as "put infile.txt > > > > outfile.txt", but the alternatives are not that clear either. One > might > > > > argue we could split up the processing pieces as in traditional > Hadoop, > > > eg > > > > ETL: Sqoop ingest -> HDFS -> mapreduce, pig, hive, or spark > transform. > > > But > > > > quite frankly that's going in the *opposite* direction I think we > want > > > > here. That's more complex in terms of moving parts. 
The config > approach > > > > with pluggable Stellar insulates users from specific implementations, > > but > > > > also gives you the ability to pass lower level constructs, eg Spark > SQL > > > or > > > > HiveQL, should the need arise. > > > > > > > > In summary, my impressions are that at this point the features and > > level > > > of > > > > abstraction feel appropriate to me. I think it buys us 1) learning > > from a > > > > starting typosquatting use case, 2) flexibility to change and adapt > it > > > > without affecting users, and 3) enough concrete capability to make > more > > > > specific use cases easy to deliver with a UI. > > > > > > > > Cheers, > > > > Mike > > > > > > > > On Jan 4, 2018 9:59 AM, "Casey Stella" <ceste...@gmail.com> wrote: > > > > > > > > > It also occurs to me that even in this situation, it's not a > > sufficient > > > > > generalization for just Bloom, but this is a bloom filter of the > > output > > > > of > > > > > the all the typosquatted domains for the domain in each row. If we > > > > wanted > > > > > to hard code, we'd have to hard code specifically the bloom filter > > > *for* > > > > > typosquatting use-case. Hard coding this would prevent things like > > > bloom > > > > > filters containing malicious IPs from a reference source, for > > instance. > > > > > > > > > > On Thu, Jan 4, 2018 at 10:46 AM, Casey Stella <ceste...@gmail.com> > > > > wrote: > > > > > > > > > > > So, there is value outside of just bloom usage. The most specific > > > > > example > > > > > > of this would be in order to configure a bloom filter, we need to > > > know > > > > at > > > > > > least an upper bound of the number of items that are going to be > > > added > > > > to > > > > > > the bloom filter. In order to do that, we need to count the > number > > of > > > > > > typosquatted domains. 
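As an aside on why the counting step matters: a bloom filter's bit-array size and hash count are derived from the expected number of insertions, so an upper bound on the item count must exist before the filter is built. The standard sizing formulas (general bloom filter math, not Metron's implementation) can be sketched as:

```python
import math

def bloom_size(n_items: int, fp_rate: float) -> tuple:
    """Standard bloom filter sizing: for n expected items and target
    false-positive rate p, the optimal bit count is m = -n*ln(p)/(ln 2)^2
    and the optimal number of hash functions is k = (m/n)*ln 2."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# e.g. sizing for ~10k reference domains at a 1% false-positive rate
m, k = bloom_size(10_000, 0.01)  # -> m = 95851 bits (~12 KB), k = 7 hashes
```

Underestimating the count inflates the actual false-positive rate, which is why the summarizer's counting pass is a prerequisite rather than an optimization.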
> > > > > > Specifically at https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#configure-the-bloom-filter you can see how we use the CONSOLE writer with an extractor config to count the number of typosquatted domains in the Alexa top 10k dataset so we can size the filter appropriately.
> > > > > >
> > > > > > I'd argue that other types of probabilistic data structures could also make sense here as well, like statistical sketches. Consider, for instance, a cheap and dirty DGA indicator where we take the Alexa top 1M and look at the distribution of Shannon entropy in the domains. If the Shannon entropy of a domain going across Metron is more than 5 std devs from the mean, that could be circumstantial evidence of a malicious attack. This would yield a lot of false positives, but used in conjunction with other indicators it could be valuable.
> > > > > >
> > > > > > Computing that would be as follows:
> > > > > >
> > > > > > {
> > > > > >   "config" : {
> > > > > >     "columns" : {
> > > > > >       "rank" : 0,
> > > > > >       "domain" : 1
> > > > > >     },
> > > > > >     "value_transform" : {
> > > > > >       "domain" : "DOMAIN_REMOVE_TLD(domain)"
> > > > > >     },
> > > > > >     "value_filter" : "LENGTH(domain) > 0",
> > > > > >     "state_init" : "STATS_INIT()",
> > > > > >     "state_update" : {
> > > > > >       "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
> > > > > >     },
> > > > > >     "state_merge" : "STATS_MERGE(states)",
> > > > > >     "separator" : ","
> > > > > >   },
> > > > > >   "extractor" : "CSV"
> > > > > > }
> > > > > >
> > > > > > Also, for another example, imagine a situation where we have a SPARK_SQL engine rather than just LOCAL for summarizing.
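For illustration, the character-entropy quantity that config accumulates, and the "more than 5 std devs from the mean" check, can be sketched in plain Python. These functions are hypothetical stand-ins written for this sketch, not Stellar's STRING_ENTROPY or STATS implementations:

```python
import math
from collections import Counter
from statistics import mean, stdev

def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits; the quantity the
    STRING_ENTROPY-style check in the thread is built on."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def is_outlier(domain: str, reference: list, n_devs: float = 5.0) -> bool:
    """Flag a domain whose entropy lies more than n_devs standard
    deviations from the mean entropy of a reference corpus
    (e.g. the Alexa top 1M)."""
    ents = [shannon_entropy(d) for d in reference]
    mu, sd = mean(ents), stdev(ents)
    return abs(shannon_entropy(domain) - mu) > n_devs * sd
```

In the real pipeline the mean and standard deviation would come from the summarized STATS object loaded at enrichment time, so only `shannon_entropy(domain)` is computed per message.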
We could create a > > > > general > > > > > > summary of URL lengths in bro data which could be used for > > > determining > > > > if > > > > > > someone is trying to send in very large URLs maliciously (see Jon > > > > > Zeolla's > > > > > > concerns in https://issues.apache.org/jira/browse/METRON-517 > for a > > > > > > discussion of this). In order to do that, we could simply > execute: > > > > > > > > > > > > $METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" > > -o > > > > > /tmp/reference/bro_uri_distribution.ser -e > > ~/uri_length_extractor.json > > > > -p > > > > > 5 -om HDFS -m SPARK_SQL > > > > > > > > > > > > with uri_length_extractor.json containing: > > > > > > > > > > > > { > > > > > > "config" : { > > > > > > "value_filter" : "LENGTH(uri) > 0", > > > > > > "state_init" : "STATS_INIT()", > > > > > > "state_update" : { > > > > > > "state" : "STATS_ADD(state, LENGTH(uri))" > > > > > > }, > > > > > > "state_merge" : "STATS_MERGE(states)", > > > > > > "separator" : "," > > > > > > }, > > > > > > "extractor" : "SQL_ROW" > > > > > > } > > > > > > > > > > > > > > > > > > Regarding value filter, that's already around in the extractor > > config > > > > > > because of the need to transform data in the flatfile loader. > > While I > > > > > > definitely see the desire to use unix tools to prep data, there > are > > > > some > > > > > > things that aren't as easy to do. For instance, here, removing > the > > > TLD > > > > > of > > > > > > a domain is not a trivial task in a shell script and we have > > existing > > > > > > functions for that in Stellar. I would see people using both. > > > > > > > > > > > > To address the issue of a more targeted experience to bloom, I > > think > > > > that > > > > > > sort of specialization should best exist in the UI layer. Having > a > > > > more > > > > > > complete and expressive backend reused across specific UIs seems > to > > > be > > > > > the > > > > > > best of all worlds. 
It allows power users to drop down and do > more > > > > > complex > > > > > > things and still provides a (mostly) code-free and targeted > > > experience > > > > > for > > > > > > users. It seems to me that limiting the expressibility in the > > backend > > > > > > isn't the right way to go since this work just fits in with our > > > > existing > > > > > > engine. > > > > > > > > > > > > > > > > > > On Thu, Jan 4, 2018 at 1:40 AM, James Sirota <jsir...@apache.org > > > > > > wrote: > > > > > > > > > > > >> I just went through these pull requests as well and also agree > > this > > > is > > > > > >> good work. I think it's a good first pass. I would be careful > with > > > > > trying > > > > > >> to boil the ocean here. I think for the initial use case I would > > > only > > > > > >> support loading the bloom filters from HDFS. If people want to > > > > > pre-process > > > > > >> the CSV file of domains using awk or sed this should be out of > > scope > > > > of > > > > > >> this work. It's easy enough to do out of band and I would not > > > include > > > > > any > > > > > >> of these functions at all. I also think that the config could be > > > > > >> considerably simplified. I think value_filter should be removed > > > > (since > > > > > I > > > > > >> believe that preprocessing should be done by the user outside of > > > this > > > > > >> process). I also have a question about the init, update, and > merge > > > > > >> configurations. Would I ever initialize to anything but an empty > > > > bloom > > > > > >> filter? For the state update would I ever do anything other than > > add > > > > to > > > > > >> the bloom filter? For the state merge would I ever do anything > > other > > > > > than > > > > > >> merge the states? If the answer to these is 'no', then this > should > > > > > simply > > > > > >> be hard coded and not externalized into config values. 
> > > > > >> > > > > > >> 03.01.2018, 14:20, "Michael Miklavcic" < > > michael.miklav...@gmail.com > > > >: > > > > > > > > >> > I just finished stepping through the typosquatting use case > > README > > > > in > > > > > >> your > > > > > >> > merge branch. This is really, really good work Casey. I see > most > > > of > > > > > our > > > > > >> > previous documentation issues addressed up front, e.g. special > > > > > variables > > > > > >> > are cited, all new fields explained, side effects documented. > > The > > > > use > > > > > >> case > > > > > >> > doc brings it all together soup-to-nuts and I think all the > > pieces > > > > > make > > > > > >> > sense in a mostly self-contained way. I can't think of > anything > > I > > > > had > > > > > to > > > > > >> > sit and think about for more than a few seconds. I'll be > making > > my > > > > way > > > > > >> > through your individual PR's in more detail, but my first > > > > impressions > > > > > >> are > > > > > >> > that this is excellent. > > > > > >> > > > > > > >> > On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic < > > > > > >> > michael.miklav...@gmail.com> wrote: > > > > > >> > > > > > > >> >> I'm liking this design and growth strategy, Casey. I also > think > > > > Nick > > > > > >> and > > > > > >> >> Otto have some valid points. I always find there's a natural > > > > tension > > > > > >> >> between too little, just enough, and boiling the ocean and > > these > > > > > >> discuss > > > > > >> >> threads really help drive what the short and long term > visions > > > > > should > > > > > >> look > > > > > >> >> like. > > > > > >> >> > > > > > >> >> On the subject of repositories and strategies, I agree that > > > > > pluggable > > > > > >> >> repos and strategies for modifying them would be useful. For > > the > > > > > first > > > > > >> >> pass, I'd really like to see HDFS with the proposed set of > > > Stellar > > > > > >> >> functions. 
This gives us a lot of bang for our buck - we can > > > > > >> capitalize on > > > > > >> >> a set of powerful features around existence checking earlier > > > > without > > > > > >> having > > > > > >> >> to worry about later interface changes impacting users. With > > the > > > > > >> primary > > > > > >> >> interface coming through the JSON config, we are building a > > nice > > > > > >> facade > > > > > >> >> that protects users from later implementation abstractions > and > > > > > >> >> improvements, all while providing a stable enough interface > on > > > > which > > > > > >> we can > > > > > >> >> develop UI features as desired. I'd be interested to hear > more > > > > about > > > > > >> what > > > > > >> >> features could be provided by a repository as time goes by. > > > > > >> Federation, > > > > > >> >> permissions, governance, metadata management, perhaps? > > > > > >> >> > > > > > >> >> I also had some concern over duplicating existing Unix > > features. > > > I > > > > > >> think > > > > > >> >> where I'm at has been largely addressed by Casey's comments > on > > 1) > > > > > >> scaling, > > > > > >> >> 2) multiple variables, and 3) portability to Hadoop. > Providing > > 2 > > > > > >> approaches > > > > > >> >> - 1 which is config-based and the other a composable set of > > > > > functions > > > > > >> gives > > > > > >> >> us the ability to provide a core set of features that can > later > > > be > > > > > >> easily > > > > > >> >> expanded by users as the need arises. Here again I think the > > > > > >> prescribed > > > > > >> >> approach provides a strong first pass that we can then expand > > on > > > > > >> without > > > > > >> >> concern of future improvements becoming a hassle for end > users. 
> > > > > >> >> Best,
> > > > > >> >> Mike
> > > > > >> >>
> > > > > >> >> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > > > >> >>
> > > > > >> >>> There is some really cool stuff happening here, if only I’d been allowed to see the lists over Christmas... :)
> > > > > >> >>>
> > > > > >> >>> A few thoughts...
> > > > > >> >>>
> > > > > >> >>> I like Otto’s generalisation of the problem to include specific local stellar objects in a cache loaded from a store (HDFS seems a natural, but not only, place; maybe even a web service / local microservicey object provider!?). That said, I suspect that’s a good platform optimisation approach. Should we look at this as a separate piece of work, given it extends beyond the scope of the summarisation concept, and ultimately use it as a back-end to feed the summarising engine proposed here for the enrichment loader?
> > > > > >> >>>
> > > > > >> >>> On the more specific use case, one thing I would comment on is the configuration approach. The iteration loop (state_{init|update|merge}) should be consistent with the way we handle things like the profiler config, since it’s the same approach to data handling.
> > > > > >> >>>
> > > > > >> >>> The other thing that seems to have crept in here is the interface to something like Spark, which again, I am really very very keen on seeing happen.
That said, not sure how that would happen in this > > > > context, > > > > > >> unless > > > > > >> >>> you’re talking about pushing to something like livy for > > example > > > > > >> (eminently > > > > > >> >>> sensible for things like cross instance caching and faster > > > > RPC-ish > > > > > >> access > > > > > >> >>> to an existing spark context which seem to be what Casey is > > > > driving > > > > > >> at with > > > > > >> >>> the spark piece. > > > > > >> >>> > > > > > >> >>> To address the question of text manipulation in Stellar / > > metron > > > > > >> >>> enrichment ingest etc, we already have this outside of the > > > > context > > > > > >> of the > > > > > >> >>> issues here. I would argue that yes, we don’t want too many > > > paths > > > > > >> for this, > > > > > >> >>> and that maybe our parser approach might be heavily related > to > > > > > >> text-based > > > > > >> >>> ingest. I would say the scope worth dealing with here though > > is > > > > not > > > > > >> really > > > > > >> >>> text manipulation, but summarisation, which is not well > served > > > by > > > > > >> existing > > > > > >> >>> CLI tools like awk / sed and friends. > > > > > >> >>> > > > > > >> >>> Simon > > > > > >> >>> > > > > > >> >>> > On 3 Jan 2018, at 15:48, Nick Allen <n...@nickallen.org> > > > > wrote: > > > > > >> >>> > > > > > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa > > 1m, > > > > so > > > > > I > > > > > >> >>> think > > > > > >> >>> > this will impact performance > > > > > >> >>> > > > > > > >> >>> > What exactly takes an hour? Adding 1M entries to a bloom > > > > filter? > > > > > >> That > > > > > >> >>> > seems really high, unless I am not understanding > something. 
> > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > > > > > > >> >>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella < > > > > > ceste...@gmail.com> > > > > > >> >>> wrote: > > > > > >> >>> > > > > > > >> >>> >> Thanks for the feedback, Nick. > > > > > >> >>> >> > > > > > >> >>> >> Regarding "IMHO, I'd rather not reinvent the wheel for > text > > > > > >> >>> manipulation." > > > > > >> >>> >> > > > > > >> >>> >> I would argue that we are not reinventing the wheel for > > text > > > > > >> >>> manipulation > > > > > >> >>> >> as the extractor config exists already and we are doing a > > > > > similar > > > > > >> >>> thing in > > > > > >> >>> >> the flatfile loader (in fact, the code is reused and > merely > > > > > >> extended). > > > > > >> >>> >> Transformation operations are already supported in our > > > > codebase > > > > > >> in the > > > > > >> >>> >> extractor config, this PR has just added some hooks for > > > > stateful > > > > > >> >>> >> operations. > > > > > >> >>> >> > > > > > >> >>> >> Furthermore, we will need a configuration object to pass > to > > > > the > > > > > >> REST > > > > > >> >>> call > > > > > >> >>> >> if we are ever to create a UI around importing data into > > > hbase > > > > > or > > > > > >> >>> creating > > > > > >> >>> >> these summary objects. > > > > > >> >>> >> > > > > > >> >>> >> Regarding your example: > > > > > >> >>> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | > > > > stellar > > > > > -i > > > > > >> >>> >> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)' > > > > > >> >>> >> > > > > > >> >>> >> I'm very sympathetic to this type of extension, but it > has > > > > some > > > > > >> issues: > > > > > >> >>> >> > > > > > >> >>> >> 1. This implies a single-threaded addition to the bloom > > > > filter. > > > > > >> >>> >> 1. 
> > > > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa 1M, so I think this will impact performance
> > > > > >> >>> >>    2. There's not a way to specify how to merge across threads if we do make a multithreaded command line option
> > > > > >> >>> >> 2. This restricts these kinds of operations to roles with heavy Unix CLI knowledge, which isn't often the types of people who would be doing this type of operation
> > > > > >> >>> >> 3. What if we need two variables passed to Stellar?
> > > > > >> >>> >> 4. This approach will be harder to move to Hadoop. Eventually we will want to support data on HDFS being processed by Hadoop (similar to the flatfile loader), so instead of -m LOCAL being passed for the flatfile summarizer you'd pass -m SPARK and the processing would happen on the cluster
> > > > > >> >>> >>    1. This is particularly relevant in this case as it's an embarrassingly parallel problem in general
> > > > > >> >>> >>
> > > > > >> >>> >> In summary, while this CLI approach is attractive, I prefer the extractor config solution because it is the solution with the smallest iteration that:
> > > > > >> >>> >>
> > > > > >> >>> >> 1. Reuses existing Metron extraction infrastructure
> > > > > >> >>> >> 2. Provides the most solid base for the extensions that will be sorely needed soon (and will keep it in parity with the flatfile loader)
> > > > > >> >>> >> 3.
Provides the most solid base for a future UI extension > > in > > > > the > > > > > >> >>> >> management UI to support both summarization and loading > > > > > >> >>> >> > > > > > >> >>> >> > > > > > >> >>> >> > > > > > >> >>> >> > > > > > >> >>> >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen < > > > > > n...@nickallen.org> > > > > > >> >>> wrote: > > > > > >> >>> >> > > > > > >> >>> >>> First off, I really do like the typosquatting use case > > and a > > > > > lot > > > > > >> of > > > > > >> >>> what > > > > > >> >>> >>> you have described. > > > > > >> >>> >>> > > > > > >> >>> >>>> We need a way to generate the summary sketches from > flat > > > > data > > > > > >> for > > > > > >> >>> this > > > > > >> >>> >> to > > > > > >> >>> >>>> work. > > > > > >> >>> >>>> .. > > > > > >> >>> >>>> > > > > > >> >>> >>> > > > > > >> >>> >>> I took this quote directly from your use case. Above is > > the > > > > > point > > > > > >> >>> that > > > > > >> >>> >> I'd > > > > > >> >>> >>> like to discuss and what your proposed solutions center > > on. > > > > > This > > > > > >> is > > > > > >> >>> >> what I > > > > > >> >>> >>> think you are trying to do, at least with PR #879 > > > > > >> >>> >>> <https://github.com/apache/metron/pull/879>... > > > > > >> >>> >>> > > > > > >> >>> >>> (Q) Can we repurpose Stellar functions so that they can > > > > operate > > > > > >> on > > > > > >> >>> text > > > > > >> >>> >>> stored in a file system? > > > > > >> >>> >>> > > > > > >> >>> >>> > > > > > >> >>> >>> Whether we use the (1) Configuration or the (2) > > > > Function-based > > > > > >> >>> approach > > > > > >> >>> >>> that you described, fundamentally we are introducing new > > > ways > > > > > to > > > > > >> >>> perform > > > > > >> >>> >>> text manipulation inside of Stellar. > > > > > >> >>> >>> > > > > > >> >>> >>> IMHO, I'd rather not reinvent the wheel for text > > > > manipulation. 
> > > > > It > > > > > >> >>> would > > > > > >> >>> >> be > > > > > >> >>> >>> painful to implement and maintain a bunch of Stellar > > > > functions > > > > > >> for > > > > > >> >>> text > > > > > >> >>> >>> manipulation. People already have a large number of > tools > > > > > >> available > > > > > >> >>> to > > > > > >> >>> >> do > > > > > >> >>> >>> this and everyone has their favorites. People are > > resistant > > > > to > > > > > >> >>> learning > > > > > >> >>> >>> something new when they already are familiar with > another > > > way > > > > > to > > > > > >> do > > > > > >> >>> the > > > > > >> >>> >>> same thing. > > > > > >> >>> >>> > > > > > >> >>> >>> So then the question is, how else can we do this? My > > > > suggestion > > > > > >> is > > > > > >> >>> that > > > > > >> >>> >>> rather than introducing text manipulation tools inside > of > > > > > >> Stellar, we > > > > > >> >>> >> allow > > > > > >> >>> >>> people to use the text manipulation tools they already > > know, > > > > > but > > > > > >> with > > > > > >> >>> the > > > > > >> >>> >>> Stellar functions that we already have. And the obvious > > way > > > > to > > > > > >> tie > > > > > >> >>> those > > > > > >> >>> >>> two things together is the Unix pipeline. > > > > > >> >>> >>> > > > > > >> >>> >>> A quick, albeit horribly incomplete, example to flesh > this > > > > out > > > > > a > > > > > >> bit > > > > > >> >>> more > > > > > >> >>> >>> based on the example you have in PR #879 > > > > > >> >>> >>> <https://github.com/apache/metron/pull/879>. This would > > > > allow > > > > > >> me to > > > > > >> >>> >>> integrate Stellar with whatever external tools that I > > want. 
> > > > > >> >>> >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > > > >> >>> >>>
> > > > > >> >>> >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > > > >> >>> >>>
> > > > > >> >>> >>>> I'll start this discussion off with my idea around a 2nd step that is more adaptable. I propose the following set of Stellar functions backed by Spark in the metron-management project:
> > > > > >> >>> >>>>
> > > > > >> >>> >>>> - CSV_PARSE(location, separator?, columns?): Constructs a Spark Dataframe for reading the flat file
> > > > > >> >>> >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the dataframe
> > > > > >> >>> >>>> - SUMMARIZE(state_init, state_update, state_merge): Summarize the dataframe using the lambda functions:
> > > > > >> >>> >>>>   - state_init - executed once per worker to initialize the state
> > > > > >> >>> >>>>   - state_update - executed once per row
> > > > > >> >>> >>>>   - state_merge - Merge the worker states into one worker state
> > > > > >> >>> >>>> - OBJECT_SAVE(obj, output_path): Save the object obj to the path output_path on HDFS.
> > > > > >> >>> >>>>
> > > > > >> >>> >>>> This would enable more flexibility and composability than the configuration-based approach that we have in the flatfile loader.
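The SUMMARIZE contract described in the thread, with state_init run once per worker, state_update once per row, and state_merge folding the per-worker states together, can be sketched as a partition-wise fold followed by a merge. This is an editorial sketch in plain Python rather than Spark, and the function shape is one reading of the proposal, not shipped code:

```python
from functools import reduce
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")  # state/summary type
R = TypeVar("R")  # row type

def summarize(partitions: Iterable[List[R]],
              state_init: Callable[[], S],
              state_update: Callable[[S, R], S],
              state_merge: Callable[[S, S], S]) -> S:
    """state_init runs once per worker/partition, state_update once per
    row, and state_merge folds the per-worker states into one summary."""
    worker_states = []
    for part in partitions:
        state = state_init()
        for row in part:
            state = state_update(state, row)
        worker_states.append(state)
    return reduce(state_merge, worker_states)

# e.g. summing row lengths across two "workers", mirroring the
# STATS_INIT / STATS_ADD / STATS_MERGE shape used in the thread
total = summarize(
    partitions=[["ab", "cde"], ["f"]],
    state_init=lambda: 0,
    state_update=lambda s, row: s + len(row),
    state_merge=lambda a, b: a + b,
)
```

As long as state_merge is associative, the same three lambdas work unchanged whether there is one thread, many threads, or many Spark executors, which is what makes the problem embarrassingly parallel.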
> > > > > >> >>> >>>> My concern with this approach, and the reason I didn't > do > > > it > > > > > >> >>> initially, > > > > > >> >>> >>> was > > > > > >> >>> >>>> that I think that users will want at least 2 ways to > > > > summarize > > > > > >> data > > > > > >> >>> (or > > > > > >> >>> >>>> load data): > > > > > >> >>> >>>> > > > > > >> >>> >>>> - A configuration based approach, which enables a UI > > > > > >> >>> >>>> - A set of stellar functions via the scriptable REPL > > > > > >> >>> >>>> > > > > > >> >>> >>>> I would argue that both have a place and I started with > > the > > > > > >> >>> >> configuration > > > > > >> >>> >>>> based approach as it was a more natural extension of > what > > > we > > > > > >> already > > > > > >> >>> >> had. > > > > > >> >>> >>>> I'd love to hear thoughts about this idea too. > > > > > >> >>> >>>> > > > > > >> >>> >>>> > > > > > >> >>> >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella < > > > > > >> ceste...@gmail.com> > > > > > >> >>> >>> wrote: > > > > > >> >>> >>>> > > > > > >> >>> >>>>> Hi all, > > > > > >> >>> >>>>> > > > > > >> >>> >>>>> I wanted to get some feedback on a sensible plan for > > > > > >> something. It > > > > > >> >>> >>>>> occurred to me the other day when considering the > > use-case > > > > of > > > > > >> >>> >> detecting > > > > > >> >>> >>>>> typosquatted domains, that one approach was to > generate > > > the > > > > > >> set of > > > > > >> >>> >>>>> typosquatted domains for some set of reference domains > > and > > > > > >> compare > > > > > >> >>> >>>> domains > > > > > >> >>> >>>>> as they flow through. > > > > > >> >>> >>>>> > > > > > >> >>> >>>>> One way we could do this would be to generate this > data > > > and > > > > > >> import > > > > > >> >>> >> the > > > > > >> >>> >>>>> typosquatted domains into HBase. 
> I thought, however, that another approach might trade off some accuracy to
> remove the network hop and potential disk seek: constructing a bloom
> filter that includes the set of typosquatted domains.
>
> The challenge is that we don't have a way to do this currently. We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see
> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> which handles:
>
> - parsing flat files
> - transforming the rows
> - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
> - initialize a state object
> - update the object for every row
> - merge the state objects (needed for multiple threads; with one thread
>   it's not necessary).
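A bloom filter fits this init/update/merge lifecycle naturally because merging is just a bitwise OR. The sketch below is illustrative only (class and method names are invented, not Metron's actual implementation): the constructor plays the role of the init step, `add` the per-row update, and `merge` the cross-thread merge.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter illustrating the summary state object."""

    def __init__(self, size=1024, hashes=3):   # init: empty state
        self.size, self.hashes = size, hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):                       # update: once per row
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

    def merge(self, other):                    # merge: OR the bitsets
        merged = BloomFilter(self.size, self.hashes)
        merged.bits = self.bits | other.bits
        return merged

# Two thread-local filters merged into one summary object.
a = BloomFilter(); a.add("gogle")
b = BloomFilter(); b.add("goggle")
m = a.merge(b)
# m.might_contain("gogle") and m.might_contain("goggle") are both True
```

Because OR is associative and commutative, partial filters built by independent threads merge into exactly the filter a single thread would have produced.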
> I think this is a sensible decision because:
>
> - It's a minimal movement from the flat file loader
> - It uses the same configs
> - It abstracts and reuses the existing infrastructure
> - Having one extractor config means it should be easier to generate a UI
>   around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an already difficult config.
> I propose this as an initial step forward to support the use-case; we can
> enable something more composable going forward. My concern in considering
> the composable approach as the first step was that composable units for
> data transformation and manipulation suddenly take us into a place where
> Stellar starts to look like Pig or the Spark RDD API. I wasn't ready for
> that without a lot more discussion.
> To summarize, what I'd like to get from the community, after reviewing the
> entire use-case at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection,
> is:
>
> - Is this so confusing that it does not belong in Metron even as a
>   first step?
> - Is there a way to extend the extractor config in a less confusing way to
>   enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>
> -------------------
> Thank you,
>
> James Sirota
> PMC - Apache Metron
> jsirota AT apache DOT org