I just went through these pull requests as well, and I agree this is good work and a good first pass. I would be careful about trying to boil the ocean here, though.

For the initial use case, I would only support loading the bloom filters from HDFS. If people want to pre-process the CSV file of domains using awk or sed, that should be out of scope for this work. It's easy enough to do out of band, and I would not include any of these functions at all.

I also think the config could be considerably simplified. I think value_filter should be removed, since I believe preprocessing should be done by the user outside of this process.

I also have a question about the init, update, and merge configurations:

- Would I ever initialize to anything but an empty bloom filter?
- For the state update, would I ever do anything other than add to the bloom filter?
- For the state merge, would I ever do anything other than merge the states?

If the answer to each of these is 'no', then this should simply be hard-coded and not externalized into config values.
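To make the init/update/merge question concrete, here is a minimal sketch in Python of what hard-coding the three steps could look like. It is not Metron or Stellar code: the BloomFilter class, its sizing parameters, the hashing scheme, and the summarize function are all hypothetical names chosen for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter. Sizes and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set; 0 means "empty"

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Merging two filters with identical parameters is a bitwise OR.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("cannot merge filters with different parameters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def summarize(partitions):
    """Hard-coded init/update/merge, with nothing externalized into config."""
    result = BloomFilter()            # init: always an empty filter
    for rows in partitions:           # each partition stands in for one worker thread
        state = BloomFilter()         # init
        for row in rows:
            state.add(row)            # update: always "add to the filter"
        result = result.merge(state)  # merge: always "OR the states together"
    return result
```

With this shape, the only things left to configure would be the input location and the filter's sizing, for example `summarize([["google", "amazon"], ["github"]])`.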
03.01.2018, 14:20, "Michael Miklavcic" <michael.miklav...@gmail.com>:
> I just finished stepping through the typosquatting use case README in your merge branch. This is really, really good work Casey. I see most of our previous documentation issues addressed up front, e.g. special variables are cited, all new fields explained, side effects documented. The use case doc brings it all together soup-to-nuts and I think all the pieces make sense in a mostly self-contained way. I can't think of anything I had to sit and think about for more than a few seconds. I'll be making my way through your individual PR's in more detail, but my first impressions are that this is excellent.
>
> On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>
>> I'm liking this design and growth strategy, Casey. I also think Nick and Otto have some valid points. I always find there's a natural tension between too little, just enough, and boiling the ocean and these discuss threads really help drive what the short and long term visions should look like.
>>
>> On the subject of repositories and strategies, I agree that pluggable repos and strategies for modifying them would be useful. For the first pass, I'd really like to see HDFS with the proposed set of Stellar functions. This gives us a lot of bang for our buck - we can capitalize on a set of powerful features around existence checking earlier without having to worry about later interface changes impacting users. With the primary interface coming through the JSON config, we are building a nice facade that protects users from later implementation abstractions and improvements, all while providing a stable enough interface on which we can develop UI features as desired. I'd be interested to hear more about what features could be provided by a repository as time goes by. Federation, permissions, governance, metadata management, perhaps?
>>
>> I also had some concern over duplicating existing Unix features. I think where I'm at has been largely addressed by Casey's comments on 1) scaling, 2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches - 1 which is config-based and the other a composable set of functions gives us the ability to provide a core set of features that can later be easily expanded by users as the need arises. Here again I think the prescribed approach provides a strong first pass that we can then expand on without concern of future improvements becoming a hassle for end users.
>>
>> Best,
>> Mike
>>
>> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
>>
>>> There is some really cool stuff happening here, if only I’d been allowed to see the lists over Christmas... :)
>>>
>>> A few thoughts...
>>>
>>> I like Otto’s generalisation of the problem to include specific local stellar objects in a cache loaded from a store (HDFS seems a natural, but not only place, maybe even a web service / local microservicey object provider!?) That said, I suspect that’s a good platform optimisation approach. Should we look at this as a separate piece of work given it extends beyond the scope of the summarisation concept and ultimately use it as a back-end to feed the summarising engine proposed here for the enrichment loader?
>>>
>>> On the more specific use case, one thing I would comment on is the configuration approach. The iteration loop (state_{init|update|merge}) should be consistent with the way we handle things like the profiler config, since it’s the same approach to data handling.
>>>
>>> The other thing that seems to have crept in here is the interface to something like Spark, which again, I am really very very keen on seeing happen.
>>> That said, not sure how that would happen in this context, unless you’re talking about pushing to something like livy for example (eminently sensible for things like cross instance caching and faster RPC-ish access to an existing spark context which seem to be what Casey is driving at with the spark piece).
>>>
>>> To address the question of text manipulation in Stellar / metron enrichment ingest etc, we already have this outside of the context of the issues here. I would argue that yes, we don’t want too many paths for this, and that maybe our parser approach might be heavily related to text-based ingest. I would say the scope worth dealing with here though is not really text manipulation, but summarisation, which is not well served by existing CLI tools like awk / sed and friends.
>>>
>>> Simon
>>>
>>> > On 3 Jan 2018, at 15:48, Nick Allen <n...@nickallen.org> wrote:
>>> >
>>> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think this will impact performance
>>> >
>>> > What exactly takes an hour? Adding 1M entries to a bloom filter? That seems really high, unless I am not understanding something.
>>> >
>>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <ceste...@gmail.com> wrote:
>>> >
>>> >> Thanks for the feedback, Nick.
>>> >>
>>> >> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>>> >>
>>> >> I would argue that we are not reinventing the wheel for text manipulation as the extractor config exists already and we are doing a similar thing in the flatfile loader (in fact, the code is reused and merely extended). Transformation operations are already supported in our codebase in the extractor config, this PR has just added some hooks for stateful operations.
>>> >>
>>> >> Furthermore, we will need a configuration object to pass to the REST call if we are ever to create a UI around importing data into hbase or creating these summary objects.
>>> >>
>>> >> Regarding your example:
>>> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>>> >>
>>> >> I'm very sympathetic to this type of extension, but it has some issues:
>>> >>
>>> >>    1. This implies a single-threaded addition to the bloom filter.
>>> >>       1. Even with 5 threads, it takes an hour for the full alexa 1m, so I think this will impact performance
>>> >>       2. There's not a way to specify how to merge across threads if we do make a multithread command line option
>>> >>    2. This restricts these kinds of operations to roles with heavy unix CLI knowledge, which isn't often the types of people who would be doing this type of operation
>>> >>    3. What if we need two variables passed to stellar?
>>> >>    4. This approach will be harder to move to Hadoop. Eventually we will want to support data on HDFS being processed by Hadoop (similar to flatfile loader), so instead of -m LOCAL being passed for the flatfile summarizer you'd pass -m SPARK and the processing would happen on the cluster
>>> >>       1. This is particularly relevant in this case as it's an embarrassingly parallel problem in general
>>> >>
>>> >> In summary, while this CLI approach is attractive, I prefer the extractor config solution because it is the solution with the smallest iteration that:
>>> >>
>>> >>    1. Reuses existing metron extraction infrastructure
>>> >>    2. Provides the most solid base for the extensions that will be sorely needed soon (and will keep it in parity with the flatfile loader)
>>> >>    3. Provides the most solid base for a future UI extension in the management UI to support both summarization and loading
>>> >>
>>> >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <n...@nickallen.org> wrote:
>>> >>
>>> >>> First off, I really do like the typosquatting use case and a lot of what you have described.
>>> >>>
>>> >>>> We need a way to generate the summary sketches from flat data for this to work.
>>> >>>> ..
>>> >>>
>>> >>> I took this quote directly from your use case. Above is the point that I'd like to discuss and what your proposed solutions center on. This is what I think you are trying to do, at least with PR #879 <https://github.com/apache/metron/pull/879>...
>>> >>>
>>> >>> (Q) Can we repurpose Stellar functions so that they can operate on text stored in a file system?
>>> >>>
>>> >>> Whether we use the (1) Configuration or the (2) Function-based approach that you described, fundamentally we are introducing new ways to perform text manipulation inside of Stellar.
>>> >>>
>>> >>> IMHO, I'd rather not reinvent the wheel for text manipulation. It would be painful to implement and maintain a bunch of Stellar functions for text manipulation. People already have a large number of tools available to do this and everyone has their favorites. People are resistant to learning something new when they already are familiar with another way to do the same thing.
>>> >>>
>>> >>> So then the question is, how else can we do this? My suggestion is that rather than introducing text manipulation tools inside of Stellar, we allow people to use the text manipulation tools they already know, but with the Stellar functions that we already have.
>>> >>> And the obvious way to tie those two things together is the Unix pipeline.
>>> >>>
>>> >>> A quick, albeit horribly incomplete, example to flesh this out a bit more based on the example you have in PR #879 <https://github.com/apache/metron/pull/879>. This would allow me to integrate Stellar with whatever external tools that I want.
>>> >>>
>>> >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>>> >>>
>>> >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
>>> >>>
>>> >>>> I'll start this discussion off with my idea around a 2nd step that is more adaptable. I propose the following set of stellar functions backed by Spark in the metron-management project:
>>> >>>>
>>> >>>>    - CSV_PARSE(location, separator?, columns?) : Constructs a Spark Dataframe for reading the flatfile
>>> >>>>    - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the dataframe
>>> >>>>    - SUMMARIZE(state_init, state_update, state_merge): Summarize the dataframe using the lambda functions:
>>> >>>>       - state_init - executed once per worker to initialize the state
>>> >>>>       - state_update - executed once per row
>>> >>>>       - state_merge - Merge the worker states into one worker state
>>> >>>>    - OBJECT_SAVE(obj, output_path) : Save the object obj to the path output_path on HDFS.
>>> >>>>
>>> >>>> This would enable more flexibility and composability than the configuration-based approach that we have in the flatfile loader.
>>> >>>> My concern with this approach, and the reason I didn't do it initially, was that I think that users will want at least 2 ways to summarize data (or load data):
>>> >>>>
>>> >>>>    - A configuration based approach, which enables a UI
>>> >>>>    - A set of stellar functions via the scriptable REPL
>>> >>>>
>>> >>>> I would argue that both have a place and I started with the configuration based approach as it was a more natural extension of what we already had. I'd love to hear thoughts about this idea too.
>>> >>>>
>>> >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
>>> >>>>
>>> >>>>> Hi all,
>>> >>>>>
>>> >>>>> I wanted to get some feedback on a sensible plan for something. It occurred to me the other day when considering the use-case of detecting typosquatted domains, that one approach was to generate the set of typosquatted domains for some set of reference domains and compare domains as they flow through.
>>> >>>>>
>>> >>>>> One way we could do this would be to generate this data and import the typosquatted domains into HBase. I thought, however, of another approach, which may trade off accuracy to remove the network hop and potential disk seek: constructing a bloom filter that includes the set of typosquatted domains.
>>> >>>>>
>>> >>>>> The challenge was that we don't have a way to do this currently. We do, however, have a loading infrastructure (e.g. the flatfile_loader) and configuration (see https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties) which handles:
>>> >>>>>
>>> >>>>>    - parsing flat files
>>> >>>>>    - transforming the rows
>>> >>>>>    - filtering the rows
>>> >>>>>
>>> >>>>> To enable the new use-case of generating a summary object (e.g. a bloom filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I propose that we create a new utility that uses the same extractor config and adds the ability to:
>>> >>>>>
>>> >>>>>    - initialize a state object
>>> >>>>>    - update the object for every row
>>> >>>>>    - merge the state objects (in the case of multiple threads, in the case of one thread it's not needed).
>>> >>>>>
>>> >>>>> I think this is a sensible decision because:
>>> >>>>>
>>> >>>>>    - It's a minimal movement from the flat file loader
>>> >>>>>    - Uses the same configs
>>> >>>>>    - Abstracts and reuses the existing infrastructure
>>> >>>>>    - Having one extractor config means that it should be easier to generate a UI around this to simplify the experience
>>> >>>>>
>>> >>>>> All that being said, our extractor config is..shall we say...daunting :). I am sensitive to the fact that this adds to an existing difficult config. I propose that this is an initial step forward to support the use-case and we can enable something more composable going forward. My concern in considering this as the first step was that it felt that the composable units for data transformation and manipulation suddenly take us into a place where Stellar starts to look like Pig or Spark RDD API. I wasn't ready for that without a lot more discussion.
>>> >>>>>
>>> >>>>> To summarize, what I'd like to get from the community is, after reviewing the entire use-case at https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>>> >>>>>
>>> >>>>>    - Is this so confusing that it does not belong in Metron even as a first-step?
>>> >>>>>    - Is there a way to extend the extractor config in a less confusing way to enable this?
>>> >>>>>
>>> >>>>> I apologize for making the discuss thread *after* the JIRAs, but I felt this one might bear having some working code to consider.

-------------------
Thank you,

James Sirota
PMC- Apache Metron
jsirota AT apache DOT org