[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user nickwallen commented on the issue: https://github.com/apache/metron/pull/622 There was a lot of good discussion on this, but I find it hard to summarize completely the positions of everyone including @mattf-horton and @cestella. Here is my attempt to do that. Please correct anything that I have misstated. 1. Everyone agreed that a ToC (table of contents) is a useful additional feature for the Profiler. The decodable row key would be needed in addition to, not instead of, a ToC. 2. In implementing a decodable row key, we do need to plan for future changes in row key format. This was handled in this PR, but can be improved. 3. The decodable row key feature should be completed **before** a ToC so that the row keys can be used to generate (or regenerate) a ToC on-demand. 4. There were various suggestions made on how to shorten the row key format. Some of those I completed on this PR (like using a murmur hash) and others (like using shorts instead of ints) I would need to incorporate in a future PR for a decodable row key. 5. There is a need for a migration tool that can read the existing row key format and rewrite the same data using a new format. This tool is necessary even if it cannot be implemented deterministically with the current row key format. The tool may need hints from the user, like the names of known profiles. Once I compile a summary of these changes, I will close this PR. All enhancements around this will be implemented on new PRs. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user justinleet commented on the issue: https://github.com/apache/metron/pull/622 @nickwallen I haven't been following this discussion, but it seems like a useful feature / enhancement that's been hanging out awhile after active discussion petered out. What are the next steps here? Does this PR need changes? Should the discussion be revived on the user lists? It doesn't seem like there was any consensus on the approach, but again, I like this enhancement a lot. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/622 @cestella , > Would this approach require scans on read in the critical path? I don't perceive that decoding rowkeys is on any critical path. You only need to look up Profile by serial number (or hash) in the case of decoding rowkeys. No? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/622 @nickwallen brought up the issue of wildcard queries on our rowkeys. It has always bothered me that we can't do wildcard queries on groups. If you have, for example, a single groupBy based on day of week, that's just 7 possible values, and if you want them all you could just do 7 queries and combine them. But if you have three groupBy's, and they have 7, 31, and 256 possible values, then to simulate a wildcard query you would have to do over 55,000 individual queries! Of course you would just do an hbase scan, but it would require a full table scan to select the time range desired. I propose that we re-order the rowkey elements to support prefix queries on Profile and time range, with wildcarding for primarily groups, and secondarily entities, ie: \\ \ \ \ \ So if I want the results for all rows in a time range regarding entity "192.168.222.123" regardless of group, I can query it, and if I want all rows in a time range regardless of entity value or group, I can query that too, as efficiently as an ordinary time range query. What do you think? ---
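The arithmetic and the re-ordering idea in the comment above can be sketched as follows. This is only an illustration under stated assumptions: the key order `(profile, period, entity, group)` is one reading of the prose (the original element list did not survive in this archive), the salt is left out, an in-memory sorted list stands in for HBase's sorted row keys, and `scan_time_range` and the sample profile names are hypothetical.

```python
from bisect import bisect_left
from math import prod

# Three groupBy's with 7, 31 and 256 possible values: simulating a wildcard
# with point lookups needs one query per combination.
point_queries = prod([7, 31, 256])

def scan_time_range(sorted_keys, profile, start, end):
    """Return every key for `profile` with start <= period <= end.

    Because period comes right after profile in the (assumed) key order,
    the matching keys form one contiguous slice of the sorted key space.
    """
    lo = bisect_left(sorted_keys, (profile, start))
    hi = bisect_left(sorted_keys, (profile, end + 1))
    return sorted_keys[lo:hi]

# Hypothetical sample data: 2 profiles x 5 periods x 2 entities x 2 groups.
keys = sorted(
    (profile, period, entity, group)
    for profile in ("logins", "bytes_out")
    for period in range(5)
    for entity in ("192.168.222.123", "192.168.222.124")
    for group in ("mon", "tue")
)

# All entities and groups for one profile over periods 1..2: a single range scan.
hits = scan_time_range(keys, "logins", 1, 2)

# Narrowing to one entity, regardless of group, is a filter over that range.
one_entity = [k for k in hits if k[2] == "192.168.222.123"]
```

With this ordering the wildcard over groups and entities costs one bounded range scan plus a filter, rather than one lookup per group combination or a full table scan.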
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user cestella commented on the issue: https://github.com/apache/metron/pull/622 @mattf-horton Would this approach require scans on read in the critical path? ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/622 @cestella , tl;dr: The discussion of serial numbers is a distraction. Let's just use the profileHash and forget the serial number. It was a micro-optimization. Answer to your question: Two cases: - If you have the profileHash, then you can look up the Profile using an hbase wildcard query for rowkey \\* , and since the profileHash is unique, it will be essentially as efficient as using the full rowkey. - If you are trying to decode a rowkey and only have the serial number then I stated some assumptions: "The expectation is that we will seldom (almost never) need to reference back to the Profile specification, and the total usage of Profile specs will be human-scale finite, **so it is okay to "scan" the ProfileSpecs table to find the full Profile spec referenced by a profileSN.** If this is not true, use the full hash as both the rowkey in the PeriodSpecs table, and as the reference element in the Profile rowkeys." ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user cestella commented on the issue: https://github.com/apache/metron/pull/622 @mattf-horton Wouldn't you have to use the serial number to retrieve profiles? ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/622 @cestella , we would not need to keep an index resident in memory. Most of the time we would just have the active Profiles in memory, exactly as we do today. You only need to retrieve the Profile by serial number on the rare occasions that you have to decode rowkeys. That said, it's fine with me to just use the profileHash. I agree it decreases complexity. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/622 And btw, since there is no easily expressed algorithm for the NLP part of the problem, I'm +1 on doing both a decodable rowkey and a ToC. For the existing profiles that @cestella expressed concern about, I would point out that as long as one DOES have the Profile specs still lying around, it's actually easy to re-write the old Profiles into new format with decodable rowkeys. That is a very modest-sized program, the main problem being noticing and dealing with duplicate titled Profiles with different periodDurations. But the info I pointed out in the paper helps sufficiently, I think. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/622 Here's what I've got on decoding old rowkeys: https://gist.github.com/mattf-horton/8e685e373b1a3fa6aeec8ef8828be096 The format of the keys is `salt (4B) + profile name (?) + entity name (?) + groupvalues (?) + period (8B)` with most of it (all but the salt and period number) in the clear as human-readable strings. Deducing periodDuration has a nice arithmetic answer, I think. The NLP issues are of course harder. Enjoy the read, it's only two pages. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user cestella commented on the issue: https://github.com/apache/metron/pull/622 I want to point out that I am also in favor of an audit log for the profiler, but I don't think it's a complete solution for the batch analytics use-case. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user cestella commented on the issue: https://github.com/apache/metron/pull/622 Also, while we're in here, is there a strong reason why the prefixed hash is so large? It's just there for uniformity of distribution, correct? I'd propose a non-cryptographic hash for this purpose like Murmur. ---
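To make the size argument concrete, here is a minimal sketch comparing a 16-byte cryptographic digest with a 4-byte non-cryptographic hash used as a key prefix. It rests on assumptions: the salt is derived from the period number, MD5 is used only because its digest is 16 bytes wide, and `zlib.crc32` is a stand-in because the Python standard library has no Murmur implementation. The function names are hypothetical and this is not the Profiler's code.

```python
import hashlib
import struct
import zlib

def wide_salt(period: int) -> bytes:
    """16-byte prefix from a cryptographic digest of the period (assumed input)."""
    return hashlib.md5(struct.pack(">q", period)).digest()

def narrow_salt(period: int) -> bytes:
    """4-byte prefix from a non-cryptographic hash (crc32 as a Murmur stand-in)."""
    return struct.pack(">I", zlib.crc32(struct.pack(">q", period)))

# Bytes saved on every row key by switching to the narrower prefix.
saved_per_key = len(wide_salt(0)) - len(narrow_salt(0))
```

Since the prefix only has to spread writes across regions, it needs good distribution, not collision resistance, which is why a shorter non-cryptographic hash is sufficient for this purpose.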
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user cestella commented on the issue: https://github.com/apache/metron/pull/622 So, in my mind the feature here is the enablement of batch analytics on the profiles. To that end, I'm in general in favor of a decodable row key. I think that the question really isn't a ToC *or* a decodable rowkey. I think, rather, we will want both. The two will follow different access patterns. A decodable rowkey sans ToC will be suitable only for full table scan-style access. A ToC would enable us to slice and dice by profile/entity/etc. That being said, a ToC without a decodable rowkey is substantially less nice. Without being able to decode the rowkey, we will not be able to regenerate the ToC to provide alternative indexing. I see this as a first step to enable a broader discussion on just what kind of access semantics beyond Get/Put we want to place on the profiles. All that to say, I'm in favor of the effort. I worry about the impact going forward on existing profiles, though. From the point where we do this, we will create a fork whereby new profiles and old profiles diverge. I think we need to discuss the migration story more explicitly and see if it is plausible to create a migration tool that is fuzzy (i.e. will look at the existing profiles and try to pick them apart). I'd be ok for that work to be a follow-on, but I would want the plan to be very explicit and I would be -1 for a release until it's in. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user mattf-horton commented on the issue: https://github.com/apache/metron/pull/622 Let me take a look at this more deeply. ---
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user nickwallen commented on the issue: https://github.com/apache/metron/pull/622 > I don't think our current row key is totally opaque, it just needs a brute-force approach to figure out. Not suitable for interactive queries, but would be acceptable for a one-time pass to build (or re-build) the ToC. For reference, here is what the existing row key looks like: `salt (16B) + profile name (?) + entity name (?) + groups (?) + time (8B)` How would you decode it? The salt and the time components have known lengths: 16B and 8B, respectively. Other than those two components, I don't know how to distinguish the profile name, entity or groups. I can only decode the row key if I already know either the profile name or the entity, which defeats the advantages of being able to decode it. ---
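The decoding problem described above can be reproduced in a few lines. The sketch below assumes the variable-length fields are concatenated as plain UTF-8 with no delimiters, as the quoted layout suggests; the 2-byte length prefix shown afterwards is a hypothetical fix for illustration and not necessarily the format implemented in this PR. All function names and sample values are made up.

```python
import struct

SALT_LEN = 16    # salt width from the layout quoted above
PERIOD_LEN = 8   # big-endian period number at the end of the key

def legacy_key(salt, profile, entity, groups, period):
    """Concatenate the variable-length fields with no delimiters (assumed)."""
    middle = (profile + entity + "".join(groups)).encode("utf-8")
    return salt + middle + struct.pack(">q", period)

def split_legacy_key(key):
    """Recover what can be recovered: only the fixed-width ends."""
    salt = key[:SALT_LEN]
    middle = key[SALT_LEN:-PERIOD_LEN]
    (period,) = struct.unpack(">q", key[-PERIOD_LEN:])
    return salt, middle, period

def encode_fields(fields):
    """Hypothetical decodable variant: 2-byte length before each field."""
    out = b""
    for field in fields:
        raw = field.encode("utf-8")
        out += struct.pack(">H", len(raw)) + raw
    return out

def decode_fields(data):
    """Walk the length prefixes to split the fields apart again."""
    fields, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from(">H", data, pos)
        pos += 2
        fields.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return fields

salt = bytes(SALT_LEN)
# Two different (profile, entity) pairs collapse into the same legacy key.
key_a = legacy_key(salt, "login", "count10.0.0.1", [], 42)
key_b = legacy_key(salt, "logincount", "10.0.0.1", [], 42)
```

Here `key_a` and `key_b` are byte-for-byte identical, so no decoder can tell the two profiles apart without already knowing the profile or entity name, whereas the length-prefixed encoding round-trips each field unambiguously.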
[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler
Github user nickwallen commented on the issue: https://github.com/apache/metron/pull/622 > Your proposal has the advantage of making data in HBase self-identifying (if one has the key), which I always like. However, it's a large change and induces yet more complexity What do you find unnecessarily complex here? The code base was already designed to accept different row key implementations. So this change involves the following. 1. The new decodable row key 2. Profiler client logic to instantiate row key builders 3. Profiler client logic to pass parameters to the instantiated row key builders I would agree that I think item 3 is unnecessarily complex. That's where I wanted feedback. I think just passing parameters through an interface method would simplify this a lot. ---