[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2018-01-02 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/622
  
There was a lot of good discussion on this, but I find it hard to fully 
summarize everyone's positions, including those of @mattf-horton and 
@cestella.  Here is my attempt.  Please correct anything that I have 
misstated.

1. Everyone agreed that a ToC (table of contents) is a useful additional 
feature for the Profiler.  The decodable row key would be needed in addition 
to, not instead of, a ToC.

1. In implementing a decodable row key, we do need to plan for future 
changes in row key format.  This was handled in this PR, but can be improved.

1. The decodable row key feature should be completed **before** a ToC so 
that the row keys can be used to generate (or regenerate) a ToC on-demand.

1. There were various suggestions made on how to shorten up the row key 
format.  Some of those I completed on this PR (like using a murmur hash) and 
others (like using shorts instead of ints) I would need to incorporate in a 
future PR for a decodable row key.

1. There is a need for a migration tool: one that can read the existing 
row key format and rewrite the same data using a new format.  This tool is 
necessary even if it cannot be implemented deterministically with the current 
row key format.  The tool may need hints from the user, like the names of 
known profiles.
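
To make points 2 and 4 concrete, here is a minimal sketch of a row key that carries a version byte and uses 2-byte (short) length prefixes so that every field can be recovered. The layout, the field order, and the function names are assumptions for illustration only, not the format implemented in this PR, and the salt is left out for brevity.

```python
import struct

VERSION = 1  # hypothetical format version; bump when the layout changes


def _put(text):
    """Encode a string as a 2-byte big-endian length followed by UTF-8 bytes."""
    raw = text.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw


def encode_row_key(profile, entity, groups, period):
    """Assumed layout: version (1B) | period (8B) | profile | entity | group count (2B) | groups."""
    parts = [struct.pack(">BQ", VERSION, period), _put(profile), _put(entity)]
    parts.append(struct.pack(">H", len(groups)))
    parts.extend(_put(g) for g in groups)
    return b"".join(parts)


def decode_row_key(key):
    """Recover (profile, entity, groups, period) from a key built by encode_row_key."""
    version, period = struct.unpack_from(">BQ", key, 0)
    if version != VERSION:
        raise ValueError("unsupported row key version: %d" % version)
    pos = 9  # 1 version byte + 8 period bytes

    def take():
        nonlocal pos
        (length,) = struct.unpack_from(">H", key, pos)
        pos += 2
        text = key[pos:pos + length].decode("utf-8")
        pos += length
        return text

    profile = take()
    entity = take()
    (count,) = struct.unpack_from(">H", key, pos)
    pos += 2
    groups = [take() for _ in range(count)]
    return profile, entity, groups, period


key = encode_row_key("login-count", "10.0.0.1", ["mon", "am"], 42)
print(decode_row_key(key))
```

Because the version is the first byte, a reader can reject or dispatch on formats it does not understand, which is what would let a later PR change the layout without breaking existing data.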


Once I compile a summary of these changes, I will close this PR.  All 
enhancements around this will be implemented on new PRs.


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2018-01-02 Thread justinleet
Github user justinleet commented on the issue:

https://github.com/apache/metron/pull/622
  
@nickwallen I haven't been following this discussion, but it seems like a 
useful feature / enhancement that's been hanging out awhile after active 
discussion petered out. What are the next steps here?  Does this PR need 
changes?  Should the discussion be revived on the user lists?  It doesn't seem 
like there was any consensus on the approach, but again, I like this 
enhancement a lot.


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-25 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/metron/pull/622
  
@cestella , 
> Would this approach require scans on read in the critical path?

I don't perceive that decoding rowkeys is on any critical path.  You only 
need to look up Profile by serial number (or hash) in the case of decoding 
rowkeys.  No? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-25 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/metron/pull/622
  
@nickwallen brought up the issue of wildcard queries on our rowkeys.  It 
has always bothered me that we can't do wildcard queries on groups.  If you 
have, for example, a single groupBy based on day of week, that's just 7 
possible values, and if you want them all you could just do 7 queries and 
combine them.  But if you have three groupBy's, and they have 7, 31, and 256 
possible values, then to simulate a wildcard query you would have to do over 
55,000 individual queries!  Of course you would just do an hbase scan, but it 
would require a full table scan to select the time range desired.
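
The query count above is just the product of the three groupBy cardinalities. A quick check of the arithmetic (the variable names are only illustrative):

```python
from math import prod

# Possible values for each of the three groupBy expressions in the example.
group_cardinalities = [7, 31, 256]

# Simulating a wildcard with point queries needs one query per combination.
point_queries = prod(group_cardinalities)
print(point_queries)  # 55552
```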

I propose that we re-order the rowkey elements to support prefix queries on 
Profile and time range, with wildcarding for primarily groups, and secondarily 
entities, ie:
\\\\\\

So if I want the results for all rows in a time range regarding entity 
"192.168.222.123" regardless of group, I can query it, and if I want all rows 
in a time range regardless of entity value or group, I can query that too, as 
efficiently as an ordinary time range query.  What do you think?
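
Here is a rough sketch of why that ordering helps. It simulates a sorted key space in memory and assumes a layout of profile hash (4B), then period (8B, big-endian), then entity, then groups. That layout, the NUL separators, and the helper names are my assumptions for illustration; nothing here calls a real HBase API.

```python
import bisect
import struct
from typing import List


def row_key(profile_hash: bytes, period: int, entity: str, groups: List[str]) -> bytes:
    """Hypothetical layout: profile hash | period | entity | groups (NUL-terminated strings)."""
    parts = [profile_hash, struct.pack(">Q", period), entity.encode() + b"\x00"]
    parts += [g.encode() + b"\x00" for g in groups]
    return b"".join(parts)


def range_scan(sorted_keys: List[bytes], start: bytes, stop: bytes) -> List[bytes]:
    """Return keys in [start, stop), mimicking a start-row/stop-row scan over sorted keys."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


P = b"\x01\x02\x03\x04"  # made-up profile hashes
Q = b"\x09\x09\x09\x09"

table = sorted([
    row_key(P, 100, "192.168.222.123", ["mon"]),
    row_key(P, 100, "10.0.0.1", ["tue"]),
    row_key(P, 101, "192.168.222.123", ["tue"]),
    row_key(P, 102, "192.168.222.123", ["mon"]),
    row_key(Q, 100, "192.168.222.123", ["mon"]),
])

# All rows of profile P in periods [100, 102), regardless of entity or group.
in_range = range_scan(table, P + struct.pack(">Q", 100), P + struct.pack(">Q", 102))
print(len(in_range))  # 3

# Within one period, all groups for a single entity form a contiguous prefix.
prefix = P + struct.pack(">Q", 100) + b"192.168.222.123\x00"
print(len([k for k in table if k.startswith(prefix)]))  # 1
```

With the profile and period at the front, a time-range query is one contiguous slice of the key space. Selecting a single entity across several periods would still need a filter within that slice, since the entity sits after the period.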



---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-25 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/metron/pull/622
  
@mattf-horton Would this approach require scans on read in the critical 
path?


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-25 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/metron/pull/622
  
@cestella , tl;dr: The discussion of serial numbers is a distraction.  
Let's just use the profileHash and forget the serial number.  It was a 
micro-optimization.

Answer to your question:  Two cases:
- If you have the profileHash, then you can look up the Profile using an 
hbase wildcard query for rowkey \\* , and since the profileHash 
is unique, it will be essentially as efficient as using the full rowkey.
- If you are trying to decode a rowkey and only have the serial number then 
I stated some assumptions: "The expectation is that we will seldom (almost 
never) need to reference back to the Profile specification, and the total usage 
of Profile specs will be human-scale finite, **so it is okay to "scan" the 
ProfileSpecs table to find the full Profile spec referenced by a profileSN.** 
If this is not true, use the full hash as both the rowkey in the PeriodSpecs 
table, and as the reference element in the Profile rowkeys."


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-25 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/metron/pull/622
  
@mattf-horton Wouldn't you have to use the serial number to retrieve 
profiles?


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-25 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/metron/pull/622
  
@cestella , we would not need to keep an index resident in memory.  Most of 
the time we would just have the active Profiles in memory, exactly as we do 
today.  You only need to retrieve the Profile by serial number on the rare 
occasions that you have to decode rowkeys.  That said, it's fine with me to 
just use the profileHash.  I agree it decreases complexity.


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-21 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/metron/pull/622
  
And btw, since there is no easily expressed algorithm for the NLP part of 
the problem, I'm +1 on doing both a decodable rowkey and a ToC.  For the 
existing profiles that @cestella expressed concern about, I would point out 
that as long as one DOES have the Profile specs still lying around, it's 
actually easy to re-write the old Profiles into new format with decodable 
rowkeys.  That is a very modest-sized program, the main problem being noticing 
and dealing with duplicate titled Profiles with different periodDurations.  But 
the info I pointed out in the paper helps sufficiently, I think.


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-21 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/metron/pull/622
  
Here's what I've got on decoding old rowkeys:
https://gist.github.com/mattf-horton/8e685e373b1a3fa6aeec8ef8828be096

The format of the keys is
`salt (4B) + profile name (?) + entity name (?) + groupvalues (?) + period 
(8B)`
with most of it (all but the salt and period number) in the clear as 
human-readable strings.

Deducing periodDuration has a nice arithmetic answer, I think.
The NLP issues are of course harder.  Enjoy the read, it's only two pages.



---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-20 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/metron/pull/622
  
I want to point out that I am also in favor of an audit log for the 
profiler, but I don't think it's a complete solution for the batch analytics 
use-case.


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-20 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/metron/pull/622
  
Also, while we're in here, is there a strong reason why the prefixed hash 
is so large?  It's just there for uniformity of distribution, correct?  I'd 
propose a non-cryptographic hash for this purpose like Murmur.
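
For a sense of scale, a 32-bit Murmur3 hash gives a 4-byte salt. Below is a small pure-Python version of the standard MurmurHash3 x86 32-bit routine, written out only so the example is self-contained. The `salt_for` helper and the fields it hashes are hypothetical and not taken from the Profiler code.

```python
def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit variant) of data, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    body_end = n & ~3  # length rounded down to a multiple of 4

    # Mix each complete 4-byte little-endian block into the state.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Mix the remaining 1 to 3 tail bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def salt_for(profile: str, entity: str, period: int) -> bytes:
    """Hypothetical 4-byte salt derived from fields assumed to be in the row key."""
    material = "{}|{}|{}".format(profile, entity, period).encode("utf-8")
    return murmur3_32(material).to_bytes(4, "big")


print(salt_for("login-count", "10.0.0.1", 42).hex())
```

Since the salt only has to spread rows evenly, collision resistance is not a requirement here, which is the argument for preferring a short non-cryptographic hash over a longer cryptographic digest.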


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-20 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/metron/pull/622
  
So, in my mind the feature here is the enablement of batch analytics on the 
profiles.  To that end, I'm in general in favor of a decodable row key. I think 
that the question really isn't a ToC *or* a decodable rowkey.  I think, rather, 
we will want both.  The two will follow different access patterns.  A decodable 
rowkey sans ToC will be suitable only for full table scan-style access.  A ToC 
would enable us to slice and dice by profile/entity/etc.  

That being said, a ToC without a decodable rowkey is substantially less 
nice.  Without being able to decode the rowkey, we will not be able to 
regenerate the ToC to provide alternative indexing.  I see this as a first step 
to enable a broader discussion on just what kind of access semantics beyond 
Get/Put we want to place on the profiles.

All that to say, I'm in favor of the effort.  I worry about the impact on 
existing profiles going forward, though.  From the point where we do this, we 
will create a fork whereby new profiles and old profiles diverge.  I think we need 
to discuss the migration story more explicitly and see if it is plausible to 
create a migration tool that is fuzzy (i.e. will look at the existing profiles 
and try to pick them apart).

I'd be ok for that work to be a follow-on, but I would want the plan to be 
very explicit and I would be -1 for a release until it's in.


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-20 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/metron/pull/622
  
Let me take a look at this more deeply.


---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-20 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/622
  
>  I don't think our current row key is totally opaque, it just needs a 
brute-force approach to figure out. Not suitable for interactive queries, but 
would be acceptable for a one-time pass to build (or re-build) the ToC.

For reference, here is what the existing row key looks like.

 salt (16B) + profile name (?) + entity name (?) + groups (?) + time (8B)

How would you decode it?  The salt and the time components have known 
lengths; 16B and 8B respectively.  Other than those two components, I don't 
know how to distinguish the profile name, entity or groups.  I can only decode 
the row key if I already know either the profile name or the entity, which 
defeats the advantages of being able to decode it.
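
A small sketch of the ambiguity. It assumes the three variable-length fields are simply concatenated with no delimiters or length prefixes, and that the time is an 8-byte big-endian value; both are assumptions made for the example, as are the function names and the sample values.

```python
import struct

SALT_LEN = 16  # known length, per the layout above
TIME_LEN = 8   # known length, per the layout above


def build_key(salt, profile, entity, groups, period):
    """Assumed encoding: variable-length fields concatenated with no delimiters."""
    middle = (profile + entity + "".join(groups)).encode("utf-8")
    return salt + middle + struct.pack(">Q", period)


def split_known(key):
    """Peel off the two fixed-length components; the middle stays opaque."""
    salt = key[:SALT_LEN]
    middle = key[SALT_LEN:-TIME_LEN]
    (period,) = struct.unpack(">Q", key[-TIME_LEN:])
    return salt, middle, period


salt = bytes(SALT_LEN)  # placeholder salt, identical for both keys in this demo

# Two different (profile, entity) pairs that produce identical middle bytes.
key_a = build_key(salt, "login-count", "10.0.0.1", [], 42)
key_b = build_key(salt, "login-count1", "0.0.0.1", [], 42)

print(split_known(key_a)[1])                          # b'login-count10.0.0.1'
print(split_known(key_a)[1] == split_known(key_b)[1])  # True
```

The salt and time come off cleanly, but nothing in the remaining bytes says where the profile name ends and the entity begins.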




---


[GitHub] metron issue #622: METRON-1005 Create Decodable Row Key for Profiler

2017-07-15 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/metron/pull/622
  
> Your proposal has the advantage of making data in HBase self-identifying 
(if one has the key), which I always like. However, it's a large change and 
induces yet more complexity

What do you find unnecessarily complex here?  The code base was already 
designed to accept different row key implementations.  So this change involves 
the following.

1. The new decodable row key 
2. Profiler client logic to instantiate row key builders
3. Profiler client logic to pass parameters to the instantiated row key 
builders

I would agree that item 3 is unnecessarily complex.  That's where I 
wanted feedback.  I think just passing parameters through an interface method 
would simplify this a lot.



---