[ 
https://issues.apache.org/jira/browse/CASSANDRA-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350956#comment-14350956
 ] 

Jonathan Shook edited comment on CASSANDRA-8929 at 3/6/15 9:48 PM:
-------------------------------------------------------------------

Ideas on how I would like to see this work. (This is where I contradict myself 
in terms of simplicity by asking for more.)
Intercept at the coordinator, and record samples only at the coordinator. Make 
sampling a sticky setting: a table option that is also soft-settable via 
JMX.

Sampling controls:
* sample_probability: Just like trace probability
* sample_interval_seconds: Number of seconds for each sampling interval (I 
can't imagine why we'd need something finer grained, but maybe?)
* sample_max_per_interval: explained below

sample_max_per_interval: Number of samples per sampling interval, after which 
samples are suppressed. When suppression occurs, the number of suppressed 
samples should be written to the sample log once the interval completes, and 
the counter reset. It's ok for this to be inconsistent with respect to 
restarts, etc. The main purpose is to avoid significant "over sampling" load, 
while still being able to see meaningful data during unexpected bursts.
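
A minimal sketch of how the three sampling controls could interact, assuming a 
single in-process sampler at the coordinator. The class name WorkloadSampler, 
the offer() method, and the in-memory list standing in for the sample log are 
all hypothetical; only the three option names come from the proposal above. 
The suppressed count is written lazily, on the first request that arrives after 
the interval has ended.

```python
import random
import time


class WorkloadSampler:
    """Hypothetical coordinator-side sampler; option names mirror the proposal."""

    def __init__(self, sample_probability, sample_interval_seconds,
                 sample_max_per_interval, clock=time.monotonic, rng=random.random):
        self.sample_probability = sample_probability
        self.sample_interval_seconds = sample_interval_seconds
        self.sample_max_per_interval = sample_max_per_interval
        self._clock = clock          # injectable so the behaviour can be tested
        self._rng = rng
        self._interval_start = clock()
        self._recorded = 0
        self._suppressed = 0
        self.log = []                # stand-in for the real sample log

    def _roll_interval(self, now):
        """Start a new interval if the current one has elapsed."""
        if now - self._interval_start < self.sample_interval_seconds:
            return
        if self._suppressed:
            # Written lazily: only when a request arrives after the interval ends.
            self.log.append(("suppressed", self._suppressed))
        self._interval_start = now
        self._recorded = 0
        self._suppressed = 0

    def offer(self, statement):
        """Return True if the statement was recorded as a sample."""
        self._roll_interval(self._clock())
        if self._rng() >= self.sample_probability:
            return False             # not selected by the probability check
        if self._recorded >= self.sample_max_per_interval:
            self._suppressed += 1    # selected, but over the per-interval cap
            return False
        self._recorded += 1
        self.log.append(("sample", statement))
        return True
```

The counters live only in memory, so they are lost on restart, which matches 
the "ok to be inconsistent with respect to restarts" allowance.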

Data controls: for anonymizing field values when needed, the ability to select 
a level of obfuscation.
sample_data_obfuscate:
* actual - No changes, record samples with full field values
* hashed - Use md5 or something better to hide original sample values, but 
allow for statistical analysis
* sizes - Discard value, but record string lengths and collection counts
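
The three obfuscation levels could be sketched roughly as follows. The function 
name obfuscate() and the shape of the "sizes" output are assumptions made for 
illustration, and SHA-256 is swapped in here as one possible "something better" 
than md5. The hash is unsalted so that equal values map to equal digests, which 
is what keeps statistical analysis possible.

```python
import hashlib


def obfuscate(value, mode="actual"):
    """Apply one of the proposed sample_data_obfuscate levels to a field value."""
    if mode == "actual":
        # No changes: record the full field value.
        return value
    if mode == "hashed":
        if isinstance(value, bytes):
            data = value
        elif isinstance(value, str):
            data = value.encode("utf-8")
        else:
            # Assumption: repr() is good enough for a sketch; it is not a
            # canonical encoding for unordered collections such as sets.
            data = repr(value).encode("utf-8")
        return hashlib.sha256(data).hexdigest()
    if mode == "sizes":
        # Discard the value, keep only its length or element count.
        if isinstance(value, (str, bytes)):
            return {"length": len(value)}
        if isinstance(value, (list, tuple, set, frozenset, dict)):
            return {"count": len(value)}
        return None  # scalar values carry no size information here
    raise ValueError("unknown obfuscation mode: %r" % (mode,))
```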

Data coverage: What to record.
* the statement itself
* whether it was prepared or not
* consistency level
* the client address
* any changes to sampling policy or settings - This could be a separate type of 
record in the sample log, as long as the formatting is stable for each value it 
encodes
* any counts for suppressed samples (written lazily at unthrottling time)
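
One way to picture the sample log is as a stream of typed records, one per 
line. The record classes, field names, and JSON encoding below are all 
hypothetical; the proposal only asks that the formatting be stable for each 
value it encodes, which is approximated here by sorting keys.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SampleRecord:
    statement: str            # the statement itself
    prepared: bool            # whether it was prepared or not
    consistency_level: str
    client_address: str
    type: str = "sample"


@dataclass(frozen=True)
class SettingsChangeRecord:
    setting: str              # e.g. "sample_probability"
    old_value: str
    new_value: str
    type: str = "settings_change"


@dataclass(frozen=True)
class SuppressedRecord:
    count: int                # samples suppressed in the finished interval
    type: str = "suppressed"


def to_log_line(record):
    """Serialize a record with sorted keys so the line format stays stable."""
    return json.dumps(asdict(record), sort_keys=True)
```

For example, to_log_line(SuppressedRecord(7)) yields 
{"count": 7, "type": "suppressed"}.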



> Workload sampling
> -----------------
>
>                 Key: CASSANDRA-8929
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8929
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>
> Workload *recording* looks to be unworkable (CASSANDRA-6572).  We could build 
> something almost as useful by sampling the requests sent to a node and 
> building a synthetic workload with the same characteristics using the same 
> (or anonymized) schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
