[ 
https://issues.apache.org/jira/browse/CASSANDRA-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350856#comment-14350856
 ] 

Benedict commented on CASSANDRA-8929:
-------------------------------------

So the goal is for users to do this as an acceptance phase prior to deploying 
an upgrade?

We can certainly work to make it easier to produce a good profile (manually or 
otherwise), and I think better example profiles that we use for testing will go 
a long way towards this. 

I do like the _idea_ of automatic generation, but it's not a simple task, and 
it will touch quite a few integral codepaths. At a minimum we need, for each 
update, to sample presence, size and compressibility for each column, along 
with a frequency distribution of partition key participation and of CQL row 
participation (i.e. for each partition key, we need to reconstruct the 
distribution of updates for each row within it). Simply collecting this is 
non-trivial. Constructing a profile from this data - once stress supports all 
of the functionality encountered - probably isn't super challenging 
conceptually, as we can calculate a best-fit distribution for the data we've 
sampled. It's still a significant chunk of work though. I do wonder if we 
could instead create a tool for generating this from an analysis of sstables 
combined with some user-provided data, as it would be easier to build and 
maintain without being intertwined with the C* code, possibly alongside some 
very simple sampling of just the frequency of given CQL statements.
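
To make the list of per-update statistics above concrete, here is a rough 
sketch of what a collector might track. It is only an illustration: the 
UpdateSampler class, its method names and the use of zlib as a stand-in for 
compressibility are all assumptions made for this example, and none of it 
exists in C* or in stress.

```python
import zlib
from collections import Counter, defaultdict


class UpdateSampler:
    """Hypothetical per-update sampler; not part of Cassandra or stress."""

    def __init__(self):
        self.updates = 0
        self.present = Counter()        # column -> number of updates setting it
        self.raw_bytes = Counter()      # column -> total uncompressed bytes
        self.packed_bytes = Counter()   # column -> total zlib-compressed bytes
        self.partition_hits = Counter()       # partition key -> update count
        self.row_hits = defaultdict(Counter)  # partition key -> row -> count

    def record(self, partition_key, row_key, columns):
        """Record one update; columns maps column name -> bytes value."""
        self.updates += 1
        self.partition_hits[partition_key] += 1
        self.row_hits[partition_key][row_key] += 1
        for name, value in columns.items():
            self.present[name] += 1
            self.raw_bytes[name] += len(value)
            # zlib is an assumption here; tiny values can "compress" to more
            # bytes than they started with because of the format overhead.
            self.packed_bytes[name] += len(zlib.compress(value))

    def column_summary(self, name):
        """Return presence, mean size and compression ratio for one column."""
        seen = self.present[name]
        if seen == 0:
            return None
        raw = self.raw_bytes[name]
        return {
            "presence": seen / self.updates,
            "mean_size": raw / seen,
            "compression_ratio": self.packed_bytes[name] / raw if raw else None,
        }
```

The partition_hits and row_hits counters hold the raw frequency data that a 
best-fit distribution would later be calculated from; the fitting step itself 
is left out of this sketch.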

> Workload sampling
> -----------------
>
>                 Key: CASSANDRA-8929
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8929
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>            Reporter: Jonathan Ellis
>
> Workload *recording* looks to be unworkable (CASSANDRA-6572).  We could build 
> something almost as useful by sampling the requests sent to a node and 
> building a synthetic workload with the same characteristics using the same 
> (or anonymized) schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
