Main issue is resolved. The test I was using to determine normality was too
sensitive to discretization, so it was yielding a negative result even though
the data looked pretty normal on visual inspection. The tool only ever uses
the Strings generator; HexStrings is unused.
The only (minor) concern is that the Strings generator emits some control
characters in the strings it produces. I presume that this behavior is
undesired and that the output should be restricted to printable ASCII
characters.
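
For anyone who wants to check generated values for this, here is a minimal
sketch in Python. It is not part of cassandra-stress, and both function names
are made up for the example. It assumes "printable ASCII" means the code points
0x20 (space) through 0x7E (tilde).

```python
def non_printing_chars(text):
    """Return the characters of `text` outside printable ASCII (0x20..0x7E)."""
    return [c for c in text if not 0x20 <= ord(c) <= 0x7E]


def is_ascii_printable(text):
    """True if every character of `text` is printable ASCII (hypothetical helper)."""
    return not non_printing_chars(text)


if __name__ == "__main__":
    sample = "abc\x1adef\x07"  # contains SUB (\x1A) and BEL (\x07)
    print(is_ascii_printable(sample))
    print([hex(ord(c)) for c in non_printing_chars(sample)])
```

Running it over a dump of the generated column values should flag any value
that contains characters such as SUB or BEL.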
Thanks,
-Saleil
From: bened...@apache.org At: 12/13/18 17:10:17 To: Saleil Bhat (BLOOMBERG/ 731
LEX), dev@cassandra.apache.org
Subject: Re: cassandra-stress HexStrings generator
I’m honestly not sure. The code has changed since I last worked on it, which
was years ago. I suspect the profile mode has entirely supplanted the prior
modes, and that these older modes supported the HexStrings generator.
Perhaps somebody else can help answer this question.
> On 13 Dec 2018, at 17:37, Saleil Bhat (BLOOMBERG/ 731 LEX) wrote:
>
> Ah ok thanks. This brings up another question: how did the HexStrings
> generator code path even get called?
>
>
>
> When I saw these results, I was using the following test table:
> CREATE TABLE testtable (
>     partition_key text,
>     clustering_column text,
>     value text,
>     PRIMARY KEY (partition_key, clustering_column)
> )
>
>
> From StressProfile.java, any column of type TEXT should use the Strings
> generator.
> However, my data looks suspiciously like the HexStrings generator was being
> used instead.
>
>
> First, the generated strings included control characters like SUB (\x1A), BEL
> (\x07), etc. However, the Strings generator code looks like it forces the
> characters to be in the printing characters range.
> Second, the result I documented previously (that the characters are normally
> distributed, but the strings are not) matches the implementation of
> HexStrings.
>
>
>
> Do you know why this might be the case?
>
> Thanks,
> -Saleil
>
>
> From: bened...@apache.org At: 12/12/18 18:09:14 To: Saleil Bhat (BLOOMBERG/
> 731 LEX), dev@cassandra.apache.org
> Subject: Re: cassandra-stress HexStrings generator
>
> Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s
> been a long time so I cannot remember much for certain).
>
> It should be implemented like the Strings generator. It looks like both
> HexStrings and HexBytes are incorrect, and have been for a long time.
>
>
>> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX) wrote:
>>
>> Hi,
>>
>> I have a question about the behavior of the HexStrings value generator in
>> the cassandra-stress tool, particularly concerning its population/identity
>> distribution.
>>
>>
>> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML
>> profile, the population field in a columnspec “represents the total unique
>> population distribution of that column across rows.”
>>
>>
>> I interpreted this to mean that if I specify some distribution 'F' for a
>> column, then the probability of occurrence for each potential value of that
>> column is given by 'F'.
>>
>> So, for example, if I provided the following columnspec for a text column:
>> name: fake_column
>> size: fixed(32)
>> population: gaussian(1..100)
>> and then generated a large amount of data according to this specification,
>> I would expect there to be 100 distinct values for ‘fake_column’, and that a
>> histogram of the frequency of occurrence of each value would be roughly
>> bell-shaped.
>>
>>
>>
>> However, the current implementation of the HexStrings generator deviates
>> from this expectation. In the current implementation, each CHARACTER in the
>> string is drawn from F, rather than the string as a whole. Therefore, if you
>> plot the histogram of frequency of occurrence for each character, you get a
>> bell-shaped curve, but the distribution of the occurrences of whole strings
>> (the actual columns) is something else.
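
The difference described in the quoted paragraph can be illustrated with a
small Python simulation. This is only a sketch under stated assumptions, not
the cassandra-stress code: the function names, the 16-symbol alphabet, the
wrap-around used to turn a draw into a character, the +/- 3 standard deviation
range, and the seed-to-string mapping are all invented for the example. It
compares drawing every character from F independently with drawing one seed
from F per string, using the 1..100 population and 32-character size from the
columnspec example.

```python
import random

ALPHABET = "0123456789abcdef"  # hypothetical alphabet, for illustration only


def draw_gaussian(rng, lo, hi):
    """Draw an integer in [lo, hi] from a clamped, roughly bell-shaped F."""
    mean = (lo + hi) / 2
    stdev = (hi - lo) / 6  # assumption: range spans about +/- 3 std deviations
    return min(hi, max(lo, round(rng.gauss(mean, stdev))))


def per_character(rng, length, lo, hi):
    """Every character is drawn from F independently (the reported behaviour)."""
    # Wrapping the draw onto the alphabet with % is an illustrative choice.
    return "".join(
        ALPHABET[draw_gaussian(rng, lo, hi) % len(ALPHABET)] for _ in range(length)
    )


def per_string(rng, length, lo, hi):
    """One seed is drawn from F; the string is a deterministic function of it."""
    seed = draw_gaussian(rng, lo, hi)
    chars = random.Random(seed)  # assumed seed-to-string mapping
    return "".join(chars.choice(ALPHABET) for _ in range(length))


def distinct_values(generator, samples=20_000, length=32, lo=1, hi=100):
    """Count how many distinct strings `generator` yields over `samples` draws."""
    rng = random.Random(42)
    return len({generator(rng, length, lo, hi) for _ in range(samples)})


if __name__ == "__main__":
    print("per string:   ", distinct_values(per_string))
    print("per character:", distinct_values(per_character))
```

In this sketch the per-string variant can never produce more than 100 distinct
values, because each string is determined by a seed in 1..100, whereas the
per-character variant produces an essentially unbounded set of strings.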
>>
>>
>> My question is, is this the desired behavior for string columns? Was my
>> expectation/interpretation incorrect? If so, can anyone give some insight as
>> to why strings are designed to behave this way and what the use case is for
>> this behavior?
>>
>> Thanks,
>> -Saleil
>
>