Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread David Capwell
>  but I would hate to repeat the mistakes of our past by evolving the config 
> in a new direction without any coherent overarching design.

At the start I asked to keep the thread local to new features, but to flesh 
out an “overarching design” more fully, maybe we should increase the “desired” 
scope to be “feature” (and leave non-features to CASSANDRA-15234 - Standardise 
config and JVM parameters)?  In other words, do we think the following is more 
ideal (configs scoped to a feature):

hinted_handoff:
  enabled: true
  disabled_datacenters:
    - DC1
    - DC2
  max_window: 3h
  flush_period: 10s
  max_file_size: 128mb
  compression:
    class_name: LZ4Compressor
    parameters:
      a: b

track_warnings:
  enabled: true
  local_read_size:
    warn_threshold: 1mb
    abort_threshold: 10mb
  coordinator_read_size:
    warn_threshold: 5mb
    abort_threshold: 20mb


OR

# I had to rename hint configs as there was 0 consistent naming
hinted_handoff_enabled: true
hinted_handoff_disabled_datacenters:
  - 'DC1'
  - 'DC2'
hinted_handoff_max_window: 3h
hinted_handoff_max_file_size: 128mb
hinted_handoff_flush_period: 10s
hinted_handoff_compression:
  class_name: LZ4Compressor
  parameters:
    a: b

track_warnings_enabled: true
track_warnings_local_read_size_warn_threshold: 1mb
track_warnings_local_read_size_abort_threshold: 10mb
track_warnings_coordinator_read_size_warn_threshold: 5mb
track_warnings_coordinator_read_size_abort_threshold: 20mb


The main issue I have with flat structure is that we have no way to enforce 
standard naming; if you look at the hint example there were at least 3 naming 
conventions (CASSANDRA-15234 is to clean this up, but can we actually maintain 
that?).  And one of the core reasons track_warnings went nested was that 
warn/abort sometimes became warn/fail and threshold sometimes became 
thresholds.  By embracing nested structure we can actually enforce consistency; 
with flat we have no way to maintain consistency.
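
To illustrate the consistency argument, here is a minimal sketch in Python (not Cassandra's actual Java config classes; the Thresholds type and parse_thresholds helper are hypothetical names). Because every warn/abort pair is parsed through one shared type, a variant spelling such as fail_threshold or warn_thresholds is rejected rather than becoming a second convention:

```python
from dataclasses import dataclass, fields


@dataclass
class Thresholds:
    """One shared shape for every warn/abort pair (hypothetical type)."""
    warn_threshold: str
    abort_threshold: str


def parse_thresholds(raw: dict) -> Thresholds:
    """Build a Thresholds value, rejecting any key the shared type lacks."""
    allowed = {f.name for f in fields(Thresholds)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    return Thresholds(**raw)


# Values taken from the nested track_warnings example above.
track_warnings = {
    "local_read_size": {"warn_threshold": "1mb", "abort_threshold": "10mb"},
    "coordinator_read_size": {"warn_threshold": "5mb", "abort_threshold": "20mb"},
}
parsed = {name: parse_thresholds(raw) for name, raw in track_warnings.items()}
```

With flat keys there is no shared type to route every name through, so each new setting can drift independently.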

Additionally by embracing the nested structure we can accept a flat one as well 
(PR in CASSANDRA-17166 shows this working) if users desire it; so we get the 
consistency of nested, and the “grep” benefits of flat.


> On Nov 29, 2021, at 2:17 PM, bened...@apache.org wrote:
> 
> If we’re thinking of moving towards nested configuration, then before 
> employing the approach further we would ideally consider what a fully nested 
> config looks like for the project. Ekaterina has done a lot to clean up 
> inconsistent naming, but I would hate to repeat the mistakes of our past by 
> evolving the config in a new direction without any coherent overarching 
> design.
> 
> In case anyone missed it in the earlier discussion, this was my attempt to 
> prototype a nested config: 
> https://github.com/belliottsmith/cassandra/blob/5f80d1c0d38873b7a27dc137656d8b81f8e6bbd7/conf/cassandra_nocomment.yaml
> 
> I don’t have any specific attachment to it, but settling on some approximate 
> scheme would be helpful IMO.
> 
> From: David Capwell 
> Date: Monday, 29 November 2021 at 20:38
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] Nested YAML configs for new features
>> What should our default example cassandra.yaml file use (flat or nested)?  
>> Currently default shows nested
> 
> Was told this statement was confusing, so trying to clarify.  At the moment 
> we do not allow a nested config to be expressed in any way outside of nesting 
> it (excluding YAML’s ability to inline objects), so if we did allow flat 
> config representation of nested configs, then this would be a brand new 
> feature; we currently show the nested structure in cassandra.yaml
> 
>> On Nov 29, 2021, at 11:58 AM, David Capwell  
>> wrote:
>> 
>> Thanks everyone for the comments, I hope below is a good summary of all the 
>> talking points?
>> 
>> We already use nested configs (networking, seed provider, commit log/hint 
>> compression, back pressure, etc.)
>> Flat configs are easier for grep, but can be solved with grep -A/-B and/or yq
>> It would be possible to support flat versions of our configs in 
>> cassandra.yaml (in addition to the nested versions)
>> "Settings" vtable currently uses the "_" separator (example of 
>> encryption/audit log).  Switching to "." Would be a change in behavior which 
>> may impact some users
>> "." Separator for nested configs are common in other systems (yq, elastic 
>> search, etc.)
>> "Structured / nested config is easier for human eyes to read"... "Flat 
>> config is harder for human eyes but easy for simple scripts"
>> For learning what configs are enabled, cassandra.yaml isn't the best 
>> interface as it may not reflect the actual configs; we can better expose 
>> this in CQL and/or Sidecar
>> What should our default example cassandra.yaml file use (flat or nested)?  
>> Currently default shows nested
>> When projecting the Config into CQL, we may want to consider UDTs to 
>> represent the complex types
>> Current limitations in CQL make nested structures hard to work with, 

Re: [DISCUSS] Throughput issues when inserting on contended partitions

2021-11-29 Thread bened...@apache.org
I’m in favour, though I have weaker requirements for backports than others.

This work is pretty significant, though. It’s nothing like the complexity of 
CEP-14, but it heavily modifies a critical piece of the system. I would say 
that it needs a rigorous review process if it’s going into a patch release.

I also won’t have time to forward port it in the near future, though I think it 
shouldn’t be too onerous for other contributors, and I’m happy to help guide 
that process.


From: Brandon Williams 
Date: Monday, 29 November 2021 at 17:03
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Throughput issues when inserting on contended partitions
I think it makes a lot of sense to fix these in 4.0, they have been
lingering issues long enough. +1

On Mon, Nov 29, 2021 at 10:59 AM Benjamin Lerer  wrote:
>
> Hi everybody,
>
> We have seen some serious throughput issues when inserting data with
> collections on contended partitions (CASSANDRA-17163 and CASSANDRA-15464).
> Benedict has created some patches to address those issues and improve the
> insertion throughput and memory consumption (CASSANDRA-15510 and
> CASSANDRA-15511).
> Those patches are significant changes and are currently marked as
> improvements.
>
> I am wondering whether we should consider implementing those changes in 4.0,
> considering the fact that they will fix serious existing issues, and would
> like to hear your opinions about it.

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org


Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread bened...@apache.org
If we’re thinking of moving towards nested configuration, then before employing 
the approach further we would ideally consider what a fully nested config looks 
like for the project. Ekaterina has done a lot to clean up inconsistent naming, 
but I would hate to repeat the mistakes of our past by evolving the config in a 
new direction without any coherent overarching design.

In case anyone missed it in the earlier discussion, this was my attempt to 
prototype a nested config: 
https://github.com/belliottsmith/cassandra/blob/5f80d1c0d38873b7a27dc137656d8b81f8e6bbd7/conf/cassandra_nocomment.yaml

I don’t have any specific attachment to it, but settling on some approximate 
scheme would be helpful IMO.

From: David Capwell 
Date: Monday, 29 November 2021 at 20:38
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Nested YAML configs for new features
> What should our default example cassandra.yaml file use (flat or nested)?  
> Currently default shows nested

Was told this statement was confusing, so trying to clarify.  At the moment we 
do not allow a nested config to be expressed in any way outside of nesting it 
(excluding YAML’s ability to inline objects), so if we did allow flat config 
representation of nested configs, then this would be a brand new feature; we 
currently show the nested structure in cassandra.yaml

> On Nov 29, 2021, at 11:58 AM, David Capwell  
> wrote:
>
> Thanks everyone for the comments, I hope below is a good summary of all the 
> talking points?
>
> We already use nested configs (networking, seed provider, commit log/hint 
> compression, back pressure, etc.)
> Flat configs are easier for grep, but can be solved with grep -A/-B and/or yq
> It would be possible to support flat versions of our configs in 
> cassandra.yaml (in addition to the nested versions)
> "Settings" vtable currently uses the "_" separator (example of 
> encryption/audit log).  Switching to "." Would be a change in behavior which 
> may impact some users
> "." Separator for nested configs are common in other systems (yq, elastic 
> search, etc.)
> "Structured / nested config is easier for human eyes to read"... "Flat config 
> is harder for human eyes but easy for simple scripts"
> For learning what configs are enabled, cassandra.yaml isn't the best 
> interface as it may not reflect the actual configs; we can better expose this 
> in CQL and/or Sidecar
> What should our default example cassandra.yaml file use (flat or nested)?  
> Currently default shows nested
> When projecting the Config into CQL, we may want to consider UDTs to 
> represent the complex types
> Current limitations in CQL make nested structures hard to work with; it may 
> be worthwhile to expand CQL support for nested structures.
>
> I also took a quick stab at enhancing our cassandra.yaml logic to: 1) be 
> reusable outside of yaml parsing, 2) support setters (we currently do, but 
> setters must be snake case… I fixed that), 3) support both nested and 
> flat, 4) support ignoring fields in a consistent way (Settings vtable 
> will include things SnakeYAML won’t and vice versa).
>
> https://github.com/apache/cassandra/pull/1335 
> This PR is NOT a final ready-to-merge thing, but instead a POC to show how 
> we can solve a lot of the core problems in a consistent and reusable manner.
>
> The following cassandra.yaml was used to show both worlds would work fine in 
> the config (and complement each other):
>
> track_warnings:
>   enabled: true
>   # nested relative to the local level (TrackWarnings)
>   coordinator_read_size.warn_threshold_kb: 1024
>   local_read_size.abort_threshold_kb: 1024
>   row_index_size:
>     warn_threshold_kb: 1024
>     abort_threshold_kb: 1024
> # nested relative to the top level
> track_warnings.coordinator_read_size.abort_threshold_kb: 42
>
> For the “Settings” vtable, a new Loader interface was added to get all the 
> properties, and Properties.flatten would turn every property into a “flatten” 
> version (isScalar (isPrimitive or not hasSubProperties) or isCollection).  
> This doesn’t solve 100% of the issues that vtable has (types such as Duration 
> would need additional translation as they are Scalar but need a translation 
> from String -> Duration), and doesn’t solve the fact the table currently uses 
> “_”.
>
>> On Nov 29, 2021, at 10:11 AM, bened...@apache.org wrote:
>>
>> I meant to imply we should improve our UDT usability to support this kind of 
>> querying, essentially – but that if we support a simple text->property setup 
>> we might want to offer LIKE support so we can search them (via simple 
>> filtering, not any index) – which is actually pretty easy to provide.
>>
>> I think we should aim to provide users all the facilities they need to 
>> interact with config via vtables. If the user requires external tooling, it 
>> suggests a weakness in CQL that we should address, and maybe help the user 
>> in other scenario too…
>>
>> From: 

Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread David Capwell
> What should our default example cassandra.yaml file use (flat or nested)?  
> Currently default shows nested

Was told this statement was confusing, so trying to clarify.  At the moment we 
do not allow a nested config to be expressed in any way outside of nesting it 
(excluding YAML’s ability to inline objects), so if we did allow flat config 
representation of nested configs, then this would be a brand new feature; we 
currently show the nested structure in cassandra.yaml

> On Nov 29, 2021, at 11:58 AM, David Capwell  
> wrote:
> 
> Thanks everyone for the comments, I hope below is a good summary of all the 
> talking points?
> 
> We already use nested configs (networking, seed provider, commit log/hint 
> compression, back pressure, etc.)
> Flat configs are easier for grep, but can be solved with grep -A/-B and/or yq
> It would be possible to support flat versions of our configs in 
> cassandra.yaml (in addition to the nested versions)
> "Settings" vtable currently uses the "_" separator (example of 
> encryption/audit log).  Switching to "." Would be a change in behavior which 
> may impact some users
> "." Separator for nested configs are common in other systems (yq, elastic 
> search, etc.)
> "Structured / nested config is easier for human eyes to read"... "Flat config 
> is harder for human eyes but easy for simple scripts"
> For learning what configs are enabled, cassandra.yaml isn't the best 
> interface as it may not reflect the actual configs; we can better expose this 
> in CQL and/or Sidecar
> What should our default example cassandra.yaml file use (flat or nested)?  
> Currently default shows nested
> When projecting the Config into CQL, we may want to consider UDTs to 
> represent the complex types
> Current limitations in CQL make nested structures hard to work with; it may 
> be worthwhile to expand CQL support for nested structures.
> 
> I also took a quick stab at enhancing our cassandra.yaml logic to: 1) be 
> reusable outside of yaml parsing, 2) support setters (we currently do, but 
> setters must be snake case… I fixed that), 3) support both nested and 
> flat, 4) support ignoring fields in a consistent way (Settings vtable 
> will include things SnakeYAML won’t and vice versa).
> 
> https://github.com/apache/cassandra/pull/1335 
> This PR is NOT a final ready-to-merge thing, but instead a POC to show how 
> we can solve a lot of the core problems in a consistent and reusable manner.
> 
> The following cassandra.yaml was used to show both worlds would work fine in 
> the config (and complement each other):
> 
> track_warnings:
>   enabled: true
>   # nested relative to the local level (TrackWarnings)
>   coordinator_read_size.warn_threshold_kb: 1024
>   local_read_size.abort_threshold_kb: 1024
>   row_index_size:
>     warn_threshold_kb: 1024
>     abort_threshold_kb: 1024
> # nested relative to the top level
> track_warnings.coordinator_read_size.abort_threshold_kb: 42
> 
> For the “Settings” vtable, a new Loader interface was added to get all the 
> properties, and Properties.flatten would turn every property into a “flatten” 
> version (isScalar (isPrimitive or not hasSubProperties) or isCollection).  
> This doesn’t solve 100% of the issues that vtable has (types such as Duration 
> would need additional translation as they are Scalar but need a translation 
> from String -> Duration), and doesn’t solve the fact the table currently uses 
> “_”.
> 
>> On Nov 29, 2021, at 10:11 AM, bened...@apache.org wrote:
>> 
>> I meant to imply we should improve our UDT usability to support this kind of 
>> querying, essentially – but that if we support a simple text->property setup 
>> we might want to offer LIKE support so we can search them (via simple 
>> filtering, not any index) – which is actually pretty easy to provide.
>> 
>> I think we should aim to provide users all the facilities they need to 
>> interact with config via vtables. If the user requires external tooling, it 
>> suggests a weakness in CQL that we should address, and maybe help the user 
>> in other scenario too…
>> 
>> From: Joseph Lynch 
>> Date: Monday, 29 November 2021 at 17:32
>> To: dev@cassandra.apache.org 
>> Subject: Re: [DISCUSS] Nested YAML configs for new features
>> On Mon, Nov 29, 2021 at 11:51 AM bened...@apache.org
>>  wrote:
>>> 
>>> Maybe we can make our query language more expressive 
>>> 
>>> We might anyway want to introduce e.g. a LIKE filtering option to 
>>> find/discover flattened config parameters?
>> 
>> This sounds more complicated than just having the settings virtual
>> table return text (dot encoded) -> text (json) and probably not even
>> that much more useful. A full table scan on the settings table could
>> return all top level keys (strings before the first dot) and if we
>> just return a valid json string then users can bring their own
>> querying capabilities via jq [1], or one line of code in almost any
>> programming language 

Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread David Capwell
Thanks everyone for the comments, I hope below is a good summary of all the 
talking points?

- We already use nested configs (networking, seed provider, commit log/hint 
  compression, back pressure, etc.)
- Flat configs are easier for grep, but this can be solved with grep -A/-B 
  and/or yq
- It would be possible to support flat versions of our configs in 
  cassandra.yaml (in addition to the nested versions)
- "Settings" vtable currently uses the "_" separator (example of 
  encryption/audit log).  Switching to "." would be a change in behavior which 
  may impact some users
- "." separator for nested configs is common in other systems (yq, 
  Elasticsearch, etc.)
- "Structured / nested config is easier for human eyes to read"... "Flat config 
  is harder for human eyes but easy for simple scripts"
- For learning what configs are enabled, cassandra.yaml isn't the best 
  interface as it may not reflect the actual configs; we can better expose this 
  in CQL and/or Sidecar
- What should our default example cassandra.yaml file use (flat or nested)?  
  Currently default shows nested
- When projecting the Config into CQL, we may want to consider UDTs to 
  represent the complex types
- Current limitations in CQL make nested structures hard to work with; it may 
  be worthwhile to expand CQL support for nested structures.

I also took a quick stab at enhancing our cassandra.yaml logic to: 1) be 
reusable outside of yaml parsing, 2) support setters (we currently do, but 
setters must be snake case… I fixed that), 3) support both nested and 
flat, 4) support ignoring fields in a consistent way (Settings vtable 
will include things SnakeYAML won’t and vice versa).

https://github.com/apache/cassandra/pull/1335 
This PR is NOT a final ready-to-merge thing, but instead a POC to show how we 
can solve a lot of the core problems in a consistent and reusable manner.

The following cassandra.yaml was used to show both worlds would work fine in 
the config (and complement each other):

track_warnings:
  enabled: true
  # nested relative to the local level (TrackWarnings)
  coordinator_read_size.warn_threshold_kb: 1024
  local_read_size.abort_threshold_kb: 1024
  row_index_size:
    warn_threshold_kb: 1024
    abort_threshold_kb: 1024
# nested relative to the top level
track_warnings.coordinator_read_size.abort_threshold_kb: 42
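
As a rough illustration of how dotted and nested keys could be folded into one structure, here is a small Python sketch. It is not the code from the PR (which is Java); expand_dotted and _merge are hypothetical names, and treating a setting that is defined twice as an error is an assumption, since how the POC resolves such conflicts is not stated here.

```python
def _merge(node: dict, key: str, value) -> None:
    """Store value under key, merging mappings and rejecting duplicates."""
    existing = node.get(key)
    if isinstance(existing, dict) and isinstance(value, dict):
        for child_key, child_value in value.items():
            _merge(existing, child_key, child_value)
    elif key in node:
        raise ValueError(f"setting defined twice: {key}")
    else:
        node[key] = value


def expand_dotted(config: dict) -> dict:
    """Return a copy in which keys like 'a.b.c' become nested mappings."""
    result: dict = {}
    for key, value in config.items():
        if isinstance(value, dict):
            value = expand_dotted(value)  # dotted keys relative to this level
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"{part} is both a value and a section")
        _merge(node, leaf, value)
    return result


# The cassandra.yaml example above, written as the dict a YAML parser would give.
loaded = {
    "track_warnings": {
        "enabled": True,
        "coordinator_read_size.warn_threshold_kb": 1024,
        "local_read_size.abort_threshold_kb": 1024,
        "row_index_size": {"warn_threshold_kb": 1024, "abort_threshold_kb": 1024},
    },
    "track_warnings.coordinator_read_size.abort_threshold_kb": 42,
}
normalized = expand_dotted(loaded)
```

After normalization, coordinator_read_size holds both warn_threshold_kb (set relative to the local level) and abort_threshold_kb (set relative to the top level).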

For the “Settings” vtable, a new Loader interface was added to get all the 
properties, and Properties.flatten would turn every property into a “flattened” 
version (isScalar (isPrimitive or not hasSubProperties) or isCollection).  This 
doesn’t solve 100% of the issues that vtable has (types such as Duration are 
Scalar but would still need a translation from String -> Duration), and 
doesn’t solve the fact the table currently uses “_”.
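
The flattening step can be sketched in Python as follows. This is only an approximation of the idea, not Properties.flatten itself: the flatten function below is a hypothetical helper, and it assumes a property is emitted as a leaf when it is a scalar or a collection, and recursed into when it has sub-properties.

```python
def flatten(config: dict, separator: str = ".", prefix: str = "") -> dict:
    """Map nested settings to flat names; scalars and lists are leaves."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Has sub-properties: descend and prefix the child names.
            flat.update(flatten(value, separator, name))
        else:
            # Scalar, collection, or empty mapping: emit as-is.
            flat[name] = value
    return flat


nested = {
    "track_warnings": {
        "enabled": True,
        "local_read_size": {"warn_threshold_kb": 1024},
    },
    "disabled_datacenters": ["DC1", "DC2"],
}
dotted = flatten(nested)
underscored = flatten(nested, separator="_")
```

Passing separator="_" yields names such as track_warnings_local_read_size_warn_threshold_kb, which shows why the choice between "_" and "." is visible to anyone querying the table.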

> On Nov 29, 2021, at 10:11 AM, bened...@apache.org wrote:
> 
> I meant to imply we should improve our UDT usability to support this kind of 
> querying, essentially – but that if we support a simple text->property setup 
> we might want to offer LIKE support so we can search them (via simple 
> filtering, not any index) – which is actually pretty easy to provide.
> 
> I think we should aim to provide users all the facilities they need to 
> interact with config via vtables. If the user requires external tooling, it 
> suggests a weakness in CQL that we should address, and maybe help the user in 
> other scenario too…
> 
> From: Joseph Lynch 
> Date: Monday, 29 November 2021 at 17:32
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] Nested YAML configs for new features
> On Mon, Nov 29, 2021 at 11:51 AM bened...@apache.org
>  wrote:
>> 
>> Maybe we can make our query language more expressive 
>> 
>> We might anyway want to introduce e.g. a LIKE filtering option to 
>> find/discover flattened config parameters?
> 
> This sounds more complicated than just having the settings virtual
> table return text (dot encoded) -> text (json) and probably not even
> that much more useful. A full table scan on the settings table could
> return all top level keys (strings before the first dot) and if we
> just return a valid json string then users can bring their own
> querying capabilities via jq [1], or one line of code in almost any
> programming language (especially python, perl, etc ...).
> 
> Alternatively if we want to modify the grammar it seems supporting
> structured data querying on text fields would be preferable
> to LIKE since you could get what you want without a grammar change and
> if we could generalize to any text column it would be amazingly useful
> elsewhere to users. For example, we could emulate jq's query syntax in
> the select which is, imo, best-in-class for quickly querying into
> nested structures. Assuming a key (text) -> value (json) schema:
> 
> 'a' -> "{'b': [{'c': {'d': 4}}]}",
> 
> SELECT json(value).b.0.c.d FROM settings WHERE key = 'a';
> 
> To have exactly jq syntax (but harder 

Cassandra project biweekly status update 2021-11-29

2021-11-29 Thread Joshua McKenzie
Sorry for the miss last week; it being a holiday in the US meant I was on
the road managing tiny humans and a puppy with my partner and I failed to
hand off update email responsibility to someone else. Which means we have
three weeks to cover!

[New contributor Getting Started]
As a new contributor we recommend starting in one of two places: Failing
tests, or starter tickets we label "lhf" (low hanging fruit).

Failing tests:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496=2252
Unassigned starter tickets:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2162=2160

Holding steady at 22 failing-test JIRA tickets over the past few weeks. More
on CI trends with the new Butler tool below.

For unassigned lhf, we're up from 11 to 12 on 4.0.2 (our next minor
release) and holding steady at 14 on 4.1.0 (our next major release). We
have a selection of people who have volunteered to be mentors; feel free to
reach out to any of us if you have any questions on where to get started or
just self-select from either of these above lists (list of mentor names can
easily be seen here for now:
https://issues.apache.org/jira/browse/INFRA-22556).

[Dev list conversations]
https://lists.apache.org/list?dev@cassandra.apache.org:lte=3w:
In the past 3 weeks we've had some pretty interesting conversations surface
and evolve.

First off and quite exciting: Butler was introduced by a collection of
contributors which helps us both see the current state of our CI across
multiple branches as well as look more deeply into the failure rates of
individual tests. The url to access this system is here:
https://butler.cassandra.apache.org/#/

CEP-3 for Guardrails discussion closed up, was voted on, and passed the
vote.
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-3%3A+Guardrails,
JIRA ticket: https://issues.apache.org/jira/browse/CASSANDRA-17146. Really
excited to see how this work evolves.

CEP-17 for the SSTable format API was also voted on and the vote passed.
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-17%3A+SSTable+format+API,
JIRA ticket: https://issues.apache.org/jira/browse/CASSANDRA-17056. Curious
to see what various approaches to SSTables and persistence this eventually
enables.

We had a good discussion on mentoring and how to introduce new members to
the project - that thread can be found here:
https://lists.apache.org/thread/xr6b14w0b5sg00s79j4xlq6hls6fqs49. A lot of
folks came forward volunteering to mentor new contributors to the project
and we're working on making that list visible on the site, project, etc.
Expect to see more on this front as the infra changes take place.

A lively discussion is still going on around encryption at rest of SSTables
within Cassandra:
https://lists.apache.org/thread/6z1hkygqj48241sbgq8ogovkx5w7vv8p. Several
people raised the question of whether this ought to become its own
CEP since it's a complex topic with a lot of tradeoffs.

Benjamin hit the list talking about some of the pain points he and some
other contributors have seen regarding inserting into collections on
contended partitions (
https://lists.apache.org/thread/f3dl7rfc2kv9f5r9pxzyz6zojsss81b9). There's
some work that's been put together by Benedict back in early 2020 that
makes a substantial improvement in this area (warning: understatement of
the week) and the question's up as to whether or not we want to include
this in 4.0 or in 4.1. There's no clear right or wrong here; would
definitely like to see more opinions on this relatively new thread.

We have an ongoing discussion about YAML config, nesting vs. flat, how
other projects do it, you name it. Another one of those "no obvious right
answer" kind of situations; if you've been holding off on the topic thus
far and have an opinion informed by experience, please chime in!
https://lists.apache.org/thread/3l6f8lypoj5pj2bl3m9x4x75rsjchv4q

And last but not least (non-inclusive; busy three weeks): Benjamin put out
the call for a "get new contributors involved" advent calendar of tickets
in December (
https://lists.apache.org/thread/3l6f8lypoj5pj2bl3m9x4x75rsjchv4q). While
our LHF pool is pretty loosely curated at this point, there's a lot of
potential here so if you can lend a hand please reach out to Benjamin.

[CI Trends]
New and flashy in this issue: what's going on with our CI? Well, a lot:
https://butler.cassandra.apache.org/#/

First off, we actually _have_ butler now, which not only shows us trends
for each branch but also allows you to see detailed results of which tests
are consistently failing, which are flaking, etc. While trunk is in a somewhat
better state now than in early November, 4.0 saw a pretty drastic jump to
45+ failures about a week ago. Builds 294 and 295 look to have
something systemic going on, with a variety of environmental errors
(I/O, file rename failures, etc.):
https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-4.0/cassandra-4.0

I started another email 

Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread bened...@apache.org
I meant to imply we should improve our UDT usability to support this kind of 
querying, essentially – but that if we support a simple text->property setup we 
might want to offer LIKE support so we can search them (via simple filtering, 
not any index) – which is actually pretty easy to provide.

I think we should aim to provide users all the facilities they need to interact 
with config via vtables. If the user requires external tooling, it suggests a 
weakness in CQL that we should address, and maybe help the user in other 
scenario too…

From: Joseph Lynch 
Date: Monday, 29 November 2021 at 17:32
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Nested YAML configs for new features
On Mon, Nov 29, 2021 at 11:51 AM bened...@apache.org
 wrote:
>
> Maybe we can make our query language more expressive 
>
> We might anyway want to introduce e.g. a LIKE filtering option to 
> find/discover flattened config parameters?

This sounds more complicated than just having the settings virtual
table return text (dot encoded) -> text (json) and probably not even
that much more useful. A full table scan on the settings table could
return all top level keys (strings before the first dot) and if we
just return a valid json string then users can bring their own
querying capabilities via jq [1], or one line of code in almost any
programming language (especially python, perl, etc ...).

Alternatively if we want to modify the grammar it seems supporting
structured data querying on text fields would be preferable
to LIKE since you could get what you want without a grammar change and
if we could generalize to any text column it would be amazingly useful
elsewhere to users. For example, we could emulate jq's query syntax in
the select which is, imo, best-in-class for quickly querying into
nested structures. Assuming a key (text) -> value (json) schema:

'a' -> "{'b': [{'c': {'d': 4}}]}",

SELECT json(value).b.0.c.d FROM settings WHERE key = 'a';

To have exactly jq syntax (but harder to parse) it would be:

SELECT json(value).b[0].c.d FROM settings WHERE key = 'a';

Since we're not indexing the structured data in any way, filtering
before selection probably doesn't give us much performance improvement
as we'd still have to parse the whole text field in most cases.

-Joey

[1] https://stedolan.github.io/jq/



Re: [RESULT] [VOTE] CEP-10: Cluster and Code Simulations

2021-11-29 Thread bened...@apache.org
FYI, CASSANDRA-17008 (the main element of CEP-10) is ready to merge, in case 
anybody still plans to take a look. Otherwise it will land in a day or two.

From: bened...@apache.org 
Date: Friday, 30 July 2021 at 14:27
To: dev@cassandra.apache.org 
Subject: [RESULT] [VOTE] CEP-10: Cluster and Code Simulations
The vote passes, with 6 +1s and no -1s.


From: Blake Eggleston 
Date: Wednesday, 28 July 2021 at 20:53
To: dev@cassandra.apache.org 
Subject: Re: [VOTE] CEP-10: Cluster and Code Simulations
+1

> On Jul 27, 2021, at 9:21 PM, Scott Andreas  wrote:
>
> +1 nb
>
> 
> From: Sam Tunnicliffe 
> Sent: Tuesday, July 27, 2021 12:54 AM
> To: dev@cassandra.apache.org
> Subject: Re: [VOTE] CEP-10: Cluster and Code Simulations
>
> +1
>
>> On 26 Jul 2021, at 11:51, bened...@apache.org wrote:
>>
>> Proposing the CEP-10 (Cluster and Code Simulations) for adoption
>>
>> Proposal: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-10%3A+Cluster+and+Code+Simulations
>> Discussion: 
>> https://lists.apache.org/thread.html/rc908165994b15a29ef9c17b0b1205b2abc5bd38228b5a0117e442104%40%3Cdev.cassandra.apache.org%3E
>>
>> The vote will be open for 72 hours.
>> Votes by PMC members are considered binding. A
>> vote passes if there are at least three binding +1s and no binding vetoes.


Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread Joseph Lynch
On Mon, Nov 29, 2021 at 11:51 AM bened...@apache.org
 wrote:
>
> Maybe we can make our query language more expressive 
>
> We might anyway want to introduce e.g. a LIKE filtering option to 
> find/discover flattened config parameters?

This sounds more complicated than just having the settings virtual
table return text (dot encoded) -> text (json) and probably not even
that much more useful. A full table scan on the settings table could
return all top level keys (strings before the first dot) and if we
just return a valid json string then users can bring their own
querying capabilities via jq [1], or one line of code in almost any
programming language (especially python, perl, etc ...).
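
As a rough sketch of that "scan the table, process client-side" idea (the setting names and the top_level_keys helper below are made up for illustration; this is not an existing Cassandra API), extracting the top-level keys from dot-encoded names really is about one line of Python:

```python
def top_level_keys(setting_names):
    """Return the distinct prefixes before the first dot, sorted."""
    return sorted({name.split(".", 1)[0] for name in setting_names})


# Hypothetical dot-encoded names, as a scan of the settings table might return.
names = [
    "track_warnings.enabled",
    "track_warnings.local_read_size.warn_threshold_kb",
    "hinted_handoff.enabled",
    "cluster_name",
]
keys = top_level_keys(names)
```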

Alternatively if we want to modify the grammar it seems supporting
structured data querying on text fields would probably be preferable
to LIKE since you could get what you want without a grammar change and
if we could generalize to any text column it would be amazingly useful
elsewhere to users. For example, we could emulate jq's query syntax in
the select which is, imo, best-in-class for quickly querying into
nested structures. Assuming a key (text) -> value (json) schema:

'a' -> '{"b": [{"c": {"d": 4}}]}',

SELECT json(value).b.0.c.d FROM settings WHERE key = 'a';

To have exactly jq syntax (but harder to parse) it would be:

SELECT json(value).b[0].c.d FROM settings WHERE key = 'a';

Since we're not indexing the structured data in any way, filtering
before selection probably doesn't give us much performance improvement
as we'd still have to parse the whole text field in most cases.
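To make the path semantics concrete, here is a rough client-side sketch of what such a selection would compute, assuming the value column holds valid JSON (double-quoted keys). The `resolve` helper is hypothetical and not part of any Cassandra or jq API. It accepts both the dotted index form (`.b.0.c.d`) and the bracketed form (`.b[0].c.d`).

```python
import json
import re

# One path step: a field name (.b), a dotted index (.0) or a bracketed index ([0]).
_STEP = re.compile(r"\.([A-Za-z_]\w*)|\.(\d+)|\[(\d+)\]")


def resolve(json_text, path):
    """Parse json_text, then walk it one path step at a time."""
    value = json.loads(json_text)
    pos = 0
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError(f"bad path at offset {pos}: {path!r}")
        name, dotted_index, bracket_index = match.groups()
        if name is not None:
            value = value[name]
        else:
            index = dotted_index if dotted_index is not None else bracket_index
            value = value[int(index)]
        pos = match.end()
    return value


row_value = '{"b": [{"c": {"d": 4}}]}'
dotted = resolve(row_value, ".b.0.c.d")
bracketed = resolve(row_value, ".b[0].c.d")
```

Note that the whole text field is parsed before the first step is applied, which matches the point above that filtering before selection saves little without an index.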

-Joey

[1] https://stedolan.github.io/jq/




Re: [DISCUSS] Throughput issues when inserting on contended partitions

2021-11-29 Thread Brandon Williams
I think it makes a lot of sense to fix these in 4.0, they have been
lingering issues long enough. +1

On Mon, Nov 29, 2021 at 10:59 AM Benjamin Lerer  wrote:
>
> Hi everybody,
>
> We have seen some serious throughput issues when inserting data with
> collections on contended partitions (CASSANDRA-17163 and CASSANDRA-15464).
> Benedict has created some patches to address those issues and improve the
> insertion throughput and memory consumption (CASSANDRA-15510 and
> CASSANDRA-15511).
> Those patches are significant changes and are currently marked as
> improvements.
>
> I am wondering whether we should consider including those changes in 4.0,
> given that they fix serious existing issues, and would like to hear your
> opinions about it.




Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread Benjamin Lerer
>
> We might anyway want to introduce e.g. a LIKE filtering option to
> find/discover flattened config parameters?


+100

On Mon, 29 Nov 2021 at 17:51, bened...@apache.org wrote:

> Maybe we can make our query language more expressive 
>
> We might anyway want to introduce e.g. a LIKE filtering option to
> find/discover flattened config parameters?
>
> From: Benjamin Lerer 
> Date: Monday, 29 November 2021 at 16:41
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] Nested YAML configs for new features
> >
> > I don’t think it’s necessarily a requirement that we use the flattened
> > version in vtables. At the very least we can make use of sets, lists,
> etc.
> > But we can probably also use UDTs if this improves clarity.
>
>
> In my opinion, part of the issue is on the query side. How do we easily
> select a nested set or a specific set? UDTs are not great for this type of
> query. For collections we can use CONTAINS and element or range selection,
> but insertion might be a problem.
>
> On Mon, 29 Nov 2021 at 17:23, Bowen Song wrote:
>
> > In ElasticSearch, the default is a flattened format with almost all
> > lines commented out. See
> >
> >
> https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/elasticsearch.yml
> >
> > I guess they chose to do that because users can uncomment individual
> > lines to make changes. In a structured config file, the user has to
> > uncomment every line containing a parent key to get it to work. For
> > example, if someone wants to set the config keyABB to a non-default
> > value, they will have to correctly uncomment 3 lines: keyA, keyAB and
> > keyABB, which is annoying and makes it easy to make a mistake. If either
> > of the first two keys is not uncommented, the YAML file will still be
> > valid but a config like keyX.keyAB.keyABB might just get silently
> > ignored by the database.
> >
> > keyX:
> >   keyY:
> >     keyZ: value
> > # keyA:
> > #   keyAA:
> > #     keyAAA: value
> > #   keyAB:
> > #     keyABA: value
> > #     keyABB: value
> >
> > On 29/11/2021 15:54, Benjamin Lerer wrote:
> > > I do not think that supporting both options is an issue. The settings
> > > virtual table would have to use the flattened version.
> > > If we support both formats, the question would be: what should be the
> one
> > > used by default in the configuration file?
> > >
> > > On Fri, 26 Nov 2021 at 15:40, bened...@apache.org wrote:
> > >
> > >> This is the approach I favour for config files also. We had a much
> less
> > >> engaged discussion on this topic only a few months ago, so glad to see
> > more
> > >> people getting involved now.
> > >>
> > >> I would however personally prefer to see the configuration file slowly
> > >> deprecated (if perhaps never retired), in favour of virtual tables, so
> > that
> > >> operators may easily set configurations for the entire cluster.
> Ideally
> > it
> > >> would be possible to specify configuration per cluster, per DC and per
> > >> node, with the most specific configuration applying. I would like to
> see
> > a
> > >> similar hierarchy for Keyspace, Table and Per-Query options. Ideally
> > only
> > >> the barest minimum number of options would be necessary to supply in a
> > >> config file, and only on first launch – seed nodes, for instance.
> > >>
> > >> So whatever design we employ here, we should IMO be aiming for it to
> be
> > >> compatible with a CQL representation also.
> > >>
> > >>
> > >> From: Bowen Song
> > >> Date: Wednesday, 24 November 2021 at 18:15
> > >> To:dev@cassandra.apache.org  
> > >> Subject: Re: [DISCUSS] Nested YAML configs for new features
> > >> Since you mentioned ElasticSearch, I'm actually pretty happy with
> their
> > >> config file syntax. It allows the user to completely flatten out the
> > >> entire config file. To give people who aren't familiar with
> ElasticSearch
> > >> an idea, here is a config file we use:
> > >>
> > >>  cluster.name: foobar
> > >>
> > >>  node.remote_cluster_client: false
> > >>  node.name: "foo.example.com"
> > >>  node.master: true
> > >>  node.data: true
> > >>  node.ingest: true
> > >>  node.ml: false
> > >>
> > >>  xpack.ml.enabled: false
> > >>  xpack.security.enabled: false
> > >>  xpack.security.audit.enabled: false
> > >>  xpack.watcher.enabled: false
> > >>
> > >>  action.auto_create_index: "+.,-*"
> > >>
> > >>  network.host: _global_
> > >>
> > >>  discovery.zen.hosts_provider: file
> > >>  discovery.zen.minimum_master_nodes: 2
> > >>
> > >>  http.publish_host: "foo.example.com"
> > >>  http.publish_port: 443
> > >>  http.bind_host: 127.0.0.1
> > >>
> > >>  transport.publish_host: "bar.example.com"
> > >>  transport.bind_host: 0.0.0.0
> > >>
> > >>  indices.fielddata.cache.size: 1GB
> > >>  indices.breaker.total.use_real_memory: false
> > >>
> > >>  path.logs: 

[DISCUSS] Throughput issues when inserting on contended partitions

2021-11-29 Thread Benjamin Lerer
Hi everybody,

We have seen some serious throughput issues when inserting data with
collections on contended partitions (CASSANDRA-17163 and CASSANDRA-15464).
Benedict has created some patches to address those issues and improve the
insertion throughput and memory consumption (CASSANDRA-15510 and
CASSANDRA-15511).
Those patches are significant changes and are currently marked as
improvements.

I am wondering whether we should consider including those changes in 4.0,
given that they fix serious existing issues, and would like to hear your
opinions about it.


Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread bened...@apache.org
Maybe we can make our query language more expressive 

We might anyway want to introduce e.g. a LIKE filtering option to find/discover 
flattened config parameters?

From: Benjamin Lerer 
Date: Monday, 29 November 2021 at 16:41
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Nested YAML configs for new features
>
> I don’t think it’s necessarily a requirement that we use the flattened
> version in vtables. At the very least we can make use of sets, lists, etc.
> But we can probably also use UDTs if this improves clarity.


In my opinion, part of the issue is on the query side. How do we easily
select a nested set or a specific set? UDTs are not great for this type of
query. For collections we can use CONTAINS and element or range selection,
but insertion might be a problem.

On Mon, 29 Nov 2021 at 17:23, Bowen Song wrote:

> In ElasticSearch, the default is a flattened format with almost all
> lines commented out. See
>
> https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/elasticsearch.yml
>
> I guess they chose to do that because users can uncomment individual
> lines to make changes. In a structured config file, the user has to
> uncomment every line containing a parent key to get it to work. For
> example, if someone wants to set the config keyABB to a non-default
> value, they will have to correctly uncomment 3 lines: keyA, keyAB and
> keyABB, which is annoying and makes it easy to make a mistake. If either
> of the first two keys is not uncommented, the YAML file will still be
> valid but a config like keyX.keyAB.keyABB might just get silently
> ignored by the database.
>
> keyX:
>   keyY:
>     keyZ: value
> # keyA:
> #   keyAA:
> #     keyAAA: value
> #   keyAB:
> #     keyABA: value
> #     keyABB: value
>
> On 29/11/2021 15:54, Benjamin Lerer wrote:
> > I do not think that supporting both options is an issue. The settings
> > virtual table would have to use the flattened version.
> > If we support both formats, the question would be: what should be the one
> > used by default in the configuration file?
> >
> > On Fri, 26 Nov 2021 at 15:40, bened...@apache.org wrote:
> >
> >> This is the approach I favour for config files also. We had a much less
> >> engaged discussion on this topic only a few months ago, so glad to see
> more
> >> people getting involved now.
> >>
> >> I would however personally prefer to see the configuration file slowly
> >> deprecated (if perhaps never retired), in favour of virtual tables, so
> that
> >> operators may easily set configurations for the entire cluster. Ideally
> it
> >> would be possible to specify configuration per cluster, per DC and per
> >> node, with the most specific configuration applying. I would like to see
> a
> >> similar hierarchy for Keyspace, Table and Per-Query options. Ideally
> only
> >> the barest minimum number of options would be necessary to supply in a
> >> config file, and only on first launch – seed nodes, for instance.
> >>
> >> So whatever design we employ here, we should IMO be aiming for it to be
> >> compatible with a CQL representation also.
> >>
> >>
> >> From: Bowen Song
> >> Date: Wednesday, 24 November 2021 at 18:15
> >> To:dev@cassandra.apache.org  
> >> Subject: Re: [DISCUSS] Nested YAML configs for new features
> >> Since you mentioned ElasticSearch, I'm actually pretty happy with their
> >> config file syntax. It allows the user to completely flatten out the
> >> entire config file. To give people who aren't familiar with ElasticSearch
> >> an idea, here is a config file we use:
> >>
> >>  cluster.name: foobar
> >>
> >>  node.remote_cluster_client: false
> >>  node.name: "foo.example.com"
> >>  node.master: true
> >>  node.data: true
> >>  node.ingest: true
> >>  node.ml: false
> >>
> >>  xpack.ml.enabled: false
> >>  xpack.security.enabled: false
> >>  xpack.security.audit.enabled: false
> >>  xpack.watcher.enabled: false
> >>
> >>  action.auto_create_index: "+.,-*"
> >>
> >>  network.host: _global_
> >>
> >>  discovery.zen.hosts_provider: file
> >>  discovery.zen.minimum_master_nodes: 2
> >>
> >>  http.publish_host: "foo.example.com"
> >>  http.publish_port: 443
> >>  http.bind_host: 127.0.0.1
> >>
> >>  transport.publish_host: "bar.example.com"
> >>  transport.bind_host: 0.0.0.0
> >>
> >>  indices.fielddata.cache.size: 1GB
> >>  indices.breaker.total.use_real_memory: false
> >>
> >>  path.logs: /var/log/elasticsearch
> >>  path.data: /var/lib/elasticsearch/data
> >>
> >> As you can see we can use the flat (grep-able) syntax for everything.
> >> This is also human readable because we can group options together by
> >> inserting empty lines between them.
> >>
> >> The equivalent of the above in a structured syntax will be:
> >>
> >>  cluster:
> >>   name: foobar
> >>
> >>  node:
> >>   remote_cluster_client: false
> 

Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread Benjamin Lerer
>
> I don’t think it’s necessarily a requirement that we use the flattened
> version in vtables. At the very least we can make use of sets, lists, etc.
> But we can probably also use UDTs if this improves clarity.


In my opinion, part of the issue is on the query side. How do we easily
select a nested set or a specific set? UDTs are not great for this type of
query. For collections we can use CONTAINS and element or range selection,
but insertion might be a problem.

On Mon, 29 Nov 2021 at 17:23, Bowen Song wrote:

> In ElasticSearch, the default is a flattened format with almost all
> lines commented out. See
>
> https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/elasticsearch.yml
>
> I guess they chose to do that because users can uncomment individual
> lines to make changes. In a structured config file, the user has to
> uncomment every line containing a parent key to get it to work. For
> example, if someone wants to set the config keyABB to a non-default
> value, they will have to correctly uncomment 3 lines: keyA, keyAB and
> keyABB, which is annoying and makes it easy to make a mistake. If either
> of the first two keys is not uncommented, the YAML file will still be
> valid but a config like keyX.keyAB.keyABB might just get silently
> ignored by the database.
>
> keyX:
>   keyY:
>     keyZ: value
> # keyA:
> #   keyAA:
> #     keyAAA: value
> #   keyAB:
> #     keyABA: value
> #     keyABB: value
>
> On 29/11/2021 15:54, Benjamin Lerer wrote:
> > I do not think that supporting both options is an issue. The settings
> > virtual table would have to use the flattened version.
> > If we support both formats, the question would be: what should be the one
> > used by default in the configuration file?
> >
> > On Fri, 26 Nov 2021 at 15:40, bened...@apache.org wrote:
> >
> >> This is the approach I favour for config files also. We had a much less
> >> engaged discussion on this topic only a few months ago, so glad to see
> more
> >> people getting involved now.
> >>
> >> I would however personally prefer to see the configuration file slowly
> >> deprecated (if perhaps never retired), in favour of virtual tables, so
> that
> >> operators may easily set configurations for the entire cluster. Ideally
> it
> >> would be possible to specify configuration per cluster, per DC and per
> >> node, with the most specific configuration applying. I would like to see
> a
> >> similar hierarchy for Keyspace, Table and Per-Query options. Ideally
> only
> >> the barest minimum number of options would be necessary to supply in a
> >> config file, and only on first launch – seed nodes, for instance.
> >>
> >> So whatever design we employ here, we should IMO be aiming for it to be
> >> compatible with a CQL representation also.
> >>
> >>
> >> From: Bowen Song
> >> Date: Wednesday, 24 November 2021 at 18:15
> >> To:dev@cassandra.apache.org  
> >> Subject: Re: [DISCUSS] Nested YAML configs for new features
> >> Since you mentioned ElasticSearch, I'm actually pretty happy with their
> >> config file syntax. It allows the user to completely flatten out the
> >> entire config file. To give people who aren't familiar with ElasticSearch
> >> an idea, here is a config file we use:
> >>
> >>  cluster.name: foobar
> >>
> >>  node.remote_cluster_client: false
> >>  node.name: "foo.example.com"
> >>  node.master: true
> >>  node.data: true
> >>  node.ingest: true
> >>  node.ml: false
> >>
> >>  xpack.ml.enabled: false
> >>  xpack.security.enabled: false
> >>  xpack.security.audit.enabled: false
> >>  xpack.watcher.enabled: false
> >>
> >>  action.auto_create_index: "+.,-*"
> >>
> >>  network.host: _global_
> >>
> >>  discovery.zen.hosts_provider: file
> >>  discovery.zen.minimum_master_nodes: 2
> >>
> >>  http.publish_host: "foo.example.com"
> >>  http.publish_port: 443
> >>  http.bind_host: 127.0.0.1
> >>
> >>  transport.publish_host: "bar.example.com"
> >>  transport.bind_host: 0.0.0.0
> >>
> >>  indices.fielddata.cache.size: 1GB
> >>  indices.breaker.total.use_real_memory: false
> >>
> >>  path.logs: /var/log/elasticsearch
> >>  path.data: /var/lib/elasticsearch/data
> >>
> >> As you can see we can use the flat (grep-able) syntax for everything.
> >> This is also human readable because we can group options together by
> >> inserting empty lines between them.
> >>
> >> The equivalent of the above in a structured syntax will be:
> >>
> >>  cluster:
> >>   name: foobar
> >>
> >>  node:
> >>   remote_cluster_client: false
> >>   name: "foo.example.com"
> >>   master: true
> >>   data: true
> >>   ingest: true
> >>   ml: false
> >>
> >>  xpack:
> >>   ml:
> >>    enabled: false
> >>   security:
> >>    enabled: false
> >>    audit:
> >> 

Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread Bowen Song
In ElasticSearch, the default is a flattened format with almost all 
lines commented out. See 
https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/elasticsearch.yml


I guess they chose to do that because users can uncomment individual
lines to make changes. In a structured config file, the user has to
uncomment every line containing a parent key to get it to work. For
example, if someone wants to set the config keyABB to a non-default
value, they will have to correctly uncomment 3 lines: keyA, keyAB and
keyABB, which is annoying and makes it easy to make a mistake. If either
of the first two keys is not uncommented, the YAML file will still be
valid but a config like keyX.keyAB.keyABB might just get silently
ignored by the database.


   keyX:
     keyY:
       keyZ: value
   # keyA:
   #   keyAA:
   #     keyAAA: value
   #   keyAB:
   #     keyABA: value
   #     keyABB: value
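One way to turn that silent failure into a loud one is to flatten the parsed file and compare it against the set of keys the database knows about. The sketch below is only an illustration under stated assumptions: `parsed` is a hand-written stand-in for what a YAML parser would return if keyAB and keyABB were uncommented but keyA was not (so keyAB lands under keyX), and `KNOWN_KEYS` is a hypothetical registry of valid settings.

```python
def flatten(config, prefix=""):
    """Turn a nested dict into a flat dict with dot-separated keys."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Hypothetical registry of the settings the database understands.
KNOWN_KEYS = {
    "keyX.keyY.keyZ",
    "keyA.keyAA.keyAAA",
    "keyA.keyAB.keyABA",
    "keyA.keyAB.keyABB",
}

# Stand-in for the parser output when the parent key keyA stays commented out.
parsed = {"keyX": {"keyY": {"keyZ": "value"}, "keyAB": {"keyABB": "value"}}}

unknown = sorted(set(flatten(parsed)) - KNOWN_KEYS)
```

With this input `unknown` contains the misplaced `keyX.keyAB.keyABB`, which a loader could reject at startup rather than ignore.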

On 29/11/2021 15:54, Benjamin Lerer wrote:

I do not think that supporting both options is an issue. The settings
virtual table would have to use the flattened version.
If we support both formats, the question would be: what should be the one
used by default in the configuration file?

On Fri, 26 Nov 2021 at 15:40, bened...@apache.org wrote:


This is the approach I favour for config files also. We had a much less
engaged discussion on this topic only a few months ago, so glad to see more
people getting involved now.

I would however personally prefer to see the configuration file slowly
deprecated (if perhaps never retired), in favour of virtual tables, so that
operators may easily set configurations for the entire cluster. Ideally it
would be possible to specify configuration per cluster, per DC and per
node, with the most specific configuration applying. I would like to see a
similar hierarchy for Keyspace, Table and Per-Query options. Ideally only
the barest minimum number of options would be necessary to supply in a
config file, and only on first launch – seed nodes, for instance.

So whatever design we employ here, we should IMO be aiming for it to be
compatible with a CQL representation also.


From: Bowen Song
Date: Wednesday, 24 November 2021 at 18:15
To:dev@cassandra.apache.org  
Subject: Re: [DISCUSS] Nested YAML configs for new features
Since you mentioned ElasticSearch, I'm actually pretty happy with their
config file syntax. It allows the user to completely flatten out the
entire config file. To give people who aren't familiar with ElasticSearch
an idea, here is a config file we use:

 cluster.name: foobar

 node.remote_cluster_client: false
 node.name: "foo.example.com"
 node.master: true
 node.data: true
 node.ingest: true
 node.ml: false

 xpack.ml.enabled: false
 xpack.security.enabled: false
 xpack.security.audit.enabled: false
 xpack.watcher.enabled: false

 action.auto_create_index: "+.,-*"

 network.host: _global_

 discovery.zen.hosts_provider: file
 discovery.zen.minimum_master_nodes: 2

 http.publish_host: "foo.example.com"
 http.publish_port: 443
 http.bind_host: 127.0.0.1

 transport.publish_host: "bar.example.com"
 transport.bind_host: 0.0.0.0

 indices.fielddata.cache.size: 1GB
 indices.breaker.total.use_real_memory: false

 path.logs: /var/log/elasticsearch
 path.data: /var/lib/elasticsearch/data

As you can see we can use the flat (grep-able) syntax for everything.
This is also human readable because we can group options together by
inserting empty lines between them.

The equivalent of the above in a structured syntax will be:

 cluster:
  name: foobar

 node:
  remote_cluster_client: false
  name: "foo.example.com"
  master: true
  data: true
  ingest: true
  ml: false

 xpack:
  ml:
   enabled: false
  security:
   enabled: false
   audit:
    enabled: false
  watcher:
   enabled: false

 action:
  auto_create_index: "+.,-*"

 network:
  host: _global_

 discovery:
  zen:
   hosts_provider: file
   minimum_master_nodes: 2

 http:
  publish_host: "foo.example.com"
  publish_port: 443
  bind_host: 127.0.0.1

 transport:
  publish_host: "bar.example.com"
  bind_host: 0.0.0.0

 indices:
  fielddata:
   cache:
    size: 1GB
  breaker:
   total:
    use_real_memory: false

 path:
  logs: /var/log/elasticsearch
  data: /var/lib/elasticsearch/data

This may be easier to read for some people, but it is a total nightmare
for "grep" - so many keys have identical names, such as "enabled".

Also, for the virtual tables, it would be a lot easier to represent
individual values in a virtual table when the config is flat 

Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread bened...@apache.org
I don’t think it’s necessarily a requirement that we use the flattened version 
in vtables. At the very least we can make use of sets, lists, etc. But we can 
probably also use UDTs if this improves clarity.

From: Benjamin Lerer 
Date: Monday, 29 November 2021 at 15:54
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Nested YAML configs for new features
I do not think that supporting both options is an issue. The settings
virtual table would have to use the flattened version.
If we support both formats, the question would be: what should be the one
used by default in the configuration file?

On Fri, 26 Nov 2021 at 15:40, bened...@apache.org wrote:

> This is the approach I favour for config files also. We had a much less
> engaged discussion on this topic only a few months ago, so glad to see more
> people getting involved now.
>
> I would however personally prefer to see the configuration file slowly
> deprecated (if perhaps never retired), in favour of virtual tables, so that
> operators may easily set configurations for the entire cluster. Ideally it
> would be possible to specify configuration per cluster, per DC and per
> node, with the most specific configuration applying. I would like to see a
> similar hierarchy for Keyspace, Table and Per-Query options. Ideally only
> the barest minimum number of options would be necessary to supply in a
> config file, and only on first launch – seed nodes, for instance.
>
> So whatever design we employ here, we should IMO be aiming for it to be
> compatible with a CQL representation also.
>
>
> From: Bowen Song 
> Date: Wednesday, 24 November 2021 at 18:15
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] Nested YAML configs for new features
> Since you mentioned ElasticSearch, I'm actually pretty happy with their
> config file syntax. It allows the user to completely flatten out the
> entire config file. To give people who aren't familiar with ElasticSearch
> an idea, here is a config file we use:
>
> cluster.name: foobar
>
> node.remote_cluster_client: false
> node.name: "foo.example.com"
> node.master: true
> node.data: true
> node.ingest: true
> node.ml: false
>
> xpack.ml.enabled: false
> xpack.security.enabled: false
> xpack.security.audit.enabled: false
> xpack.watcher.enabled: false
>
> action.auto_create_index: "+.,-*"
>
> network.host: _global_
>
> discovery.zen.hosts_provider: file
> discovery.zen.minimum_master_nodes: 2
>
> http.publish_host: "foo.example.com"
> http.publish_port: 443
> http.bind_host: 127.0.0.1
>
> transport.publish_host: "bar.example.com"
> transport.bind_host: 0.0.0.0
>
> indices.fielddata.cache.size: 1GB
> indices.breaker.total.use_real_memory: false
>
> path.logs: /var/log/elasticsearch
> path.data: /var/lib/elasticsearch/data
>
> As you can see we can use the flat (grep-able) syntax for everything.
> This is also human readable because we can group options together by
> inserting empty lines between them.
>
> The equivalent of the above in a structured syntax will be:
>
> cluster:
>  name: foobar
>
> node:
>  remote_cluster_client: false
>  name: "foo.example.com"
>  master: true
>  data: true
>  ingest: true
>  ml: false
>
> xpack:
>  ml:
>   enabled: false
>  security:
>   enabled: false
>   audit:
>    enabled: false
>  watcher:
>   enabled: false
>
> action:
>  auto_create_index: "+.,-*"
>
> network:
>  host: _global_
>
> discovery:
>  zen:
>   hosts_provider: file
>   minimum_master_nodes: 2
>
> http:
>  publish_host: "foo.example.com"
>  publish_port: 443
>  bind_host: 127.0.0.1
>
> transport:
>  publish_host: "bar.example.com"
>  bind_host: 0.0.0.0
>
> indices:
>  fielddata:
>   cache:
>    size: 1GB
>  breaker:
>   total:
>    use_real_memory: false
>
> path:
>  logs: /var/log/elasticsearch
>  data: /var/lib/elasticsearch/data
>
> This may be easier to read for some people, but it is a total nightmare
> for "grep" - so many keys have identical names, such as "enabled".
>
> Also, for the virtual tables, it would be a lot easier to represent
> individual values in a virtual table when the config is flat and keys
> are unique. The virtual tables would need to either support the encoding
> and decoding of the structured config into a flat structure, or use JSON
> encoded string value. The use of JSON would make querying individual
> values much harder.
>
> On 22/11/2021 16:16, Joseph Lynch wrote:
> > Isn't one of the primary reasons to have a YAML configuration instead
> > of a properties file to allow typed and structured (implies nested)
> > configuration? I 

Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread Benjamin Lerer
I do not think that supporting both options is an issue. The settings
virtual table would have to use the flattened version.
If we support both formats, the question would be: what should be the one
used by default in the configuration file?

On Fri, 26 Nov 2021 at 15:40, bened...@apache.org wrote:

> This is the approach I favour for config files also. We had a much less
> engaged discussion on this topic only a few months ago, so glad to see more
> people getting involved now.
>
> I would however personally prefer to see the configuration file slowly
> deprecated (if perhaps never retired), in favour of virtual tables, so that
> operators may easily set configurations for the entire cluster. Ideally it
> would be possible to specify configuration per cluster, per DC and per
> node, with the most specific configuration applying. I would like to see a
> similar hierarchy for Keyspace, Table and Per-Query options. Ideally only
> the barest minimum number of options would be necessary to supply in a
> config file, and only on first launch – seed nodes, for instance.
>
> So whatever design we employ here, we should IMO be aiming for it to be
> compatible with a CQL representation also.
>
>
> From: Bowen Song 
> Date: Wednesday, 24 November 2021 at 18:15
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] Nested YAML configs for new features
> Since you mentioned ElasticSearch, I'm actually pretty happy with their
> config file syntax. It allows the user to completely flatten out the
> entire config file. To give people who aren't familiar with ElasticSearch
> an idea, here is a config file we use:
>
> cluster.name: foobar
>
> node.remote_cluster_client: false
> node.name: "foo.example.com"
> node.master: true
> node.data: true
> node.ingest: true
> node.ml: false
>
> xpack.ml.enabled: false
> xpack.security.enabled: false
> xpack.security.audit.enabled: false
> xpack.watcher.enabled: false
>
> action.auto_create_index: "+.,-*"
>
> network.host: _global_
>
> discovery.zen.hosts_provider: file
> discovery.zen.minimum_master_nodes: 2
>
> http.publish_host: "foo.example.com"
> http.publish_port: 443
> http.bind_host: 127.0.0.1
>
> transport.publish_host: "bar.example.com"
> transport.bind_host: 0.0.0.0
>
> indices.fielddata.cache.size: 1GB
> indices.breaker.total.use_real_memory: false
>
> path.logs: /var/log/elasticsearch
> path.data: /var/lib/elasticsearch/data
>
> As you can see we can use the flat (grep-able) syntax for everything.
> This is also human readable because we can group options together by
> inserting empty lines between them.
>
> The equivalent of the above in a structured syntax will be:
>
> cluster:
>  name: foobar
>
> node:
>  remote_cluster_client: false
>  name: "foo.example.com"
>  master: true
>  data: true
>  ingest: true
>  ml: false
>
> xpack:
>  ml:
>   enabled: false
>  security:
>   enabled: false
>   audit:
>    enabled: false
>  watcher:
>   enabled: false
>
> action:
>  auto_create_index: "+.,-*"
>
> network:
>  host: _global_
>
> discovery:
>  zen:
>   hosts_provider: file
>   minimum_master_nodes: 2
>
> http:
>  publish_host: "foo.example.com"
>  publish_port: 443
>  bind_host: 127.0.0.1
>
> transport:
>  publish_host: "bar.example.com"
>  bind_host: 0.0.0.0
>
> indices:
>  fielddata:
>   cache:
>    size: 1GB
>  breaker:
>   total:
>    use_real_memory: false
>
> path:
>  logs: /var/log/elasticsearch
>  data: /var/lib/elasticsearch/data
>
> This may be easier to read for some people, but it is a total nightmare
> for "grep" - so many keys have identical names, such as "enabled".
>
> Also, for the virtual tables, it would be a lot easier to represent
> individual values in a virtual table when the config is flat and keys
> are unique. The virtual tables would need to either support the encoding
> and decoding of the structured config into a flat structure, or use JSON
> encoded string value. The use of JSON would make querying individual
> values much harder.
>
> On 22/11/2021 16:16, Joseph Lynch wrote:
> > Isn't one of the primary reasons to have a YAML configuration instead
> > of a properties file to allow typed and structured (implies nested)
> > configuration? I think it makes a lot of sense to group related
> > configuration options (e.g. a feature) into a typed class when we're
> > talking about more than one or two related options.
> >
> > It's pretty standard elsewhere in the JVM ecosystem to encode YAMLs to
> > period encoded key->value pairs when required (usually when providing
> > a property or override