Re: [DISCUSS] CEP-11: Pluggable memtable implementations

2021-07-21 Thread Michael Burman
Hi,

It is nice to see these going forward (and a great use of the CEP process),
so thanks for the proposal. I do have reservations about linking the memtable
to the CommitLog and to flushing: the abstractions should not leak from one
to the other. I also don't see the reasoning for why they should be linked;
it doesn't seem to add anything beyond tight coupling of components, reducing
reuse and making things unnecessarily complicated. The streaming notions also
seem odd to me - how are they related to the memtable? Why should the
memtable care about behavior outside its responsibility?

Some misc (with some thoughts split / duplicated to different parts) quotes
and comments:

> Tight coupling between CFS and memtable will be reduced: flushing
functionality is to be extracted, controlling memtable memory and period
expiration will be handled by the memtable.

Why is flushing control bad to do in the CFS and better in the memtable?
Doing it outside the memtable would allow flushing to be controlled
regardless of how the actual memtable is implemented. For example, let's say
someone wanted to implement HBase's Accordion in Cassandra: it shouldn't
matter what the memtable implementation is, as compacting different memtables
could be beneficial to all implementations. Or flushing could push the
memtable into a proper cache instead of only to disk.

Or if we had a per-table caching structure, we could control the flushing of
memtables and the cache structure separately. Some data benefits from LRU
and some from MRW (most-recently-written) caching strategies, but both could
benefit from the same memtable implementation; it's the data and how it's
used that should control how flushing works. For example, time series data
behaves quite differently in terms of data access from something more
"random".

Or even "total memory control" which would check which tables need more
memory to do their writes and which do not. Or that the memory doesn't grow
over a boundary and needs to manually maintain how much is dedicated to
caching and how much to memtables waiting to be flushed. Or delay flushing
because the disks can't keep up etc. Not to be implemented in this CEP, but
pushing this strategy to memtable would prevent many features.
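
To make this concrete, here is a rough Java sketch of what I mean by keeping
flush control outside the memtable. All names here are hypothetical and only
illustrate the shape of the idea; nothing of this exists in the CEP or in the
current code base:

    // Hypothetical sketch only: the interfaces are illustrative, not part of
    // CEP-11 or of the current Cassandra code base.
    interface Memtable
    {
        long liveDataSize();          // bytes currently held
        long oldestWriteMillis();     // for period-based expiry
        void flushTo(SSTableWriter writer);
    }

    interface SSTableWriter { /* elided */ }

    // The flush policy is chosen per table, independently of which memtable
    // implementation the table happens to use.
    interface FlushController
    {
        boolean shouldFlush(Memtable memtable);
    }

    // Example policy: flush on size or age. The same memtable implementation
    // could equally be driven by an LRU/MRW cache policy or by a global
    // memory coordinator, without the memtable knowing about any of it.
    final class SizeOrAgeFlushController implements FlushController
    {
        private final long maxBytes;
        private final long maxAgeMillis;

        SizeOrAgeFlushController(long maxBytes, long maxAgeMillis)
        {
            this.maxBytes = maxBytes;
            this.maxAgeMillis = maxAgeMillis;
        }

        @Override
        public boolean shouldFlush(Memtable memtable)
        {
            long age = System.currentTimeMillis() - memtable.oldestWriteMillis();
            return memtable.liveDataSize() >= maxBytes || age >= maxAgeMillis;
        }
    }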

> Beyond thread-safety, the concurrency constraints of the memtable are
intentionally left unspecified.

I like this. I could see use cases where a single-threaded implementation
could actually outperform some concurrent data structures. But it also raises
a question: is this proposal going to take an angle towards per-range
memtables? There are certainly benefits to splitting the memtables, as it
would reduce the "n" in the operations and thus the overhead of lookups and
writes. Taking it one step back in the write path, I could see the benefit of
having a commitlog per range as well, which would allow higher utilization of
NVMe drives with larger queue depths. And why not per-range SSTables for
faster scale-outs and... this is a bit outside the scope of the CEP, but I
just want to ensure that the implementation does not block such improvements.

Interfaces:

> boolean writesAreDurable()
> boolean writesShouldSkipCommitLog()

Placing these methods inside the memtable implementation just feels wrong to
me. The write pipeline should have these configured, and they could differ
per table even with the same memtable implementation. Let's take the example
of an in-memory memtable use case that's never written to an SSTable. We
could have one table with simple in-memory cached storage and another with
Redis-style AOF persistence, where writes go to the commitlog for fast
recovery but the data is otherwise kept only in the memtable instead of being
written to SSTables (for performance reasons). Same memtable implementation
in both cases.

Why would the table's write process ask the memtable what settings the table
has, instead of asking the table itself? That seems counterintuitive to me.
Even the persistent memory case is a bit questionable: why not simply disable
the commitlog in the write process? Why ask the memtable?
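
As a sketch of the alternative I have in mind (again with made-up names, not
the CEP's proposed API): the write path consults per-table persistence
options, and the memtable stays oblivious to all of it:

    // Hypothetical sketch: durability decisions live in the table / write-path
    // configuration instead of in the memtable interface.
    enum CommitLogMode { NORMAL, SKIP }

    final class TableWriteOptions
    {
        final CommitLogMode commitLogMode;
        final boolean flushToSSTables;   // false for a purely in-memory table

        TableWriteOptions(CommitLogMode commitLogMode, boolean flushToSSTables)
        {
            this.commitLogMode = commitLogMode;
            this.flushToSSTables = flushToSSTables;
        }
    }

    final class WritePipeline
    {
        interface Mutation { }
        interface Memtable { void put(Mutation mutation); }
        interface CommitLog { void add(Mutation mutation); }

        private final TableWriteOptions options;
        private final Memtable memtable;     // any implementation
        private final CommitLog commitLog;

        WritePipeline(TableWriteOptions options, Memtable memtable, CommitLog commitLog)
        {
            this.options = options;
            this.memtable = memtable;
            this.commitLog = commitLog;
        }

        void apply(Mutation mutation)
        {
            // The pipeline, not the memtable, decides whether the commit log
            // is used; options.flushToSSTables would similarly drive whether
            // a flush ever produces SSTables.
            if (options.commitLogMode != CommitLogMode.SKIP)
                commitLog.add(mutation);
            memtable.put(mutation);
        }
    }

The Redis-style AOF case above is then just commitLogMode = NORMAL with
flushToSSTables = false, and the pure in-memory case is SKIP with
flushToSSTables = false - same memtable implementation in both.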

This feels like the memtable is going to become the write pipeline, and to me
that doesn't seem like the correct architectural decision. I'd rather see
these decisions made outside the memtable. Even a persistent-memory memtable
user might want a commitlog enabled for data capture / log shipping, or for
tiered persistence. The whole persistent memory angle is a bit odd at the
moment given it has no commercially known future (based on public
information, even Optane is no longer manufactured, with the last factory
being dismantled).

> boolean streamToMemtable()

That one I don't understand. Why is streaming a concern of the memtable? This
smells like scope creep from something else. The explanation suggests to me
that the desired behavior is just disabling automated flushing.

But these are just some questions that came to mind.

Re: Implicit Casts for Arithmetic Operators

2018-11-20 Thread Michael Burman
> I agree with what's been said about expectations regarding expressions
> involving floating point numbers. I think that if one of the inputs is
> approximate then the result should be approximate.
>
> One thing we could look at for inspiration is the SQL spec. Not to
> follow dogmatically necessarily.
>
> From the SQL 92 spec regarding assignment
> http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt section 4.6:
> "
>   Values of the data types NUMERIC, DECIMAL, INTEGER, SMALLINT,
>   FLOAT, REAL, and DOUBLE PRECISION are numbers and are all mutually
>   comparable and mutually assignable. If an assignment would result
>   in a loss of the most significant digits, an exception condition
>   is raised. If least significant digits are lost, implementation-
>   defined rounding or truncating occurs with no exception condition
>   being raised. The rules for arithmetic are generally governed by
>   Subclause 6.12, "<numeric value expression>".
> "
>
> Section 6.12 numeric value expressions:
> "
>   1) If the data type of both operands of a dyadic arithmetic opera-
>      tor is exact numeric, then the data type of the result is exact
>      numeric, with precision and scale determined as follows:
>      ...
>   2) If the data type of either operand of a dyadic arithmetic op-
>      erator is approximate numeric, then the data type of the re-
>      sult is approximate numeric. The precision of the result is
>      implementation-defined.
> "
>
> And this makes sense to me. I think we should only return an exact
> result if both of the inputs are exact.
>
> I think we might want to look closely at the SQL spec and especially
> when the spec requires an error to be generated. Those are sometimes in
> the spec to prevent subtle paths to wrong answers. Any time we deviate
> from the spec we should be asking why is it in the spec and why are we
> deviating.
>
> Another issue besides overflow handling is how we determine precision
> and scale for expressions involving two exact types.
>
> Ariel
>
>> On Fri, Oct 12, 2018, at 11:51 AM, Michael Burman wrote:
>> Hi,
>>
>> I'm not sure if I would prefer the Postgres way of doing things, which is
>> returning just about any type depending on the order of operators.
>> Considering it actually mentions in the docs that using numeric/decimal is
>> slow and also multiple times that floating points are inexact. So doing
>> some math with Postgres (9.6.5):
>>
>> SELECT 2147483647::bigint*1.0::double precision returns double
>> precision 2147483647
>> SELECT 2147483647::bigint*1.0 returns numeric 2147483647.0
>> SELECT 2147483647::bigint*1.0::real returns double
>> SELECT 2147483647::double precision*1::bigint returns double 2147483647
>> SELECT 2147483647::double precision*1.0::bigint returns double 2147483647
>>
>> With + - we can get the same amount of mixture of returned types. There's
>> no difference in those calculations, just some casting. To me
>> floating-point math indicates inexactness and has errors and whoever mi

Re: Cassandra 4.0 on Windows 10 crashing upon startup with Java 11

2018-11-16 Thread Michael Burman

On 11/12/18 5:37 PM, Michael Shuler wrote:

Issue with upstream links to:
https://github.com/hyperic/sigar/issues/77


.. clip. Considering that Sigar has been unmaintained for years (and has a 
large number of unfixed bugs), should we consider removing it from the 
project? It's not used much, so finding a suitable replacement for those 
few functions shouldn't be that big of a deal.


  - Micke




Re: Implicit Casts for Arithmetic Operators

2018-10-12 Thread Michael Burman
Hi,

I'm not sure I would prefer the Postgres way of doing things, which is
returning just about any type depending on the order of the operators -
especially considering its docs mention that using numeric/decimal is slow,
and multiple times that floating points are inexact. So, doing some math
with Postgres (9.6.5):

SELECT 2147483647::bigint*1.0::double precision returns double
precision 2147483647
SELECT 2147483647::bigint*1.0 returns numeric 2147483647.0
SELECT 2147483647::bigint*1.0::real returns double
SELECT 2147483647::double precision*1::bigint returns double 2147483647
SELECT 2147483647::double precision*1.0::bigint returns double 2147483647

With + and - we can get the same mixture of returned types. There's no
difference in those calculations, just some casting. To me, floating-point
math implies inexactness and errors, and whoever mixes two different types
should understand that. If one didn't want an exact numeric type, why would
the server return one? The floating-point value itself could already be
wrong before the calculation - claiming we do it losslessly is just wrong.
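
As a plain-Java aside (not CQL, just to show that the value can already be
inexact before any operator touches it):

    import java.math.BigDecimal;

    public class Inexact
    {
        public static void main(String[] args)
        {
            // Integer.MAX_VALUE is not representable as a float:
            int i = 2147483647;
            float f = i;                    // rounds up to 2.14748365E9
            System.out.println((long) f);   // prints 2147483648, not 2147483647

            // 2.65 has no exact binary representation either:
            double d = 2.65;
            System.out.println(new BigDecimal(d));      // 2.6499999999999999111...
            System.out.println(BigDecimal.valueOf(d));  // 2.65 (decimal round-trip)
        }
    }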

Fun with 2.65:

SELECT 2.65::real * 1::int returns double 2.6509536743
SELECT 2.65::double precision * 1::int returns double 2.65

SELECT round(2.65) returns numeric 4
SELECT round(2.65::double precision) returns double 4

SELECT 2.65 * 1 returns double 2.65
SELECT 2.65 * 1::bigint returns numeric 2.65
SELECT 2.65 * 1.0 returns numeric 2.650
SELECT 2.65 * 1.0::double precision returns double 2.65

SELECT round(2.65) * 1 returns numeric 3
SELECT round(2.65) * round(1) returns double 3

So as we're going to have silly values in any case, why pretend otherwise?
Also, exact calculations are slow if we crunch large amounts of numbers. I
guess I drifted slightly into Postgres' implementation here, but I wish it
weren't used as the benchmark. And most importantly, I would definitely want
the exact same type returned each time I do a calculation.

  - Micke

On Fri, Oct 12, 2018 at 4:29 PM Benedict Elliott Smith wrote:

> As far as I can tell we reached a relatively strong consensus that we
> should implement lossless casts by default?  Does anyone have anything more
> to add?
>
> Looking at the emails, everyone who participated and expressed a
> preference was in favour of the “Postgres approach” of upcasting to decimal
> for mixed float/int operands?
>
> I’d like to get a clear-cut decision on this, so we know what we’re doing
> for 4.0.  Then hopefully we can move on to a collective decision on Ariel’s
> concerns about overflow, which I think are also pressing - particularly for
> tinyint and smallint.  This does also impact implicit casts for mixed
> integer type operations, but an approach for these will probably fall out
> of any decision on overflow.
>
>
>
>
>
>
> > On 3 Oct 2018, at 11:38, Murukesh Mohanan wrote:
> >
> > I think you're conflating two things here. There's the loss resulting from
> > using some operators, and loss involved in casting. Dividing an integer by
> > another integer to obtain an integer result can result in loss, but there's
> > no implicit casting there and no loss due to casting.  Casting an integer
> > to a float can also result in loss. So dividing an integer by a float, for
> > example, with an implicit cast has an additional avenue for loss: the
> > implicit cast for the operands so that they're of the same type. I believe
> > this discussion so far has been about the latter, not the loss from the
> > operations themselves.
> >
> > On Wed, 3 Oct 2018 at 18:35 Benjamin Lerer wrote:
> >
> >> Hi,
> >>
> >> I would like to try to clarify things a bit to help people to understand
> >> the true complexity of the problem.
> >>
> >> The *float* and *double* types are inexact numeric types. Not only at the
> >> operation level.
> >>
> >> If you insert 676543.21 in a *float* column and then read it, you will
> >> realize that the value has been truncated to 676543.2.
> >>
> >> If you want accuracy the only way is to avoid those inexact types.
> >> Using *decimals* during operations will mitigate the problem but will not
> >> remove it.
> >>
> >> I do not recall PostgreSQL behaving as described. If I am not mistaken, in
> >> PostgreSQL *SELECT 3/2* will return *1*, which is similar to what MS SQL
> >> Server and Oracle do. So all those databases will lose precision if you
> >> are not careful.
> >>
> >> If you truly need precision you can have it by using exact numeric types
> >> for your data types. Of course it has a cost on performance, memory and
> >> disk usage.
> >>
> >> The advantage of the current approach is that it gives you the choice. It
> >> is up to you to decide what you need for your application. It is also in
> >> line with the way CQL behaves everywhere else.
> >>
> > --
> >
> > Muru
>
>

Re: Scratch an itch

2018-07-12 Thread Michael Burman

On 07/12/2018 07:38 PM, Stefan Podkowinski wrote:

this point? Also, if we tell someone that their contribution will be
reviewed and committed later after 4.0-beta, how is that actually making
a difference for that person, compared to committing it now for a 4.x
version. It may be satisfying to get a patch committed, but what matters
more is when the code will actually be released and deferring committing
contributions after 4.0-beta doesn't necessarily mean that there's any
disadvantage when it comes to that.

Deferring a huge number of commits creates rebase/redo hell. That's the 
biggest impact: the order in which these deferred commits are eventually 
committed can make the process more or less painful depending on the commit, 
and each contributor then has to rebase/redo their own commit in turn, with 
those timings creating yet more rebase issues - assuming those contributors 
still want to rebase something after n months, or have the time at that 
point.


That's a problem for all Cassandra patches that take a long time to get 
committed, and if this freeze lasts a long time it will be even more 
painful. I know products such as Kubernetes do the same "trunk patches 
only" approach (I guess that's where this idea might have come from), but 
their freeze window is quite short.


My wish is that this freeze does not last so long that it kills enthusiasm 
for contributing to Cassandra. There are (I assume) many hobbyists who do 
this as a side project rather than as their daily work and who might not 
have the means to test 4.0 in a way that will trigger bugs (easy bugs are 
fixed quite quickly, I hope). If they feel it's not worth investing time in 
Cassandra at this point (because nothing they do will get merged), they 
might move to another project - and there's no guarantee they will return. 
Getting stuff into the product is part of the satisfaction, and without 
satisfaction there's no interest in continuing.


  - Micke




Re: Planning to port cqlsh to Python 3 (CASSANDRA-10190)

2018-06-01 Thread Michael Burman

Hi,

Deprecation in this context does not mean removing Python 2 or replacing it 
with 3 (RHEL 7.x will keep Python 2.x as the default); it refers to future 
major versions (>7), and there are none at this point. It appears Ubuntu 
has deviated from Debian in this sense, but Debian has not changed yet 
(Debian 10 likely will, but it's not out yet and has no announced release 
date).


Thus, 2.x remains the most used version on servers, and servers deployed 
at this point in time will use these versions for years.


  - Micke


On 06/01/2018 10:52 AM, Murukesh Mohanan wrote:

On 2018/06/01 07:40:04, Michael Burman  wrote:

IIRC, there's no major distribution yet that defaults to Python 3 (I
think Ubuntu & Debian are still defaulting to Python 2 also). This will
happen eventually (maybe), but not yet. Discarding Python 2 support
would mean more base-OS work for most people wanting to run Cassandra
and that's not a positive thing.


Ubuntu since 16.04 defaults to Python 3:


Python2 is not installed anymore by default on the server, cloud and the touch 
images, long live Python3! Python3 itself has been upgraded to the 3.5 series. 
- https://wiki.ubuntu.com/XenialXerus/ReleaseNotes#Python_3

RHEL 7.5 deprecates Python 2 
(https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.5_release_notes/chap-red_hat_enterprise_linux-7.5_release_notes-deprecated_functionality).










Re: Planning to port cqlsh to Python 3 (CASSANDRA-10190)

2018-06-01 Thread Michael Burman

Hi,

It should definitely be cross-compatible with Python 2/3. Most of the 
systems (such as those running RHEL 7 or distros based on it, like CentOS) 
ship with 2.7 only by default, and these systems will probably be used to 
run Cassandra for a long time.


IIRC, there's no major distribution yet that defaults to Python 3 (I 
think Ubuntu & Debian are still defaulting to Python 2 also). This will 
happen eventually (maybe), but not yet. Discarding Python 2 support 
would mean more base-OS work for most people wanting to run Cassandra 
and that's not a positive thing.


Going forward, 2 & 3 compatibility would mean that we support a larger 
number of distributions out of the box.


  - Micke


On 06/01/2018 05:44 AM, Patrick Bannister wrote:

I propose porting cqlsh and cqlshlib to Python 3. End-of-life for Python 2.7
is currently planned for 1 January 2020. We should prepare to port the tool
to a version of Python that will be officially supported.

I'm seeking input on three questions:
- Should we port it to straight Python 3, or Python 2/3 cross compatible?
- How much more testing is needed?
- Can we wait until after 4.0 for this?

I have an implementation to go with my proposal. In parallel with getting
the dtest cqlsh_tests working again, I
ported cqlsh.py and cqlshlib to Python 3. It passes with almost all of the
dtests and the unittests, so it's in pretty good shape, although it's not
100% done (more on that below).

*Python 3 or 2/3 cross compatible?* There are plenty of examples of Python
libraries that are compatible with both Python 2 and Python 3 (notably the
Cassandra Python driver), so I think this is achievable. The question is,
do we want to pay the price of cross compatibility? If we write cqlsh to be
2/3 cross compatible, we'll carry a long term technical debt to maintain
that feature. The value of continuing to support Python 2 will diminish
over time. However, a cross compatible implementation may ease the
transition for some users, especially if there are users who have made
significant custom modifications to the Python 2.7 implementation of cqlsh,
so I think we must at least consider the question.

*What additional testing is needed before we could release it?* I used
coverage.py to check on the code coverage of our existing dtest cqlsh_tests
and cqlshlib unittests. There are several blind spots in our current
testing that should be addressed before we release a port of cqlsh. Details
of this are available on JIRA ticket CASSANDRA-10190 in the attachment
coverage_notes.txt.
Beyond that, I've made no efforts to test on platforms other than Ubuntu
and CentOS, so Windows testing is needed if we're making efforts to support
Windows. It would also be preferable for some real users to try out the
port before it replaces the Python 2.7 cqlsh in a release.

Besides this, there are a couple of test failures I'm still trying to
figure out, notably tests involving user defined map types (a task made
more interesting by Python's general lack of support for immutable map
types).

*Can we wait until after 4.0 for this?* I don't think it's reasonable to
try to release this with 4.0 given the current consensus around a feature
freeze in the next few months. My feeling is that our testers and
committers are already very busy with the currently planned changes for
4.0. I recommend planning toward a release to occur after 4.0. If we run up
against Python 2.7 EOL before we can cut the next release, we could
consider releasing a ported cqlsh independently, for installation through
distutils or pip.

Patrick Bannister







Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-21 Thread Michael Burman

On 03/21/2018 04:52 PM, Josh McKenzie wrote:


This would certainly mitigate a lot of the core problems with the new
release model. Has there been any public statements of plans/intent
with regards to distros doing this?
Since the latest official LTS version is Java 8, that's the only one with 
publicly available information. For RHEL, OpenJDK 8 will receive updates 
until October 2020: "A major version of OpenJDK is supported for a period 
of six years from the time that it is first introduced in any version of 
RHEL, or until the retirement date of the underlying RHEL platform, 
whichever is earlier." [1]


[1] https://access.redhat.com/articles/1299013


In terms of the burden of bugfixes and security fixes if we bundled a
JRE w/C*, cutting a patch release of C* with a new JRE distribution
would be a really low friction process (add to build, check CI, green,
done), so I don't think that would be a blocker for the concept.

And do we have someone actively monitoring CVEs for this? Would we ship a 
version of OpenJDK that is verified to work with all the major 
distributions? Would we run tests against all the major distributions for 
each OpenJDK version we ship, after each CVE, with each Cassandra version? 
Who compiles the OpenJDK distribution we would create (which wouldn't be 
the official one if we need to maintain support for every distribution)? 
What if one build doesn't work on one distro - would we then not ship the 
update for that CVE? The OpenJDK builds in the distros are not necessarily 
the pure upstream ones; they might include patches that provide better 
support for the distribution, or even fix bugs that are not yet in the 
upstream version.


I guess we would also need Windows versions, and maybe PowerPC & ARM 
versions at some point. I'm also not sure whether we plan to support J9 or 
other JVMs eventually.


We would also need to file CVE reports for Cassandra after each Java CVE, 
I would assume, since it would affect us separately (and updating only the 
Java wouldn't help).


To me this sounds like an underestimate of the amount of work that would 
go into this. Not to mention the bad publicity if Java CVEs are not 
instantly patched in Cassandra as well (and then each user would have to 
validate that the shipped version actually works with their installation 
and their hardware, since they won't get support for it from the vendors, 
as it's an unofficial package).


  - Micke




Re: Expensive metrics?

2018-02-28 Thread Michael Burman

Hi,

I wrote CASSANDRA-14281 for the initial idea and for where I ended up with 
my current prototype. It maintains the current layout of the JMX metrics, 
so it shouldn't be visible to users. "Shouldn't", because I couldn't really 
find any definition of our metrics. For example, our histograms built with 
a Meter use an ExponentiallyDecayingReservoir, but our histograms built 
directly use the DecayingEstimatedHistogramReservoir algorithm. So these 
two histograms behave differently, yet the JMX requester has no idea which 
way they're built.


Also, recentValues(), for example, does not use either behavior but 
actually follows two different behaviors again (!). CASSANDRA-13642 added 
recentValues(), which uses values() underneath, but the ticket itself did 
not really describe what behavior should be exposed (luckily it went into 
4.0, so technically it is not in any released version yet).


I doubt anyone really knows how our metrics are supposed to work. What if 
I change how they decay? Or should we use a single decay strategy for all 
our histograms (and related metrics)?
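
To make the inconsistency concrete, here is a minimal metrics-core example;
Cassandra's DecayingEstimatedHistogramReservoir would be yet another behavior
on top of these two:

    // Same codahale Histogram type, different reservoirs, different decay
    // behavior - and a JMX client cannot tell which one it is talking to.
    import com.codahale.metrics.ExponentiallyDecayingReservoir;
    import com.codahale.metrics.Histogram;
    import com.codahale.metrics.UniformReservoir;

    public class ReservoirDemo
    {
        public static void main(String[] args)
        {
            // Forward-biased: older samples decay away over time.
            Histogram decaying = new Histogram(new ExponentiallyDecayingReservoir());
            // Uniform over the whole lifetime: older samples never decay.
            Histogram uniform = new Histogram(new UniformReservoir());

            for (int i = 0; i < 100_000; i++)
            {
                decaying.update(i);
                uniform.update(i);
            }
            // Same inputs, potentially very different percentiles over time.
            System.out.println(decaying.getSnapshot().get99thPercentile());
            System.out.println(uniform.getSnapshot().get99thPercentile());
        }
    }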


I guess my ticket became suddenly slightly more complex ;)

  - Micke

On 02/28/2018 12:25 AM, Nate McCall wrote:

Hi Micke,
There is some good research in here - have you had a chance to create
some issues in Jira from this?

On Fri, Feb 23, 2018 at 6:28 AM, Michael Burman <mibur...@redhat.com> wrote:

Hi,

I was referring to this article by Shipilev (there are few small issues
forgotten in that url you pasted):

https://shipilev.net/blog/2014/nanotrusting-nanotime/

And his lovely recommendation on it: "System.nanoTime is as bad as
String.intern now: you can use it, but use it wisely. ". And Cassandra uses
it quite a lot in the write path at least. There isn't necessarily a better
option in Java for it, but for that reason we shouldn't push them everywhere
in the code "for fun".

   - Micke



On 02/22/2018 06:08 PM, Jeremiah D Jordan wrote:

re: nanoTime vs currentTimeMillis there is a good blog post here about the
timing of both and how your choice of Linux clock source can drastically
effect the speed of the calls, and also showing that in general on linux
there is no perf improvement for one over the other.
http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html


On Feb 22, 2018, at 11:01 AM, Blake Eggleston <beggles...@apple.com>
wrote:

Hi Micke,

This is really cool, thanks for taking the time to investigate this. I
believe the metrics around memtable insert time come in handy in identifying
high partition contention in the memtable. I know I've been involved in a
situation over the past year where we got actionable info from this metric.
Reducing resolution to milliseconds is probably a no go since most things in
this path should complete in less than a millisecond.

Revisiting the use of the codahale metrics in the hot path like this
definitely seems like a good idea though. I don't think it's been something
we've talked about a lot, and it definitely looks like we could benefit from
using something more specialized here. I think it's worth doing, especially
since there won't be any major changes to how we do threading in 4.0. It's
probably also worth opening a JIRA and investigating the calls to nano time.
We at least need microsecond resolution here, and there could be something
we haven't thought of? It's worth a look at least.

Thanks,

Blake

On 2/22/18, 6:10 AM, "Michael Burman" <mibur...@redhat.com> wrote:

 Hi,

 I wanted to get some input from the mailing list before making a JIRA
 and potential fixes. I'll touch the performance more on latter part, but
 there's one important question regarding the write latency metric
 recording place. Currently we measure the writeLatency (and metric write
 sampler..) in ColumnFamilyStore.apply() and this is also the metric we
 then replicate to Keyspace metrics etc.

 This is an odd place for writeLatency. Not to mention it is in a
 hot-path of Memtable-modifications, but it also does not measure the
 real write latency, since it completely ignores the CommitLog latency in
 that same process. Is the intention really to measure
 Memtable-modification latency only or the actual write latencies?

 Then the real issue.. this single metric is a cause of huge overhead in
 Memtable processing. There are several metrics / events in the CFS apply
 method, including metric sampler, storageHook reportWrite,
 colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free
 at all when it comes to the processing. I made a small JMH benchmark
 here: https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade
 that I'll be referring to.

 The most offending of all these metrics is the writeLatency metric. What
 it does is update the latency in codahale's timer, doing a histogram
 update and then going through all the parent metrics 

Re: Expensive metrics?

2018-02-22 Thread Michael Burman

Hi,

I was referring to this article by Shipilev (it covers a few small issues 
that were overlooked in the url you pasted):


https://shipilev.net/blog/2014/nanotrusting-nanotime/

And his lovely recommendation on it: "System.nanoTime is as bad as 
String.intern now: you can use it, but use it wisely.". And Cassandra uses 
it quite a lot, in the write path at least. There isn't necessarily a 
better option for it in Java, but that's exactly why we shouldn't push it 
everywhere in the code "for fun".
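
One generic mitigation pattern (just a sketch of the idea, not a proposal 
for concrete Cassandra code) is a coarse cached clock updated by a 
background thread, so that the hot path only pays for a volatile read:

    // Generic sketch of a cached clock: trades precision for a near-free read.
    // Resolution is bounded by the update interval, ~1 ms here.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public final class CoarseClock
    {
        private static volatile long nowNanos = System.nanoTime();

        private static final ScheduledExecutorService UPDATER =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "coarse-clock");
                    t.setDaemon(true);
                    return t;
                });

        static
        {
            UPDATER.scheduleAtFixedRate(() -> nowNanos = System.nanoTime(),
                                        1, 1, TimeUnit.MILLISECONDS);
        }

        // Only a volatile read on the caller's side.
        public static long nanoTime()
        {
            return nowNanos;
        }
    }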


  - Micke


On 02/22/2018 06:08 PM, Jeremiah D Jordan wrote:

re: nanoTime vs currentTimeMillis there is a good blog post here about the 
timing of both and how your choice of Linux clock source can drastically effect 
the speed of the calls, and also showing that in general on linux there is no 
perf improvement for one over the other.
http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html


On Feb 22, 2018, at 11:01 AM, Blake Eggleston <beggles...@apple.com> wrote:

Hi Micke,

This is really cool, thanks for taking the time to investigate this. I believe 
the metrics around memtable insert time come in handy in identifying high 
partition contention in the memtable. I know I've been involved in a situation 
over the past year where we got actionable info from this metric. Reducing 
resolution to milliseconds is probably a no go since most things in this path 
should complete in less than a millisecond.

Revisiting the use of the codahale metrics in the hot path like this definitely 
seems like a good idea though. I don't think it's been something we've talked 
about a lot, and it definitely looks like we could benefit from using something 
more specialized here. I think it's worth doing, especially since there won't 
be any major changes to how we do threading in 4.0. It's probably also worth 
opening a JIRA and investigating the calls to nano time. We at least need 
microsecond resolution here, and there could be something we haven't thought 
of? It's worth a look at least.

Thanks,

Blake

On 2/22/18, 6:10 AM, "Michael Burman" <mibur...@redhat.com> wrote:

Hi,

I wanted to get some input from the mailing list before making a JIRA
and potential fixes. I'll touch the performance more on latter part, but
there's one important question regarding the write latency metric
recording place. Currently we measure the writeLatency (and metric write
sampler..) in ColumnFamilyStore.apply() and this is also the metric we
then replicate to Keyspace metrics etc.

This is an odd place for writeLatency. Not to mention it is in a
hot-path of Memtable-modifications, but it also does not measure the
real write latency, since it completely ignores the CommitLog latency in
that same process. Is the intention really to measure
Memtable-modification latency only or the actual write latencies?

Then the real issue.. this single metric is a cause of huge overhead in
Memtable processing. There are several metrics / events in the CFS apply
method, including metric sampler, storageHook reportWrite,
colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free
at all when it comes to the processing. I made a small JMH benchmark
here: https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade
that I'll be referring to.

The most offending of all these metrics is the writeLatency metric. What
it does is update the latency in codahale's timer, doing a histogram
update and then going through all the parent metrics also which update
the keyspace writeLatency and globalWriteLatency. When measuring the
performance of Memtable.put with parameter of 1 partition (to reduce the
ConcurrentSkipListMap search speed impact - that's separate issue and
takes a little bit longer to solve although I've started to prototype
something..) on my machine I see 1.3M/s performance with the metric and
when it is disabled the performance climbs to 4M/s. So the overhead for
this single metric is ~2/3 of total performance. That's insane. My perf
stats indicate that the CPU is starved as it can't get enough data in.

Removing the replication from TableMetrics to the Keyspace & global
latencies in the write time (and doing this when metrics are requested
instead) improves the performance to 2.1M/s on my machine. It's an
improvement, but it's still huge amount. Even when we pressure the
ConcurrentSkipListMap with 100 000 partitions in one active Memtable,
the performance drops by about ~40% due to this metric, so it's never free.

i did not find any discussion replacing the metric processing with
something faster, so has this been considered before? At least for these
performance sensitive ones. The other issue is obviously the use of
System.nanotime() which by itself is very slow (two System.nanotime()
calls eat another ~1M/s from the

Re: Expensive metrics?

2018-02-22 Thread Michael Burman

Hi,

I've looked at the metrics' expense at a high level. It's around ~4% of 
the total CPU time on my machine. But the problem with that higher-level 
measurement is that it does not show waits. When I push writes to Cassandra 
(through CQL) I'm mostly seeing stalls according to the kernel-level 
measurements - that is, the CPU can't do work. IIRC (I'm on the wrong 
computer now and can't check exact numbers) my CPU reports that about ~70% 
of the CPU cycles are wasted when running a write test against a live 
Cassandra instance. But at that level there's too much data to really get 
into the details of what is blocking the CPU from executing - whether it's 
cache misses or branch misses (the latter are actually quite high), and 
where they happen.


That's why I'm trying to look at the smaller parts first, because they are 
easier to measure and debug (even though that approach certainly can't 
easily solve architectural issues). I do have some higher-level versions of 
that gist microbenchmark as well (omitted for clarity), and I intend to 
create more of them.


  - Micke


On 02/22/2018 06:32 PM, Jonathan Haddad wrote:

Hey Micke, very cool you're looking to improve C*'s performance, we would
absolutely benefit from it.

Have you done any other benchmarks beside the micro one to determine the
total effect of these metrics on the system overall?  Microbenchmarks are a
great way to tune small sections of code but they aren't a great starting
point.  It would be good if we could put some context around the idea by
benchmarking a tuned, single node (so there's less network overhead)
running on fast disks with compaction disabled so we can see what kind of
impact these metrics are adding.  Ideally we'd look at GC promotion and CPU
time using something like YourKit to identify the overall effect of the
metrics, so we can set our expectations and goals in a reasonable manner.
Happy to coordinate with you on this!

On Thu, Feb 22, 2018 at 8:08 AM Jeremiah D Jordan <jeremiah.jor...@gmail.com>
wrote:


re: nanoTime vs currentTimeMillis there is a good blog post here about the
timing of both and how your choice of Linux clock source can drastically
effect the speed of the calls, and also showing that in general on linux
there is no perf improvement for one over the other.
http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html


On Feb 22, 2018, at 11:01 AM, Blake Eggleston <beggles...@apple.com> wrote:

Hi Micke,

This is really cool, thanks for taking the time to investigate this. I
believe the metrics around memtable insert time come in handy in
identifying high partition contention in the memtable. I know I've been
involved in a situation over the past year where we got actionable info
from this metric. Reducing resolution to milliseconds is probably a no go
since most things in this path should complete in less than a millisecond.

Revisiting the use of the codahale metrics in the hot path like this
definitely seems like a good idea though. I don't think it's been something
we've talked about a lot, and it definitely looks like we could benefit
from using something more specialized here. I think it's worth doing,
especially since there won't be any major changes to how we do threading in
4.0. It's probably also worth opening a JIRA and investigating the calls to
nano time. We at least need microsecond resolution here, and there could be
something we haven't thought of? It's worth a look at least.

Thanks,

Blake

On 2/22/18, 6:10 AM, "Michael Burman" <mibur...@redhat.com> wrote:

Hi,

I wanted to get some input from the mailing list before making a JIRA
and potential fixes. I'll touch the performance more on latter part, but
there's one important question regarding the write latency metric
recording place. Currently we measure the writeLatency (and metric write
sampler..) in ColumnFamilyStore.apply() and this is also the metric we
then replicate to Keyspace metrics etc.

This is an odd place for writeLatency. Not to mention it is in a
hot-path of Memtable-modifications, but it also does not measure the
real write latency, since it completely ignores the CommitLog latency in
that same process. Is the intention really to measure
Memtable-modification latency only or the actual write latencies?

Then the real issue.. this single metric is a cause of huge overhead in
Memtable processing. There are several metrics / events in the CFS apply
method, including metric sampler, storageHook reportWrite,
colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free
at all when it comes to the processing. I made a small JMH benchmark
here: https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade
that I'll be referring to.

The most offending of all these metrics is the writeLatency metric. What
it does is update the latency in codahale's timer, doing a histogr

Expensive metrics?

2018-02-22 Thread Michael Burman

Hi,

I wanted to get some input from the mailing list before making a JIRA 
and potential fixes. I'll touch on the performance more in the latter part, 
but there's one important question regarding where the write latency metric 
is recorded. Currently we measure the writeLatency (and the metric write 
sampler..) in ColumnFamilyStore.apply(), and this is also the metric we 
then replicate to the Keyspace metrics etc.


This is an odd place for writeLatency. Not only is it in the hot path of 
Memtable modifications, it also does not measure the real write latency, 
since it completely ignores the CommitLog latency in that same process. Is 
the intention really to measure Memtable-modification latency only, or the 
actual write latencies?


Then the real issue: this single metric is the cause of huge overhead in 
Memtable processing. There are several metrics / events in the CFS apply 
method, including the metric sampler, storageHook reportWrite, 
colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free 
at all when it comes to processing. I made a small JMH benchmark here: 
https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade 
that I'll be referring to.


The most offending of all these metrics is the writeLatency metric. What 
it does is update the latency in codahale's timer, doing a histogram 
update and then also going through all the parent metrics, which update 
the keyspace writeLatency and the global writeLatency. When measuring the 
performance of Memtable.put with a single partition (to reduce the impact 
of the ConcurrentSkipListMap search speed - that's a separate issue and 
takes a little bit longer to solve, although I've started to prototype 
something..), on my machine I see 1.3M ops/s with the metric enabled, and 
when it is disabled the performance climbs to 4M ops/s. So the overhead of 
this single metric is ~2/3 of total performance. That's insane. My perf 
stats indicate that the CPU is starved as it can't get enough data in.


Removing the write-time replication from TableMetrics to the Keyspace & 
global latencies (and doing it when metrics are requested instead) improves 
the performance to 2.1M ops/s on my machine. It's an improvement, but the 
overhead is still huge. Even when we pressure the ConcurrentSkipListMap 
with 100 000 partitions in one active Memtable, the performance drops by 
about ~40% due to this metric, so it's never free.


I did not find any previous discussion about replacing the metric 
processing with something faster, so has this been considered before? At 
least for these performance-sensitive ones. The other issue is obviously 
the use of System.nanoTime(), which is slow by itself (two System.nanoTime() 
calls eat another ~1M ops/s from the performance).
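
For reference, a stripped-down JMH sketch of the kind of comparison I mean 
(illustrative only - the gist linked above has the actual benchmark):

    // Compares a bare update against the same update wrapped in a codahale
    // Timer with two System.nanoTime() calls, i.e. the write-path pattern.
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    import com.codahale.metrics.Timer;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public class MetricOverheadBench
    {
        private final LongAdder counter = new LongAdder();
        private final Timer writeLatency = new Timer();

        @Benchmark
        public void plainUpdate()
        {
            counter.increment();
        }

        @Benchmark
        public void updateWithTimer()
        {
            long start = System.nanoTime();
            counter.increment();
            writeLatency.update(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }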


My personal quick fix would be to move writeLatency to Keyspace.apply, 
change write-time aggregates to read-time processing (metrics are read far 
less often than we write data), and maybe even drop from nanoTime to 
currentTimeMillis (even given its relative lack of precision). That is - if 
these metrics make any sense at the CFS level at all? Maybe they should be 
measured from the network processing time (including all the 
deserializations and such)? Especially if at some point the smarter 
threading / event-looping changes go forward (in which case writes might 
sleep in some "queue" for a while).
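
A rough sketch of what I mean by the read-time aggregation part 
(hypothetical code, not a patch): only the table-local timer is touched on 
the write path, and the keyspace-level numbers are derived lazily when 
somebody actually asks for them:

    // Per-table timers updated on the write path; the keyspace aggregate is a
    // Gauge computed only when the metric is read. Only the count is summed
    // here - merging percentiles would need a combined snapshot, which is
    // where the simplification lies.
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    public class ReadTimeAggregation
    {
        private final MetricRegistry registry = new MetricRegistry();
        private final List<Timer> tableWriteLatencies = new CopyOnWriteArrayList<>();

        public Timer registerTable(String table)
        {
            Timer timer = registry.timer("table." + table + ".writeLatency");
            tableWriteLatencies.add(timer);
            return timer;
        }

        public void registerKeyspaceAggregate(String keyspace)
        {
            registry.register("keyspace." + keyspace + ".writeCount",
                              (Gauge<Long>) () -> tableWriteLatencies.stream()
                                                                     .mapToLong(Timer::getCount)
                                                                     .sum());
        }

        // Write path: a single table-local timer update, no parent metrics.
        public void recordWrite(Timer tableTimer, long latencyNanos)
        {
            tableTimer.update(latencyNanos, TimeUnit.NANOSECONDS);
        }
    }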


  - Micke





Re: Pluggable storage engine discussion

2017-11-05 Thread Michael Burman

Hi,

There's also a ticket for a columnar storage option, which I guess is 
something many might want - not least because in many cases it could reduce 
the storage footprint by a large margin (and enable more sophisticated 
compression options), even if we discount the possible query advantages. 
For range queries it could be seriously faster, as Evan Chan reported with 
FiloDB and as I have personally seen in Hawkular-Metrics as well (storing 
"columnar blocks" in a single cell led to a ~95% reduction in query times 
for any non-trivial query, even when the blocks had to be deserialized on 
the client side).


CASSANDRA-7447

While anything can obviously be built on top of any storage engine, 
that's not necessarily the most effective way.
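
As a rough illustration of the "columnar blocks in a single cell" idea (a 
simplified sketch, not Hawkular-Metrics' actual encoding):

    // Packs a block of (timestamp, value) samples into one ByteBuffer so a
    // whole range can be written and read as a single cell. Column-major
    // layout (all timestamps, then all values) is what makes per-column
    // compression and delta encoding effective.
    import java.nio.ByteBuffer;

    public final class ColumnarBlock
    {
        public static ByteBuffer pack(long[] timestamps, double[] values)
        {
            if (timestamps.length != values.length)
                throw new IllegalArgumentException("column lengths must match");

            ByteBuffer block = ByteBuffer.allocate(4 + timestamps.length * (8 + 8));
            block.putInt(timestamps.length);
            for (long ts : timestamps)
                block.putLong(ts);
            for (double v : values)
                block.putDouble(v);
            block.flip();
            return block;
        }

        public static long[] unpackTimestamps(ByteBuffer block)
        {
            ByteBuffer b = block.duplicate();
            long[] timestamps = new long[b.getInt()];
            for (int i = 0; i < timestamps.length; i++)
                timestamps[i] = b.getLong();
            return timestamps;
        }
    }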


  - Micke


On 11/03/2017 10:48 PM, Stefan Podkowinski wrote:

Hi Dikang

Have you been able to continue evaluating RocksDB? I'm afraid we might
be a bit too much ahead in the discussion by already talking about a
pluggable architecture, while we haven't fully evaluated yet if we can
and want to support an alternative RocksDB engine implementation at all.
Because if we don't, we also don't need a pluggable architecture at this
point, do we? There's little to be gained from a major refactoring, just
to find out that alternative engines we thought of didn't turn out to be
a good fit for production for whatever reasons.

On the other hand, if RocksDB is (by whatever standards) a better
storage implementation, why not completely switch, instead of just
making it an option? But if it's not, is a major refactoring still worth it?


On 03.11.17 19:22, Dikang Gu wrote:

Hi,

We are having discussions about the pluggable storage engine plan on the
jira: https://issues.apache.org/jira/browse/CASSANDRA-13475.

We are trying to figure out a plan for the pluggable storage engine effort.
Right now, the discussion is mainly happening between couple C* committers,
like Blake and me. But I want to increase the visibility, and I'm very
welcome more developers to be involved in the discussion. It will help us
on moving forward on this effort.

Also, I have a quip as a (very high level) design doc for this project.
https://quip.com/bhw5ABUCi3co

Thanks
Dikang.






