Re: [DISCUSS] Chronicle Queue's development model and a hypothetical replacement of the library

Josh McKenzie Mon, 30 Sep 2024 05:56:04 -0700

> thinking more about stuff like protobuf and while I do see benefits of that, 
> honestly, it just does not matter too much if it is done like that or not.
I disagree; I think having a language and environment agnostic file format 
matters a great deal. Unless we're talking specifically about the FQL case in 
which case I totally agree. :)


Part of what makes open source projects successful is their ability to 
modularly interop with other projects, often times in ways we don't predict; us 
using the Chronicle format for our binary files is choosing a format that's 
esoteric with fairly antagonistic licensing compared to something like 
protobufs.

I'd strongly support either rolling the format change into the CEP-12 proposal 
or having another CEP for introducing protobuf, spark, etc - some kind of more 
broadly adopted format, and removing chronicle from our stack.

To Jon's point: having bespoke formats with all the code for it housed in 
cassandra-all rather than in a modular support library really ties our hands on 
a lot of fronts; updating the core DB in terms of serialization formats, of JDK 
versions, etc all potentially become client-impacting exercises.


On Mon, Sep 30, 2024, at 5:12 AM, Štefan Miklošovič wrote:
> That is all OK to mention. So if I read it correctly, Jon, myself, Andrew, 
> David - we all would like to see some cross-language events ingestion. Other 
> people do not seem to consider it important enough (just correct me if I am 
> wrong). I do not mind, I am 50:50, not an absolute must but sure, let's add 
> that to the wish list.
> 
> I believe that unless it is an absolute show-stopper (which it is not) it is 
> not necessary. It is something to keep in mind for the next refactoring. I 
> am personally not affected by this. Whoever will need that, they will find a 
> way to make it happen.
> 
> I would really appreciate it if we found consensus among these points I wrote 
> down. From my point of view, the most ideal outcome would be to make CEP-12 
> happen which cleans it all up and makes it more robust.
> 
> My perception is that we have always found the most practical solution and if 
> it brings value to a user (being able to inspect diagnostic events offline) 
> we should not avoid delivering that. Something 
> similar happened in e.g. Password validator / generator (CEP-24) where the 
> most ideal solution was to base it on transactional configuration but even 
> though we are not there yet, that did not stop us from delivering it because 
> for some entities in this space it brought value anyway. 
> 
> It may seem as if I am invested in delivering it because I spent some time on 
> that already - that is not the case - I am OK to drop diagnostic events 
> persistence if you insist on that but honestly I just do not see any 
> compelling argument to do so. The library we are using that for (Chronicle 
> Queues) is already there, it is just a functionality addition and as I said, 
> if somebody comes and delivers a solution which would replace it, I do not 
> see it to be problematic. It would be replaced as anything else (FQL, Audit 
> ...). 
> 
> On Sun, Sep 29, 2024 at 9:06 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>> Strong +1 to the file format issue, and if we're building a wish list - it 
>> would be great if we could read the file format without pulling in 
>> cassandra-all.  Long term, I'd love to see this for SSTables & Commit logs 
>> as well.
>> 
>> I've long been a fan of Gradle subprojects because it makes this kind of 
>> thing fairly easy. 
>> 
>> Jon
>> 
>> 
>> 
>> On Sun, Sep 29, 2024 at 11:46 AM Andrew Weaver <andrewjwea...@gmail.com> 
>> wrote:
>>> I'm late to the discussion here, but I want to add my experience from 
>>> dealing with audit logs specifically. 
>>> 
>>> Chronicle has some advantages (binary, compact) but it has a serious 
>>> disadvantage from a consumption standpoint. It's not a well-supported file 
>>> format. Audit logs are something that I think most operators are interested 
>>> in archiving for compliance purposes and analyzing offline for any number 
>>> of reasons and an oddball file format is an unnecessary hurdle for the 
>>> audit logs use-case. 
>>> 
>>> I would welcome support for an existing format that is compact, 
>>> high-performance and compatible with common tools (Spark, etc.).
>>> 
>>> On Sun, Sep 29, 2024, 10:11 AM Štefan Miklošovič <smikloso...@apache.org> 
>>> wrote:
>>>> Thank you all for your answers and opinions. I would like to have some 
>>>> kind of a resolution here in order to move forward, especially with 
>>>> relation to CEP-12 I mentioned earlier. (1).
>>>> 
>>>> I think we have these options:
>>>> 
>>>> 1) Do nothing and wait until this gets back to us, probably in a more 
>>>> serious way (we find a bug and we will not be able to update it because it 
>>>> would be on "ea" or new features will be available only in newer versions).
>>>> 
>>>> 2) Fork it and continue to maintain it - I do not think this is realistic, 
>>>> nobody is going to take care of forking that and maintaining it long term.
>>>> 
>>>> 3) Do nothing but refactor it in such a way that it will be easier to 
>>>> replace it with something else in the future. CEP-12 is not only adding 
>>>> persistence to diagnostic events but the patch I have also makes the whole 
>>>> logging more robust. Even though it is all on Chronicle Queues (FQL, Audit 
>>>> ...), there are some differences between the implementations, and I think 
>>>> that refactoring it so that it has a clear class structure and hierarchy 
>>>> (bottom of CEP-12) will give us an easier job if we ever go to replace 
>>>> that.
>>>> 
>>>> 4) Proceed with CEP-12 even though we know we are building it on top of 
>>>> something which should not be there.
>>>> 
>>>> 5) Do absolutely nothing until we replace it with something else and we 
>>>> get rid of what is there right now - that would mean that we will not 
>>>> benefit from the code which is easier to maintain etc (if CEP-12 is not 
>>>> going to materialize) which I think is a welcomed attribute of the code 
>>>> base to have.
>>>> 
>>>> I was thinking more about stuff like protobuf and while I do see benefits 
>>>> of that, honestly, it just does not matter too much if it is done like 
>>>> that or not. I mean, sure, it would be cool to have, but we could spend a 
>>>> lot of effort on protobuf and integrating with it or on anything which 
>>>> would make the consumption of these events language-agnostic but these are 
>>>> quite niche scenarios and I think that time might be used somewhere else 
>>>> more effectively.
>>>> 
>>>> The bottom line is that I am reluctant to do anything unless CEP-12 makes 
>>>> it in one way or another (either with diagnostic persistence or without it 
>>>> but with a nice refactoring) and, let's get real here, I do not think that 
>>>> anybody is going to spend any time on this particular piece of the 
>>>> functionality either. So the net result is that it will be either 
>>>> atrophying or we at least clean it up so whoever comes next has an easier 
>>>> job to replace it. 
>>>> 
>>>> (1) 
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-12+Diagnostics+events+persistence+and+their+exposure+in+virtual+tables
>>>> 
>>>> On Tue, Sep 24, 2024 at 6:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>> Hi,
>>>>> 
>>>>>> I just don't understand what "good enough performance" is. 
>>>>> Should really specify throughput. There is a single thread writing 
>>>>> records to the log and it's a bottleneck around a few hundred thousand 
>>>>> entries/sec and 1 GB/sec. It doesn't scale to arbitrary throughput 
>>>>> requirements.
>>>>> 
>>>>>> What is a "predictable footprint"? Was that measured too? How did we 
>>>>>> quantify that?
>>>>> You can set a rolling cycle to limit the size of the log. It's not that 
>>>>> predictable disk space wise because rolling is time based, and that is 
>>>>> one of the things I don't like about Chronicle.
>>>>> 
>>>>>> This is interesting, if I understand correctly, the messages are 
>>>>>> weighted and the heavier they are, the more probable it is they will be 
>>>>>> dropped when it is overloaded? Or vice versa, the tighter ones are 
>>>>>> dropped first?
>>>>> It's still a FIFO queue. Elements aren't dropped from the queue they are 
>>>>> dropped by the producers who don't have to wait for the consumer of the 
>>>>> queue to catch up. The queue size is described in terms of weight not 
>>>>> number of elements so it can bound memory usage.
>>>>> 
>>>>>> Have we _ever_ experienced in production that some log events were 
>>>>>> really dropped? Has anybody ever hit that?
>>>>> Dropping samples is off by default so it can be used in a lossless way.
>>>>> 
>>>>> Notionally one of the use cases of full query logging is that you have a 
>>>>> cluster that is overloaded and want to find out what is causing it. These 
>>>>> nodes may be low on IO/CPU and turning on the full query log could cause 
>>>>> additional timeouts, so one goal of the full query log is that enabling 
>>>>> it shouldn't make things worse.
>>>>> 
>>>>> That is the motivation for memory limits and not blocking request threads 
>>>>> on IO. Really there should also be rate limits and random sampling 
>>>>> because right now dropping samples will be biased towards dropping large 
>>>>> footprint samples.
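
A minimal sketch of the weight-bounded, non-blocking hand-off described above, assuming records are byte arrays weighed by their length. The class name WeightBoundedQueue and its methods are hypothetical and are not Cassandra's actual BinLog code; the sketch only illustrates why producers drop instead of waiting, and why large records tend to be dropped more often than small ones.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Hypothetical illustration: a FIFO queue whose capacity is expressed in
// total weight (bytes) rather than in number of elements.
class WeightBoundedQueue {
    // Permits represent the weight still available in the queue.
    private final Semaphore availableWeight;
    private final ConcurrentLinkedQueue<byte[]> queue = new ConcurrentLinkedQueue<>();

    WeightBoundedQueue(int maxWeight) {
        this.availableWeight = new Semaphore(maxWeight);
    }

    // Assumption: weight is the record length, with a floor of 1.
    private static int weigh(byte[] record) {
        return Math.max(1, record.length);
    }

    // Producer side: never blocks. Returns false when the record does not
    // fit in the remaining weight, in which case the producer drops it.
    boolean offer(byte[] record) {
        if (!availableWeight.tryAcquire(weigh(record))) {
            return false;
        }
        queue.add(record);
        return true;
    }

    // Consumer side (the single writer thread): takes the oldest record and
    // returns its weight to the pool. Returns null when the queue is empty.
    byte[] poll() {
        byte[] record = queue.poll();
        if (record != null) {
            availableWeight.release(weigh(record));
        }
        return record;
    }
}
```

With a limit of 10 and 8 units already queued, a 4-byte record is rejected while a 2-byte record is still accepted, which is the bias towards dropping large samples mentioned above. Nothing already in the queue is ever evicted, so ordering stays FIFO.
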
>>>>> 
>>>>> David Capwell mentioned some performance issues. I recall we talked about 
>>>>> it and I did a quick microbenchmark and didn't have a problem writing 
>>>>> records (1 gigabyte/sec, hundreds of thousands of entries) so I am not 
>>>>> sure in which scenarios performance is bad and whether it is 
>>>>> addressable. Not sure it matters since Chronicle's approach to OSS is so 
>>>>> problematic. 
>>>>> 
>>>>> Ariel
>>>>> On Tue, Sep 17, 2024, at 4:27 AM, Štefan Miklošovič wrote:
>>>>>> to Benedict:
>>>>>> 
>>>>>> well ... I was not around when the decision about the usage of Chronicle 
>>>>>> Queues was made. I think that at that time it was the most obvious 
>>>>>> candidate without reinventing the wheel given the features and 
>>>>>> capabilities it had so taking something off the shelf was a natural 
>>>>>> conclusion.
>>>>>> 
>>>>>> Josh / Jordan:
>>>>>> 
>>>>>> not only FQL but Audit as well; these are two separate things. There is 
>>>>>> also quite a "rich" ecosystem around that.
>>>>>> 
>>>>>> 1) nodetool commands like
>>>>>> 
>>>>>> enableauditlog
>>>>>> enablefullquerylog
>>>>>> disableauditlog
>>>>>> disablefullquerylog
>>>>>> getauditlog
>>>>>> getfullquerylog
>>>>>> 
>>>>>> Also, because the files it produces are binary, we need special 
>>>>>> tooling to inspect them; it lives in tools/fqltool with a bunch of 
>>>>>> classes, and there is also an AuditLogViewer for reviewing audit logs.
>>>>>> 
>>>>>> There are MBean methods enabling nodetool commands.
>>>>>> 
>>>>>> We have also shipped that in two major releases (4.0 and now in 5.0) so 
>>>>>> the community is quite well used to this, they have the processes set 
>>>>>> around this etc.
>>>>>> 
>>>>>> I mention this all because it is just not so easy to replace it with 
>>>>>> something else if somebody wanted that, in any case. How do we even go 
>>>>>> around deprecating this if we are indeed going to replace that?
>>>>>> 
>>>>>> To discuss the release aspect they have in place: I think you are right 
>>>>>> that the latest ea is as close as possible, if not the same, as what 
>>>>>> they release privately. Yes. But if we want to stick to the rule that we 
>>>>>> upgrade only to the latest ea release before their next minor, then
>>>>>> 
>>>>>> 1) we will be always at least one minor late
>>>>>> 2) we do not know when they make up their minds to transition to a new 
>>>>>> minor so we can upgrade to the latest ea one minor before
>>>>>> 3) if something is broken and we need to fix it and we are on ea, then 
>>>>>> what we get to update to is the latest ea at that time which might fix 
>>>>>> the issue but it will also bring new stuff in which might open doors to 
>>>>>> instability as well. So we update to fix the bugs but we might include 
>>>>>> new ones unknowingly.
>>>>>> 
>>>>>> Anyway, I don't think this has any silver bullet solution, we might just 
>>>>>> stick to the latest "ea" and be done with it. I do not expect this 
>>>>>> project to evolve wildly and unpredictably, it just solves "one 
>>>>>> problem", there is basically nothing new coming in.
>>>>>> 
>>>>>> Brandon:
>>>>>> 
>>>>>> I understand your concerns about phoning home but
>>>>>> 
>>>>>> 1) we already resolved this by setting the respective property
>>>>>> 2) I do not think that Chronicle will mess with this once they introduce 
>>>>>> that. There is nothing to "improve" or "change" there. It is phoning 
>>>>>> home or not and it is driven by one property. If they made a change that 
>>>>>> we can not turn it off then we would really be in trouble but for now we 
>>>>>> are not and practically speaking I don't expect this would change.
>>>>>> 
>>>>>> I know that this might sound like wishful thinking but in practical 
>>>>>> terms I really just don't expect this phoning home thing would come back 
>>>>>> ever.
>>>>>> 
>>>>>> Speaking of alternatives, I think the primary reason Chronicle was used 
>>>>>> is this (1).
>>>>>> 
>>>>>> "It's goal is good enough performance, predictable footprint, simplicity 
>>>>>> in terms of implementation and configuration and most importantly 
>>>>>> minimal impact on producers of log records."
>>>>>> 
>>>>>> While I understand English (I guess, well enough :D), I just don't 
>>>>>> understand what "good enough performance" is. How is this measured? What 
>>>>>> is a "predictable footprint"? Was that measured too? How did we quantify 
>>>>>> that?
>>>>>> 
>>>>>> " Performance safety is accomplished by feeding items to the binary log 
>>>>>> using a weighted queue and dropping records if the binary log falls 
>>>>>> sufficiently far behind."
>>>>>> 
>>>>>> This is interesting, if I understand correctly, the messages are 
>>>>>> weighted and the heavier they are, the more probable it is they will be 
>>>>>> dropped when it is overloaded? Or vice versa, the tighter ones are 
>>>>>> dropped first?
>>>>>> 
>>>>>> Have we _ever_ experienced in production that some log events were 
>>>>>> really dropped? Has anybody ever hit that?
>>>>>> 
>>>>>> When it comes to alternatives, what about logback + slf4j? It has 
>>>>>> appenders where we want, it is sync / async, we can code some nio 
>>>>>> appender too I guess, it logs it as text into a file so we do not need 
>>>>>> any special tooling to review that. For tailing which Chronicle also 
>>>>>> offers, I guess "tail -f that.log" just does the job? logback even rolls 
>>>>>> the files after they are big enough so it rolls the files the same way 
>>>>>> after some configured period / size as Chronicle does (It even 
>>>>>> compresses the logs).
>>>>>> 
>>>>>> Do we log so much that battle-tested logback is just absolutely not 
>>>>>> enough for us? Come on, this is not rocket science; we do not need to 
>>>>>> use a library from the realm of "high frequency trading" just to append 
>>>>>> queries and audit logs as they are executed. logback can handle the load 
>>>>>> we have just fine imo ...
>>>>>> 
>>>>>> Or maybe I am completely wrong and we just HAVE TO use Chronicle?
>>>>>> 
>>>>>> (1) 
>>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/binlog/BinLog.java#L58-L69
>>>>>> 
>>>>>> On Tue, Sep 17, 2024 at 3:12 AM Brandon Williams <dri...@gmail.com> 
>>>>>> wrote:
>>>>>>> My concern is that we have to keep making sure it's not phoning 
>>>>>>> home(1,2).
>>>>>>> 
>>>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-18538
>>>>>>> (2) https://issues.apache.org/jira/browse/CASSANDRA-19656
>>>>>>> 
>>>>>>> Kind Regards,
>>>>>>> Brandon
>>>>>>> 
>>>>>>> On Mon, Sep 16, 2024 at 7:53 PM Josh McKenzie <jmcken...@apache.org> 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > I think it's FQLTool only right now; I bumped into it recently doing 
>>>>>>> > the JDK21 compat work.
>>>>>>> >
>>>>>>> > I'm not concerned about current usage / dependency, but if our usage 
>>>>>>> > expands this could start to become a problem and that's going to be a 
>>>>>>> > hard thing to track and manage.
>>>>>>> >
>>>>>>> > So reading through those issues Stefan, I think it boils down to:
>>>>>>> >
>>>>>>> > - The latest ea is code identical to the stable release
>>>>>>> > - Subsequent bugfixes get applied to the customer-only stable branch 
>>>>>>> > and one release forward
>>>>>>> > - Projects running ea releases would need to cherry-pick those 
>>>>>>> > bugfixes back or run on the next branch's ea, which could introduce 
>>>>>>> > the project to API changes or other risks
>>>>>>> >
>>>>>>> > Assuming that's the case... blech. Our exposure is low, but that 
>>>>>>> > seems like a real pain.
>>>>>>> >
>>>>>>> > On Mon, Sep 16, 2024, at 5:16 PM, Benedict wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> > Don’t we essentially just use it as a file format for storing a 
>>>>>>> > couple of kinds of append-only data?
>>>>>>> >
>>>>>>> > I was never entirely clear on the value it brought to the project.
>>>>>>> >
>>>>>>> >
>>>>>>> > On 16 Sep 2024, at 22:11, Jordan West <jw...@apache.org> wrote:
>>>>>>> >
>>>>>>> > 
>>>>>>> > Thanks for the sleuthing Stefan! This definitely is a bit 
>>>>>>> > unfortunate. It sounds like a replacement is not really practical so 
>>>>>>> > I'll ignore that option for now, until a viable alternative is 
>>>>>>> > proposed. I am -1 on us writing our own without strong, strong 
>>>>>>> > justification -- primarily because I think the likelihood is we 
>>>>>>> > introduce more bugs before getting to something stable.
>>>>>>> >
>>>>>>> > Regarding the remaining options, mostly some thoughts:
>>>>>>> >
>>>>>>> > - it would be nice to have some specific evidence of other projects 
>>>>>>> > using the EA versions and what their developers have said about it.
>>>>>>> > - it sounds like if we go with the EA route, the onus to test for 
>>>>>>> > correctness / compatibility increases. They do test but anything 
>>>>>>> > marked "early access" I think deserves more scrutiny from the C* 
>>>>>>> > community before release. That could come in the form of more tests 
>>>>>>> > (or showing that we already have good coverage of where it's used).
>>>>>>> > - I assume each time we upgrade we would pick the most recently 
>>>>>>> > released EA version
>>>>>>> >
>>>>>>> > Jordan
>>>>>>> >
>>>>>>> >
>>>>>>> > On Mon, Sep 16, 2024 at 1:46 PM Štefan Miklošovič 
>>>>>>> > <smikloso...@apache.org> wrote:
>>>>>>> >
>>>>>>> > We are using a library called Chronicle Queue (1) and its 
>>>>>>> > dependencies and we ship them in the distribution tarball.
>>>>>>> >
>>>>>>> > The version we use in 5.0 / trunk as I write this is 2.23.36. If you 
>>>>>>> > look closely here (2), there is one more release like this, 2.23.37 
>>>>>>> > and after that all these releases have "ea" in their name.
>>>>>>> >
>>>>>>> > "ea" stands for "early access". The project has changed the 
>>>>>>> > versioning / development model in such a way that "ea" releases act, 
>>>>>>> > more or less, as glorified snapshots which are indeed released to 
>>>>>>> > Maven Central but the "regular" releases are not there. The reason 
>>>>>>> > behind this is that "regular" releases are published only for 
>>>>>>> > customers who pay the company behind this project, which offers 
>>>>>>> > commercial support for that.
>>>>>>> >
>>>>>>> > "regular" releases are meant to get all the bug fixes after "ea" is 
>>>>>>> > published and they are official stable releases. On the other hand 
>>>>>>> > "ea" releases are the ones where the development happens and every 
>>>>>>> > now and then, once the developers think that it is time to cut new 
>>>>>>> > 2.x, they just publish that privately.
>>>>>>> >
>>>>>>> > I was investigating how this all works here (3) and while they said 
>>>>>>> > that, I quote (4):
>>>>>>> >
>>>>>>> > "In my experience this is consumed by a large number of open source 
>>>>>>> > projects reliably (for our other artifacts too). This development/ea 
>>>>>>> > branch still goes through an extensive test suite prior to release. 
>>>>>>> > Releases from this branch will contain the latest features and bug 
>>>>>>> > fixes."
>>>>>>> >
>>>>>>> > I am not completely sure if we are OK with this. For the record, Mick 
>>>>>>> > is not overly comfortable with that and Brandon would prefer to just 
>>>>>>> > replace it / get rid of this dependency (comments / reasons / 
>>>>>>> > discussion from (5) to the end)
>>>>>>> >
>>>>>>> > The question is if we are OK with how things are and if we are then 
>>>>>>> > what are the rules when upgrading the version of this project in 
>>>>>>> > Cassandra in the context of "ea" versions they publish.
>>>>>>> >
>>>>>>> > If we are not OK with this, then the question is what we are going to 
>>>>>>> > replace it with.
>>>>>>> >
>>>>>>> > If we are going to replace it, I very briefly took a look and there 
>>>>>>> > is practically nothing out there which would hit all the buttons for 
>>>>>>> > us. Chronicle is just perfect for this job and I am not a fan of 
>>>>>>> > rewriting this at all.
>>>>>>> >
>>>>>>> > I would like to have this resolved because there is CEP-12 I plan to 
>>>>>>> > deliver and I hit this and I do not want to base that work on 
>>>>>>> > something we might eventually abandon. There are some ideas for 
>>>>>>> > CEP-12 how to bypass this without using Chronicle but I would like to 
>>>>>>> > firstly hear your opinion.
>>>>>>> >
>>>>>>> > Regards
>>>>>>> >
>>>>>>> > (1) https://github.com/OpenHFT/Chronicle-Queue
>>>>>>> > (2) https://repo1.maven.org/maven2/net/openhft/chronicle-core/
>>>>>>> > (3) https://github.com/OpenHFT/Chronicle-Core/issues/668
>>>>>>> > (4) 
>>>>>>> > https://github.com/OpenHFT/Chronicle-Core/issues/668#issuecomment-2322038676
>>>>>>> > (5) 
>>>>>>> > https://issues.apache.org/jira/browse/CASSANDRA-18712?focusedCommentId=17878254&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17878254
>>>>>>> >
>>>>>>> >
>>>>>
