> thinking more about stuff like protobuf and while I do see benefits of that,
> honestly, it just does not matter too much if it is done like that or not.

I disagree; I think having a language and environment agnostic file format matters a great deal. Unless we're talking specifically about the FQL case in which case I totally agree. :)

Part of what makes open source projects successful is their ability to modularly interop with other projects, often times in ways we don't predict; us using the Chronicle format for our binary files is choosing a format that's esoteric with fairly antagonistic licensing compared to something like protobufs. I'd strongly support either rolling the format change into the CEP-12 proposal or having another CEP for introducing protobuf, spark, etc - some kind of more broadly adopted format, and removing chronicle from our stack.

To Jon's point: having bespoke formats with all the code for it housed in cassandra-all rather than in a modular support library really ties our hands on a lot of fronts; updating the core DB in terms of serialization formats, of JDK versions, etc all potentially become client-impacting exercises.

On Mon, Sep 30, 2024, at 5:12 AM, Štefan Miklošovič wrote:
> That is all OK to mention. So as I read it correctly, Jon, myself, Andrew,
> David - we all would like to see some cross-language events ingestion. Other
> people do not seem to consider it important enough (just correct me if I am
> wrong). I do not mind, I am 50:50, not an absolute must but sure, let's add
> that to the wish list.
>
> I believe that unless it is an absolute show-stopper (which it is not) it is
> not necessary. It is something to keep in mind upon next refactorization. I
> am personally not affected by this. Whoever will need that, they will find a
> way to make it happen.
>
> I would really appreciate it if we found consensus among these points I wrote
> down. From my point of view, the most ideal outcome would be to make CEP-12
> happen which cleans it all up and makes it more robust.
>
> My perception is that we have always found the most practical solution and if
> it brings value to a user (being able to inspect diagnostic events offline
> for further inspection) we should not avoid delivering that. Something
> similar happened in e.g.
> Password validator / generator (CEP-24) where the
> most ideal solution was to base it on transactional configuration but even
> though we are not there yet, that did not stop us from delivering it because
> for some entities in this space it brought value anyway.
>
> It may seem as if I am invested in delivering it because I spent some time on
> that already - that is not the case - I am OK to drop diagnostic events
> persistence if you insist on that but honestly I just do not see any
> compelling argument to do so. The library we are using that for (Chronicle
> Queues) is already there, it is just a functionality addition and as I said,
> if somebody comes and delivers a solution which would replace it, I do not
> see it to be problematic. It would be replaced as anything else (FQL, Audit
> ...).
>
> On Sun, Sep 29, 2024 at 9:06 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
>> Strong +1 to the file format issue, and if we're building a wish list - it
>> would be great if we could read the file format without pulling in
>> cassandra-all. Long term, I'd love to see this for SSTables & Commit logs
>> as well.
>>
>> I've long been a fan of Gradle subprojects because it makes this kind of
>> thing fairly easy.
>>
>> Jon
>>
>> On Sun, Sep 29, 2024 at 11:46 AM Andrew Weaver <andrewjwea...@gmail.com>
>> wrote:
>>> I'm late to the discussion here, but I want to add my experience from
>>> dealing with audit logs specifically.
>>>
>>> Chronicle has some advantages (binary, compact) but it has a serious
>>> disadvantage from a consumption standpoint. It's not a well-supported file
>>> format. Audit logs are something that I think most operators are interested
>>> in archiving for compliance purposes and analyzing offline for any number
>>> of reasons and an oddball file format is an unnecessary hurdle for the
>>> audit logs use-case.
>>>
>>> I would welcome support for an existing format that is compact,
>>> high-performance and compatible with common tools (Spark, etc.).
>>>
>>> On Sun, Sep 29, 2024, 10:11 AM Štefan Miklošovič <smikloso...@apache.org>
>>> wrote:
>>>> Thank you all for your answers and opinions. I would like to have some
>>>> kind of a resolution here in order to move forward, especially with
>>>> relation to CEP-12 I mentioned earlier. (1).
>>>>
>>>> I think we have these options:
>>>>
>>>> 1) Do nothing and wait until this gets back to us, probably in a more
>>>> serious way (we find a bug and we will not be able to update it because it
>>>> would be on "ea" or new features will be available only in newer versions).
>>>>
>>>> 2) Fork it and continue to maintain it - I do not think this is realistic,
>>>> nobody is going to take care of forking that and maintaining it long term.
>>>>
>>>> 3) Do nothing but refactor it in such a way that it will be easier to
>>>> replace it with something else in the future. CEP-12 is not only adding
>>>> persistence to diagnostic events but the patch I have also makes the whole
>>>> logging more robust. Even though it is all on Chronicle Queues (FQL, Audit
>>>> ...), there are some differences between them when it comes to the
>>>> implementation, and I think that by refactoring it so that it has a clear
>>>> class structure and hierarchy (bottom of CEP-12) we will have an easier
>>>> job if we ever go to replace that.
>>>>
>>>> 4) Proceed with CEP-12 even though we know we are building it on top of
>>>> something which should not be there.
>>>>
>>>> 5) Do absolutely nothing until we replace it with something else and we
>>>> get rid of what is there right now - that would mean that we will not
>>>> benefit from the code which is easier to maintain etc (if CEP-12 is not
>>>> going to materialize) which I think is a welcome attribute of the code
>>>> base to have.
>>>>
>>>> I was thinking more about stuff like protobuf and while I do see benefits
>>>> of that, honestly, it just does not matter too much if it is done like
>>>> that or not. I mean, sure, it would be cool to have, but we could spend a
>>>> lot of effort on protobuf and integrating with it or on anything which
>>>> would make the consumption of these events language-agnostic but these are
>>>> quite niche scenarios and I think that time might be used somewhere else
>>>> more effectively.
>>>>
>>>> The bottom line is that I am reluctant to do anything unless CEP-12 makes
>>>> it in one way or another (either with diagnostic persistence or without it
>>>> but with a nice refactoring) and, let's get real here, I do not think that
>>>> anybody is going to spend any time on this particular piece of the
>>>> functionality either. So the net result is that it will be either
>>>> atrophying or we at least clean it up so whoever comes next has an easier
>>>> job to replace it.
>>>>
>>>> (1)
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-12+Diagnostics+events+persistence+and+their+exposure+in+virtual+tables
>>>>
>>>> On Tue, Sep 24, 2024 at 6:11 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>> Hi,
>>>>>
>>>>>> I just don't understand what "good enough performance" is.
>>>>> Should really specify throughput. There is a single thread writing
>>>>> records to the log and it's a bottleneck around a few hundred thousand
>>>>> entries/sec and 1 GB/sec. It doesn't scale to arbitrary throughput
>>>>> requirements.
>>>>>
>>>>>> What is a "predictable footprint"? Was that measured too? How did we
>>>>>> quantify that?
>>>>> You can set a rolling cycle to limit the size of the log. It's not that
>>>>> predictable disk space wise because rolling is time based, and that is
>>>>> one of the things I don't like about Chronicle.
>>>>>
>>>>>> This is interesting, if I understand correctly, the messages are
>>>>>> weighted and the heavier they are, the more probable it is they will be
>>>>>> dropped when it is overloaded? Or vice versa, the tighter ones are
>>>>>> dropped first?
>>>>> It's still a FIFO queue. Elements aren't dropped from the queue; they are
>>>>> dropped by the producers, who don't have to wait for the consumer of the
>>>>> queue to catch up. The queue size is described in terms of weight, not
>>>>> number of elements, so it can bound memory usage.
>>>>>
>>>>>> Have we _ever_ experienced in production that some log events were
>>>>>> really dropped? Has anybody ever hit that?
>>>>> Dropping samples is off by default so it can be used in a lossless way.
>>>>>
>>>>> Notionally one of the use cases of full query logging is that you have a
>>>>> cluster that is overloaded and want to find out what is causing it. These
>>>>> nodes may be low on IO/CPU and turning on the full query log could cause
>>>>> additional timeouts so one goal of the full query log is that enabling it
>>>>> shouldn't make things worse.
>>>>>
>>>>> That is the motivation for memory limits and not blocking request threads
>>>>> on IO. Really there should also be rate limits and random sampling
>>>>> because right now dropping samples will be biased towards dropping large
>>>>> footprint samples.
>>>>>
>>>>> David Capwell mentioned some performance issues. I recall we talked about
>>>>> it and I did a quick microbenchmark and didn't have a problem writing
>>>>> records (1 gigabyte/sec, hundreds of thousands of entries) so I am not
>>>>> sure in which scenarios performance is bad and whether it is
>>>>> addressable. Not sure it matters since Chronicle's approach to OSS is so
>>>>> problematic.
>>>>>
>>>>> Ariel
>>>>> On Tue, Sep 17, 2024, at 4:27 AM, Štefan Miklošovič wrote:
>>>>>> to Benedict:
>>>>>>
>>>>>> well ... I was not around when the decision about the usage of Chronicle
>>>>>> Queues was made.
>>>>>> I think that at that time it was the most obvious
>>>>>> candidate without reinventing the wheel given the features and
>>>>>> capabilities it had so taking something off the shelf was a natural
>>>>>> conclusion.
>>>>>>
>>>>>> Josh / Jordan:
>>>>>>
>>>>>> not only FQL but Audit as well; these are two separate things. There is
>>>>>> also quite a "rich" ecosystem around that.
>>>>>>
>>>>>> 1) nodetool commands like
>>>>>>
>>>>>> enableauditlog
>>>>>> enablefullquerylog
>>>>>> disableauditlog
>>>>>> disablefullquerylog
>>>>>> getauditlog
>>>>>> getfullquerylog
>>>>>>
>>>>>> Also, because the files it produces are binary, we need special
>>>>>> tooling to inspect them; it is in tools/fqltool with a bunch of classes,
>>>>>> and there is also an AuditLogViewer for reviewing audit logs.
>>>>>>
>>>>>> There are MBean methods enabling nodetool commands.
>>>>>>
>>>>>> We have also shipped that in two major releases (4.0 and now in 5.0) so
>>>>>> the community is quite well used to this, they have the processes set
>>>>>> around this etc.
>>>>>>
>>>>>> I mention this all because it is just not so easy to replace it with
>>>>>> something else if somebody wanted that, in any case. How do we even go
>>>>>> around deprecating this if we are indeed going to replace that?
>>>>>>
>>>>>> To discuss the release aspect they have in place: I think you are right
>>>>>> that the latest ea is as close as possible, if not the same, as what
>>>>>> they release privately. Yes.
>>>>>> But if we want to stick to the rule that we
>>>>>> upgrade only to the latest ea release before their next minor, then
>>>>>>
>>>>>> 1) we will be always at least one minor late
>>>>>> 2) we do not know when they make up their minds to transition to a new
>>>>>> minor so we can upgrade to the latest ea one minor before
>>>>>> 3) if something is broken and we need to fix it and we are on ea, then
>>>>>> what we get to update to is the latest ea at that time which might fix
>>>>>> the issue but it will also bring new stuff in which might open doors to
>>>>>> instability as well. So we update to fix the bugs but we might include
>>>>>> new ones unknowingly.
>>>>>>
>>>>>> Anyway, I don't think this has any silver bullet solution, we might just
>>>>>> stick to the latest "ea" and be done with it. I do not expect this
>>>>>> project to evolve wildly and unpredictably, it just solves "one
>>>>>> problem", there is basically nothing new coming in.
>>>>>>
>>>>>> Brandon:
>>>>>>
>>>>>> I understand your concerns about phoning home but
>>>>>>
>>>>>> 1) we already resolved this by setting the respective property
>>>>>> 2) I do not think that Chronicle will mess with this once they introduce
>>>>>> that. There is nothing to "improve" or "change" there. It is phoning
>>>>>> home or not and it is driven by one property. If they made a change that
>>>>>> we can not turn it off then we would really be in trouble but for now we
>>>>>> are not and practically speaking I don't expect this would change.
>>>>>>
>>>>>> I know that this might sound like wishful thinking but in practical
>>>>>> terms I really just don't expect this phoning home thing would come back
>>>>>> ever.
>>>>>>
>>>>>> Speaking of alternatives, I think the primary reason Chronicle was used
>>>>>> is this (1).
>>>>>>
>>>>>> "It's goal is good enough performance, predictable footprint, simplicity
>>>>>> in terms of implementation and configuration and most importantly
>>>>>> minimal impact on producers of log records."
>>>>>>
>>>>>> While I understand English (I guess, well enough :D), I just don't
>>>>>> understand what "good enough performance" is. How is this measured? What
>>>>>> is a "predictable footprint"? Was that measured too? How did we quantify
>>>>>> that?
>>>>>>
>>>>>> " Performance safety is accomplished by feeding items to the binary log
>>>>>> using a weighted queue and dropping records if the binary log falls
>>>>>> sufficiently far behind."
>>>>>>
>>>>>> This is interesting, if I understand correctly, the messages are
>>>>>> weighted and the heavier they are, the more probable it is they will be
>>>>>> dropped when it is overloaded? Or vice versa, the tighter ones are
>>>>>> dropped first?
>>>>>>
>>>>>> Have we _ever_ experienced in production that some log events were
>>>>>> really dropped? Has anybody ever hit that?
>>>>>>
>>>>>> When it comes to alternatives, what about logback + slf4j? It has
>>>>>> appenders where we want, it is sync / async, we can code some nio
>>>>>> appender too I guess, it logs it as text into a file so we do not need
>>>>>> any special tooling to review that. For tailing which Chronicle also
>>>>>> offers, I guess "tail -f that.log" just does the job? logback even rolls
>>>>>> the files after they are big enough so it rolls the files the same way
>>>>>> after some configured period / size as Chronicle does (It even
>>>>>> compresses the logs).
>>>>>>
>>>>>> Do we log so much that battle-tested logback is just absolutely not
>>>>>> enough for us? Come on, this is not rocket science that we need to use
>>>>>> a library from the realm of "high frequency trading" to just append
>>>>>> queries and audit logs as they are executed. logback can handle the load
>>>>>> we have just fine imo ...
>>>>>>
>>>>>> Or maybe I am completely wrong and we just HAVE TO use Chronicle?
>>>>>>
>>>>>> (1)
>>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/binlog/BinLog.java#L58-L69
>>>>>>
>>>>>> On Tue, Sep 17, 2024 at 3:12 AM Brandon Williams <dri...@gmail.com>
>>>>>> wrote:
>>>>>>> My concern is that we have to keep making sure it's not phoning
>>>>>>> home(1,2).
>>>>>>>
>>>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-18538
>>>>>>> (2) https://issues.apache.org/jira/browse/CASSANDRA-19656
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Brandon
>>>>>>>
>>>>>>> On Mon, Sep 16, 2024 at 7:53 PM Josh McKenzie <jmcken...@apache.org>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > I think it's FQLTool only right now; I bumped into it recently doing
>>>>>>> > the JDK21 compat work.
>>>>>>> >
>>>>>>> > I'm not concerned about current usage / dependency, but if our usage
>>>>>>> > expands this could start to become a problem and that's going to be a
>>>>>>> > hard thing to track and manage.
>>>>>>> >
>>>>>>> > So reading through those issues Stefan, I think it boils down to:
>>>>>>> >
>>>>>>> > - The latest ea is code identical to the stable release
>>>>>>> > - Subsequent bugfixes get applied to the customer-only stable branch
>>>>>>> >   and one release forward
>>>>>>> > - Projects running ea releases would need to cherry-pick those bugfixes
>>>>>>> >   back or run on the next branch's ea, which could introduce the
>>>>>>> >   project to API changes or other risks
>>>>>>> >
>>>>>>> > Assuming that's the case... blech. Our exposure is low, but that
>>>>>>> > seems like a real pain.
>>>>>>> >
>>>>>>> > On Mon, Sep 16, 2024, at 5:16 PM, Benedict wrote:
>>>>>>> >
>>>>>>> > Don’t we essentially just use it as a file format for storing a
>>>>>>> > couple of kinds of append-only data?
>>>>>>> >
>>>>>>> > I was never entirely clear on the value it brought to the project.
>>>>>>> >
>>>>>>> > On 16 Sep 2024, at 22:11, Jordan West <jw...@apache.org> wrote:
>>>>>>> >
>>>>>>> > Thanks for the sleuthing Stefan!
>>>>>>> > This definitely is a bit
>>>>>>> > unfortunate. It sounds like a replacement is not really practical so
>>>>>>> > I'll ignore that option for now, until a viable alternative is
>>>>>>> > proposed. I am -1 on us writing our own without strong, strong
>>>>>>> > justification -- primarily because I think the likelihood is we
>>>>>>> > introduce more bugs before getting to something stable.
>>>>>>> >
>>>>>>> > Regarding the remaining options, mostly some thoughts:
>>>>>>> >
>>>>>>> > - it would be nice to have some specific evidence of other projects
>>>>>>> > using the EA versions and what their developers have said about it.
>>>>>>> > - it sounds like if we go with the EA route, the onus to test for
>>>>>>> > correctness / compatibility increases. They do test but anything
>>>>>>> > marked "early access" I think deserves more scrutiny from the C*
>>>>>>> > community before release. That could come in the form of more tests
>>>>>>> > (or showing that we already have good coverage of where it's used).
>>>>>>> > - I assume each time we upgrade we would pick the most recently
>>>>>>> > released EA version
>>>>>>> >
>>>>>>> > Jordan
>>>>>>> >
>>>>>>> > On Mon, Sep 16, 2024 at 1:46 PM Štefan Miklošovič
>>>>>>> > <smikloso...@apache.org> wrote:
>>>>>>> >
>>>>>>> > We are using a library called Chronicle Queue (1) and its
>>>>>>> > dependencies and we ship them in the distribution tarball.
>>>>>>> >
>>>>>>> > The version we use in 5.0 / trunk as I write this is 2.23.36. If you
>>>>>>> > look closely here (2), there is one more release like this, 2.23.37
>>>>>>> > and after that all these releases have "ea" in their name.
>>>>>>> >
>>>>>>> > "ea" stands for "early access". The project has changed the
>>>>>>> > versioning / development model in such a way that "ea" releases act,
>>>>>>> > more or less, as glorified snapshots which are indeed released to
>>>>>>> > Maven Central but the "regular" releases are not there.
>>>>>>> > The reason
>>>>>>> > behind this is that "regular" releases are published only for
>>>>>>> > customers who pay the company behind this project and they offer
>>>>>>> > commercial support for that.
>>>>>>> >
>>>>>>> > "regular" releases are meant to get all the bug fixes after "ea" is
>>>>>>> > published and they are official stable releases. On the other hand
>>>>>>> > "ea" releases are the ones where the development happens and every
>>>>>>> > now and then, once the developers think that it is time to cut a new
>>>>>>> > 2.x, they just publish that privately.
>>>>>>> >
>>>>>>> > I was investigating how this all works here (3) and while they said
>>>>>>> > that, I quote (4):
>>>>>>> >
>>>>>>> > "In my experience this is consumed by a large number of open source
>>>>>>> > projects reliably (for our other artifacts too). This development/ea
>>>>>>> > branch still goes through an extensive test suite prior to release.
>>>>>>> > Releases from this branch will contain the latest features and bug
>>>>>>> > fixes."
>>>>>>> >
>>>>>>> > I am not completely sure if we are OK with this. For the record, Mick
>>>>>>> > is not overly comfortable with that and Brandon would prefer to just
>>>>>>> > replace it / get rid of this dependency (comments / reasons /
>>>>>>> > discussion from (5) to the end)
>>>>>>> >
>>>>>>> > The question is if we are OK with how things are and if we are then
>>>>>>> > what are the rules when upgrading the version of this project in
>>>>>>> > Cassandra in the context of "ea" versions they publish.
>>>>>>> >
>>>>>>> > If we are not OK with this, then the question is what we are going to
>>>>>>> > replace it with.
>>>>>>> >
>>>>>>> > If we are going to replace it, I very briefly took a look and there
>>>>>>> > is practically nothing out there which would hit all the buttons for
>>>>>>> > us. Chronicle is just perfect for this job and I am not a fan of
>>>>>>> > rewriting this at all.
>>>>>>> >
>>>>>>> > I would like to have this resolved because there is CEP-12 I plan to
>>>>>>> > deliver and I hit this and I do not want to base that work on
>>>>>>> > something we might eventually abandon. There are some ideas for
>>>>>>> > CEP-12 how to bypass this without using Chronicle but I would like to
>>>>>>> > firstly hear your opinion.
>>>>>>> >
>>>>>>> > Regards
>>>>>>> >
>>>>>>> > (1) https://github.com/OpenHFT/Chronicle-Queue
>>>>>>> > (2) https://repo1.maven.org/maven2/net/openhft/chronicle-core/
>>>>>>> > (3) https://github.com/OpenHFT/Chronicle-Core/issues/668
>>>>>>> > (4)
>>>>>>> > https://github.com/OpenHFT/Chronicle-Core/issues/668#issuecomment-2322038676
>>>>>>> > (5)
>>>>>>> > https://issues.apache.org/jira/browse/CASSANDRA-18712?focusedCommentId=17878254&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17878254