> On 17 May 2019, at 00:16, Ludwig Krispenz <[email protected]> wrote: > > > On 05/16/2019 02:56 AM, William Brown wrote: >> Replying to both Simon and Ludwig in the one mail ... >> >> To just comment to one point from Ludwig: >> >>> This may sound too negative, I agree that logging deserves attention and >>> improvement, but we need to agree on what we want to achieve. >> I don't think this is negative, at all and I think it's exactly the kind of >> feedback I wanted. :) So thank you! >> >> >>> Hi William, >>> I am more than interested! >>> >>> I'd like to learn it one day anyway. >>> So if there will be no objections and we'll go forward now, >>> I think, it is important to agree on some points of decreasing a bus factor: >>> >>> - As you said, it should have very detailed comments on all of the >>> language features and technics you'll use. >>> - I think it will be nice to have an additional documentation which will >>> describe the basic setup for the development you use. All the toold you >>> need to develop, compile, test and debug. >>> - Also, some nice links for the basic tutorials on Rust types, concepts, >>> etc. >>> - I think, we should have detailed unit tests. It will help to >>> understand the code better and we will have less bugs (hopefully). >> So I'm writing a document for the wiki to cover this *and* rust supports >> tests as part of the system - that will be all be included. Is that >> acceptable? >> >>> And the final and big point I wanted to mention: >>> >>> - We should be prepared for a slow review process because we have quite >>> limited >>> recources in the team and a lot of work should be done (WebUI still has to >>> be refactored to React, >>> and it is only a small piece of the workload). >>> Also, I think, it makes sense to have the smallest Rust PRs that can be >>> put together as an independent unit. >> That's totally reasonable. I'll try to keep it to small parts and will build >> up as we go. See below for ideas on how I'll work toward this. >> >>> But everything is just my opinion and I don't know what others think and >>> if everyone is prepared to join it. I am feeling excited though :) >>> >>> Thanks, >>> Simon >>> >>> P.S. check the contribution guide please. Espesially a part about >>> 'rebase-force-push'. I think it is nice not to force push during >>> the review process (and rebase-squash only after you got an ACK). >> Yep, I saw this change. I'll keep it in mind :) >> >> >>> On 15 May 2019, at 19:37, Ludwig <[email protected]> wrote: >>> >>> Hi William, >>> >>> this is a good starting point to discuss and decide how to move forward >>> with logging (although it looks like you are already some steps ahead). >>> >>> I have no strong ties to existing logging format and if we do it in rust or >>> not I don't care, but I do not want to use existing functionality for the >>> sake af a new unified approach. >>> >>> First we need to define what is the purpose and usage of our logs, what do >>> we want to keep or extend. I see these main area: >>> >>> - statistics >>> >>> - auditing (authentications, modifications) >>> >>> . debugging (for my usage the most important aspect) >>> >>> Next: what are the real problems >>> >>> - performance: yes, it would be nice to have less logging impact on >>> performance, but I haven't seen real complaints and doing it differently >>> does not guarantee better performance, we have debug levels (like tracing) >>> which are not usable because they slow down everything, I do not think this >>> will be resolved by just replacing the library >>> >>> - information: we continue to add information to the logs, and there are >>> still open requests >>> >>> >>> So: If I look at your suggestions I do like some and have concerns with >>> others, and I am not sure if the priorities are right. >>> >>> You list a couple of tickets related to logging, but forgot others, eg: >>> >>> https://pagure.io/389-ds-base/issue/47822 - improve logging of ACL decisions >>> >>> https://pagure.io/389-ds-base/issue/48680 - enable logging for 3rd party >>> libs >>> >>> https://pagure.io/389-ds-base/issue/50002 - improve password policy logging >>> >>> https://pagure.io/389-ds-base/issue/50187 - improve replication logging >> These are all things that you you are correct, that I missed. >> >> So I want to focus on your point about what we want from logs, IE stats, >> auditing, debugging. I think this is a great summary of what we want, but to >> make it more detailed: >> >> * Stats about operations from start to finish >> * Auditing of client operations >> * Debugging of client operations >> * Debugging of internal server machinery and state >> * Reporting of errors outside of the normal operation system. >> >> Today we current have: >> >> access - auditing of operations at a highlevel >> audit - auditing of modifications and writes >> error - debugging of internal server machinery (we tend not to enable the >> levels of debugging operations ...) >> >> So I think there is room to improve. >> >> Now I think that performance so far is only a barrier in terms of preventing >> us from adding more detail, because it slows down operations. This is why >> it's pretty high on the list of "to fix". >> >>> >>> these requests for improvements of logging exist for quite some time, they >>> were accepted as useful and good to have, but we didn't have the capacity >>> to work on them. The work is tedious, going thru ACL or replication code >>> and decide what and how to log better requires good knowledge of the areas >>> and is not fancy - but I wonder if the effort wouldn't be spent better to >>> improve the "what" and not the "how". >>> >>> >>> Now: some comments on your suggestions. >>> >>> - perf: I think everything we can do to reduce locking in logging is a good >>> thing, if we do logging more asynchronous it is ok for me as long as we can >>> see the "real time" of the event and get the logs ordered. If replacing the >>> "log" component by a rust implemention helps it is good. We could in a >>> first step keep the slapi_log_* api and only do it more efficient >> Yes, this was my original thought was to write the rust impl in parallel and >> then have the existing slapi_log_* just translate over to it. >> >>> - merging of logs: I do not like the idea to have everything in one log >>> file, if you say we can provide tools to split, we can also provide tools >>> to merge. The logs have a different format and for good reasons, eg the >>> audit log is ldif like and can easily be used to replay changes, why force >>> it into another format. >>> >>> Also, in my opinion, the different logs have a different life time. You >>> might want to keep all audit logs, keep the error log for the main events >>> for some longer time than the fast growing access log, defining log >>> rotation based on the log type is a benfit in my opinion >>> >>> - operation orientation: there are logging events which are independent of >>> operations, especially in replicatio. I would not want to artificially >>> enforce a "operation context" eg for changelog processing, tombstone >>> purging,..... >>> >>> - format: for machines it might be good that logs are machine readable, but >>> I did spend a lot of times reading logs and lokkinng for patterns and >>> clues, I think as soon as they are in json I will stop doing this. >> These are good counter points, and I think that on some reflection, I agree >> with them. Perhaps there is a better way to think about this. >> >> We could implement this, such that from the "server" side, all log messages >> are sent to the log thread, and then the log thread can take that log event >> and then split out multiple logs that are relevant to the focus areas. >> >> The server would still build up a single "log-unit" structure, and then the >> log thread would turn this into a set of relevant access, audit, error, and >> operation/profilling/debugging messages. That way the server can focus on >> operations, and can log as-it-pleases, then the log thread can actually turn >> that data into useful contextualised information. >> >> For example, an operation could build a structure (hypothetical, has no >> bearing on reality ...) such as: >> >> [ >> { timestamp, operation start }, >> { timestamp, modify }, >> { timestamp, bind dn }, >> { timestamp, ACL decision is .... }, >> { timestamp, nentries = 5 }, >> { timestamp, dn=x modify cn }, >> { timestamp, dn=y modify uid }, >> { timestamp, plugin memberOf err=??? }, >> ] >> >> >> This would be sent as a block to the log thread, and we could then process >> it out such that: >> >> [ >> { timestamp, operation start }, -> access >> { timestamp, modify }, -> access >> { timestamp, bind dn }, -> access >> { timestamp, ACL decision is .... }, >> { timestamp, nentries = 5 }, -> access >> { timestamp, dn=x modify cn }, -> audit >> { timestamp, dn=y modify uid }, -> audit >> { timestamp, plugin memberOf err=??? }, -> error >> ] >> >> Finally, the whole "structured" item, in all it's glorious, gory detail, >> could go to the operation log, along with timings, and we can then profile >> and get performance data from that. >> >> This way we still have the current logging expectations, but can then add >> others and retain high amounts of detail for debugging as we require. >> >> Does this sound like a reasonable approach? > yes >> I think this would be a "final goal", so to formalise the stages: >> >> * Add build tooling and a simple (dummy) log thread as a "getting started". >> Supplement with documentation on wiki. >> * Fill-in the log thread to support an "operation log", and add basic >> operation log hooks in the server. >> * Fill in more operation log points in the server to build detail >> * change slapi_log_err to log to the new rust thread, continue to generate >> error file >> * change slapi_log_audit to log to the new rust thread, continue to generate >> audit file >> * change slapi_log_access to log to the new rust thread, continue to >> generate access file. >> * remove former logging code > I wonder if we really could have one api eg slapi_log_* and different > implementations, keep the current, implement a new one and allow to chose
I don't think I understand this comment, could you explain a bit more what you have in mind? >> >>> >>> This may sound too negative, I agree that logging deserves attention and >>> improvement, but we need to agree on what we want to achieve. >>> >>> >>> Best regards, >>> >>> Ludwig >>> >>> >>> On 05/10/2019 05:13 AM, William Brown wrote: >>>> Hi all, >>>> >>>> So I think it's time for me to write some logging code to improve the >>>> situation. Relevant links before we start: >>>> >>>> https://pagure.io/389-ds-base/issue/49415 >>>> http://www.port389.org/docs/389ds/design/logging-performance-improvement.html >>>> https://pagure.io/389-ds-base/issue/50350 >>>> https://pagure.io/389-ds-base/issue/50361 >>>> >>>> >>>> All of these links touch on issues around logging, and I think they all >>>> combine to create three important points: >>>> >>>> * The performance of logging should be improved >>>> * The amount of details (fine grain) and information in logs should improve >>>> * The structure of the log content should be improved to aid interaction >>>> (possibly even machine parsable) >>>> >>>> I will turn this into a design document, but there are some questions I >>>> would like some input to help answer as part of this process to help set >>>> the direction and tasks to achieve. >>>> >>>> -- Should our logs as they exist today, continue to exist? >>>> >>>> I think that my view on this is "no". I think if we make something better, >>>> we have little need to continue to support our legacy interfaces. Of >>>> course, this would be a large change and it may not sit comfortably with >>>> people. >>>> >>>> A large part of this thinking is that the "new" log interface I want to >>>> add is focused on *operations* rather than auditing accesses or changes, >>>> or over looking at errors. The information of both the current >>>> access/audit/error would largely be melded into a single operation log, >>>> and then with tools like logconv, we >>>> could parse and extract information that would behave the same way as >>>> access/error/audit. >>>> >>>> At the same time - I can see how people *may* want a "realtime" audit of >>>> operations as they occur (IE access log), but this still today is already >>>> limited by having to "wait" for operations to complete. >>>> >>>> In a crash scenario, we would be able to still view the logs that are >>>> queued, so I think there are not so many concerns about losing information >>>> in these cases (in fact we'd probably have more). >>>> >>>> -- What should the operation log look like? >>>> >>>> I think it should be structured, and should be whole-units of information, >>>> related to a single operation. IE only at the conclusion of the operation >>>> is it logged (thus the async!). It should support arbitrary, nested >>>> timers, and would *not* support log levels - it's a detailed log of the >>>> process each query goes through. >>>> >>>> An example could be something like: >>>> >>>> [timestamp] - [conn=id op=id] - start operation >>>> [timestamp] - [conn=id op=id] - start time = time ... >>>> [timestamp] - [conn=id op=id] - started internal search '(some=filter)' >>>> [timestamp] - [conn=id op=id parentop=id] - start nested operation >>>> [timestamp] - [conn=id op=id parentop=id] - start time = time ... >>>> ... >>>> [timestamp] - [conn=id op=id parentop=id] - end time = time... >>>> [timestamp] - [conn=id op=id parentop=id] - duration = diff end - start >>>> [timestamp] - [conn=id op=id parentop=id] - end nested operation - result >>>> -> ... >>>> [timestamp] - [conn=id op=id] - ended internal search '(some=filter)' >>>> ... >>>> [timestamp] - [conn=id op=id] - end time = time >>>> [timestamp] - [conn=id op=id] - duration = diff end - start >>>> >>>> >>>> Due to the structured - blocked nature, there would be no interleaving of >>>> operation messages. therefor the log would appear as: >>>> >>>> [timestamp] - [conn=00 op=00] - start operation >>>> [timestamp] - [conn=00 op=00] - start time = time ... >>>> [timestamp] - [conn=00 op=00] - started internal search '(some=filter)' >>>> [timestamp] - [conn=00 op=00 parentop=01] - start nested operation >>>> [timestamp] - [conn=00 op=00 parentop=01] - start time = time ... >>>> ... >>>> [timestamp] - [conn=00 op=00 parentop=01] - end time = time... >>>> [timestamp] - [conn=00 op=00 parentop=01] - duration = diff end - start >>>> [timestamp] - [conn=00 op=00 parentop=01] - end nested operation - result >>>> -> ... >>>> [timestamp] - [conn=00 op=00] - ended internal search '(some=filter)' >>>> ... >>>> [timestamp] - [conn=00 op=00] - end time = time >>>> [timestamp] - [conn=00 op=00] - duration = diff end - start >>>> [timestamp] - [conn=22 op=00] - start operation >>>> [timestamp] - [conn=22 op=00] - start time = time ... >>>> [timestamp] - [conn=22 op=00] - started internal search '(some=filter)' >>>> [timestamp] - [conn=22 op=00 parentop=01] - start nested operation >>>> [timestamp] - [conn=22 op=00 parentop=01] - start time = time ... >>>> ... >>>> [timestamp] - [conn=22 op=00 parentop=01] - end time = time... >>>> [timestamp] - [conn=22 op=00 parentop=01] - duration = diff end - start >>>> [timestamp] - [conn=22 op=00 parentop=01] - end nested operation - result >>>> -> ... >>>> [timestamp] - [conn=22 op=00] - ended internal search '(some=filter)' >>>> ... >>>> [timestamp] - [conn=22 op=00] - end time = time >>>> [timestamp] - [conn=22 op=00] - duration = diff end - start >>>> >>>> An alternate method for structuring could be a machine readable format >>>> like json: >>>> >>>> { >>>> 'timestamp': 'time', >>>> 'duration': ...., >>>> 'bind': 'dn of who initiated operation', >>>> 'events': [ >>>> 'debug': 'msg', >>>> 'internal_search': { >>>> 'timestamp': 'time', >>>> 'duration': ...., >>>> } >>>> ], >>>> } >>>> >>>> This would make writing tools like logconv much easier, as we wouldn't >>>> need regex or others to parse and understand these logs. >>>> >>>> >>>> -- Should our log interface from slapi_log_err etc change? >>>> >>>> This is a super tricky question ... I think the answer is yes - we are >>>> using log_err for messages that should be associated to an operation - and >>>> additionally, we often disable or lower the level of error messages due to >>>> the low performance. It would be hard to shift behaviour and expectations >>>> around the behaviour of this log, so a better idea would be to fix things >>>> oourselves by adjusting how we log in plugins. We'll probably always need >>>> to support log_err, but then those developers would not gain the benefit >>>> of our improvements. >>>> >>>> We could translate log_err to emit simple text messages into the events >>>> section of the log, but for full benefit, using the operation log >>>> interfaces would be best. >>>> >>>> -- Should we still improve the performance of logs as they exist today? >>>> >>>> Probably yes - but I think there is a good reason to approach this >>>> differently. >>>> >>>> I think we should be log_async.c and log_operation.c in parallel to >>>> existing logging, and have a similer feature gating flag in use like we >>>> have done for other features. We can then build this in parallel and gain >>>> confidence and experience with it. Once complete, we could then change the >>>> log_access, log_audit and log_err macros/functions to call related log_op >>>> calls, or rewrite those logs to have log_async.c execution backends. >>>> >>>> >>>> >>>> SUMMARY: >>>> >>>> * Make slapi_log_op and an async log system in parallel to existing log >>>> infra >>>> * Determine a deprecation or improvement plan for other log types >>>> * Port log types to async, or remove them. >>>> >>>> >>>> >>>> -- >>>> Sincerely, >>>> >>>> William >>>> _______________________________________________ >>>> 389-devel mailing list -- [email protected] >>>> To unsubscribe send an email to [email protected] >>>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html >>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines >>>> List Archives: >>>> https://lists.fedoraproject.org/archives/list/[email protected] >>> _______________________________________________ >>> 389-devel mailing list -- [email protected] >>> To unsubscribe send an email to [email protected] >>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html >>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines >>> List Archives: >>> https://lists.fedoraproject.org/archives/list/[email protected] >> — >> Sincerely, >> >> William Brown >> >> Senior Software Engineer, 389 Directory Server >> SUSE Labs >> _______________________________________________ >> 389-devel mailing list -- [email protected] >> To unsubscribe send an email to [email protected] >> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html >> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines >> List Archives: >> https://lists.fedoraproject.org/archives/list/[email protected] > > -- > Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn, > Commercial register: Amtsgericht Muenchen, HRB 153243, > Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, > Eric Shander > _______________________________________________ > 389-devel mailing list -- [email protected] > To unsubscribe send an email to [email protected] > Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html > List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines > List Archives: > https://lists.fedoraproject.org/archives/list/[email protected] — Sincerely, William Brown Senior Software Engineer, 389 Directory Server SUSE Labs _______________________________________________ 389-devel mailing list -- [email protected] To unsubscribe send an email to [email protected] Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/[email protected]
