[389-devel] Re: Logging future direction and ideas.

William Brown Thu, 16 May 2019 15:44:34 -0700


> On 17 May 2019, at 00:16, Ludwig Krispenz <[email protected]> wrote:
> 
> 
> On 05/16/2019 02:56 AM, William Brown wrote:
>> Replying to both Simon and Ludwig in the one mail ...
>> 
>> To just comment to one point from Ludwig:
>> 
>>> This may sound too negative, I agree that logging deserves attention and 
>>> improvement, but we need to agree on what we want to achieve.
>> I don't think this is negative, at all and I think it's exactly the kind of 
>> feedback I wanted. :) So thank you!
>> 
>> 
>>> Hi William,
>>> I am more than interested!
>>> 
>>> I'd like to learn it one day anyway.
>>> So if there will be no objections and we'll go forward now,
>>> I think, it is important to agree on some points of decreasing a bus factor:
>>> 
>>> - As you said, it should have very detailed comments on all of the
>>>  language features and technics you'll use.
>>> - I think it will be nice to have an additional documentation which will
>>>  describe the basic setup for the development you use. All the toold you
>>>  need to develop, compile, test and debug.
>>> - Also, some nice links for the basic tutorials on Rust types, concepts, 
>>> etc.
>>> - I think, we should have detailed unit tests. It will help to
>>>  understand the code better and we will have less bugs (hopefully).
>> So I'm writing a document for the wiki to cover this *and* rust supports 
>> tests as part of the system - that will be all be included. Is that 
>> acceptable?
>> 
>>> And the final and big point I wanted to mention:
>>> 
>>> - We should be prepared for a slow review process because we have quite 
>>> limited
>>>  recources in the team and a lot of work should be done (WebUI still has to 
>>> be refactored to React,
>>>  and it is only a small piece of the workload).
>>>  Also, I think, it makes sense to have the smallest Rust PRs that can be 
>>> put together as an independent unit.
>> That's totally reasonable. I'll try to keep it to small parts and will build 
>> up as we go. See below for ideas on how I'll work toward this.
>> 
>>> But everything is just my opinion and I don't know what others think and
>>> if everyone is prepared to join it. I am feeling excited though :)
>>> 
>>> Thanks,
>>> Simon
>>> 
>>> P.S. check the contribution guide please. Espesially a part about
>>> 'rebase-force-push'. I think it is nice not to force push during
>>> the review process (and rebase-squash only after you got an ACK).
>> Yep, I saw this change. I'll keep it in mind :)
>> 
>> 
>>> On 15 May 2019, at 19:37, Ludwig <[email protected]> wrote:
>>> 
>>> Hi William,
>>> 
>>> this is a good starting point to discuss and decide how to move forward 
>>> with logging (although it looks like you are already some steps ahead).
>>> 
>>> I have no strong ties to existing logging format and if we do it in rust or 
>>> not I don't care, but I do not want to use existing functionality for the 
>>> sake af a new unified approach.
>>> 
>>> First we need to define what is the purpose and usage of our logs, what do 
>>> we want to keep or extend. I see these main area:
>>> 
>>> - statistics
>>> 
>>> - auditing (authentications, modifications)
>>> 
>>> . debugging (for my usage the most important aspect)
>>> 
>>> Next: what are the real problems
>>> 
>>> - performance: yes, it would be nice to have less logging impact on 
>>> performance, but I haven't seen real complaints and doing it differently 
>>> does not guarantee better performance, we have debug levels (like tracing) 
>>> which are not usable because they slow down everything, I do not think this 
>>> will be resolved by just replacing the library
>>> 
>>> - information:  we continue to add information to the logs, and there are 
>>> still open requests
>>> 
>>> 
>>> So: If I look at your suggestions I do like some and have concerns with 
>>> others, and I am not sure if the priorities are right.
>>> 
>>> You list a couple of tickets related to logging, but forgot others, eg:
>>> 
>>> https://pagure.io/389-ds-base/issue/47822 - improve logging of ACL decisions
>>> 
>>> https://pagure.io/389-ds-base/issue/48680 - enable logging for 3rd party 
>>> libs
>>> 
>>> https://pagure.io/389-ds-base/issue/50002 - improve password policy logging
>>> 
>>> https://pagure.io/389-ds-base/issue/50187 - improve replication logging
>> These are all things that you you are correct, that I missed.
>> 
>> So I want to focus on your point about what we want from logs, IE stats, 
>> auditing, debugging. I think this is a great summary of what we want, but to 
>> make it more detailed:
>> 
>> * Stats about operations from start to finish
>> * Auditing of client operations
>> * Debugging of client operations
>> * Debugging of internal server machinery and state
>> * Reporting of errors outside of the normal operation system.
>> 
>> Today we current have:
>> 
>> access - auditing of operations at a highlevel
>> audit - auditing of modifications and writes
>> error - debugging of internal server machinery (we tend not to enable the 
>> levels of debugging operations ...)
>> 
>> So I think there is room to improve.
>> 
>> Now I think that performance so far is only a barrier in terms of preventing 
>> us from adding more detail, because it slows down operations. This is why 
>> it's pretty high on the list of "to fix".
>> 
>>> 
>>> these requests for improvements of logging exist for quite some time, they 
>>> were accepted as useful and good to have, but we didn't have the capacity 
>>> to work on them. The work is tedious, going thru ACL or replication code 
>>> and decide what and how to log better requires good knowledge of the areas 
>>> and is not fancy - but I wonder if the effort wouldn't be spent better to 
>>> improve the "what" and not the "how".
>>> 
>>> 
>>> Now: some comments on your suggestions.
>>> 
>>> - perf: I think everything we can do to reduce locking in logging is a good 
>>> thing, if we do logging more asynchronous it is ok for me as long as we can 
>>> see the "real time" of the event and get the logs ordered. If replacing the 
>>> "log" component by a rust implemention helps it is good. We could in a 
>>> first step keep the slapi_log_* api and only do it more efficient
>> Yes, this was my original thought was to write the rust impl in parallel and 
>> then have the existing slapi_log_* just translate over to it.
>> 
>>> - merging of logs: I do not like the idea to have everything in one log 
>>> file, if you say we can provide tools to split, we can also provide tools 
>>> to merge. The logs have a different format and for good reasons, eg the 
>>> audit log is ldif like and can easily be used to replay changes, why force 
>>> it into another format.
>>> 
>>> Also, in my opinion, the different logs have a different life time. You 
>>> might want to keep all audit logs, keep the error log for the main events 
>>> for some longer time than the fast growing access log, defining log 
>>> rotation based on the log type is a benfit in my opinion
>>> 
>>> - operation orientation: there are logging events which are independent of 
>>> operations, especially in replicatio. I would not want to artificially 
>>> enforce a "operation context" eg for changelog processing, tombstone 
>>> purging,.....
>>> 
>>> - format: for machines it might be good that logs are machine readable, but 
>>> I did spend a lot of times reading logs and lokkinng for patterns and 
>>> clues, I think as soon as they are in json I will stop doing this.
>> These are good counter points, and I think that on some reflection, I agree 
>> with them. Perhaps there is a better way to think about this.
>> 
>> We could implement this, such that from the "server" side, all log messages 
>> are sent to the log thread, and then the log thread can take that log event 
>> and then split out multiple logs that are relevant to the focus areas.
>> 
>> The server would still build up a single "log-unit" structure, and then the 
>> log thread would turn this into a set of relevant access, audit, error, and 
>> operation/profilling/debugging messages. That way the server can focus on 
>> operations, and can log as-it-pleases, then the log thread can actually turn 
>> that data into useful contextualised information.
>> 
>> For example, an operation could build a structure (hypothetical, has no 
>> bearing on reality ...) such as:
>> 
>> [
>>     { timestamp, operation start },
>>     { timestamp, modify },
>>     { timestamp, bind dn },
>>     { timestamp, ACL decision is .... },
>>     { timestamp, nentries = 5 },
>>     { timestamp, dn=x modify cn },
>>     { timestamp, dn=y modify uid },
>>     { timestamp, plugin memberOf err=??? },
>> ]
>> 
>> 
>> This would be sent as a block to the log thread, and we could then process 
>> it out such that:
>> 
>> [
>>     { timestamp, operation start },  -> access
>>     { timestamp, modify }, -> access
>>     { timestamp, bind dn }, -> access
>>     { timestamp, ACL decision is .... },
>>     { timestamp, nentries = 5 }, -> access
>>     { timestamp, dn=x modify cn }, -> audit
>>     { timestamp, dn=y modify uid }, -> audit
>>     { timestamp, plugin memberOf err=??? }, -> error
>> ]
>> 
>> Finally, the whole "structured" item, in all it's glorious, gory detail, 
>> could go to the operation log, along with timings, and we can then profile 
>> and get performance data from that.
>> 
>> This way we still have the current logging expectations, but can then add 
>> others and retain high amounts of detail for debugging as we require.
>> 
>> Does this sound like a reasonable approach?
> yes
>> I think this would be a "final goal", so to formalise the stages:
>> 
>> * Add build tooling and a simple (dummy) log thread as a "getting started". 
>> Supplement with documentation on wiki.
>> * Fill-in the log thread to support an "operation log", and add basic 
>> operation log hooks in the server.
>> * Fill in more operation log points in the server to build detail
>> * change slapi_log_err to log to the new rust thread, continue to generate 
>> error file
>> * change slapi_log_audit to log to the new rust thread, continue to generate 
>> audit file
>> * change slapi_log_access to log to the new rust thread, continue to 
>> generate access file.
>> * remove former logging code
> I wonder if we really could have one api  eg slapi_log_* and different 
> implementations, keep the current, implement a new one and allow to chose


I don't think I understand this comment, could you explain a bit more what you 
have in mind? 


>> 
>>> 
>>> This may sound too negative, I agree that logging deserves attention and 
>>> improvement, but we need to agree on what we want to achieve.
>>> 
>>> 
>>> Best regards,
>>> 
>>> Ludwig
>>> 
>>> 
>>> On 05/10/2019 05:13 AM, William Brown wrote:
>>>> Hi all,
>>>> 
>>>> So I think it's time for me to write some logging code to improve the 
>>>> situation. Relevant links before we start:
>>>> 
>>>> https://pagure.io/389-ds-base/issue/49415
>>>> http://www.port389.org/docs/389ds/design/logging-performance-improvement.html
>>>> https://pagure.io/389-ds-base/issue/50350
>>>> https://pagure.io/389-ds-base/issue/50361
>>>> 
>>>> 
>>>> All of these links touch on issues around logging, and I think they all 
>>>> combine to create three important points:
>>>> 
>>>> * The performance of logging should be improved
>>>> * The amount of details (fine grain) and information in logs should improve
>>>> * The structure of the log content should be improved to aid interaction 
>>>> (possibly even machine parsable)
>>>> 
>>>> I will turn this into a design document, but there are some questions I 
>>>> would like some input to help answer as part of this process to help set 
>>>> the direction and tasks to achieve.
>>>> 
>>>> -- Should our logs as they exist today, continue to exist?
>>>> 
>>>> I think that my view on this is "no". I think if we make something better, 
>>>> we have little need to continue to support our legacy interfaces. Of 
>>>> course, this would be a large change and it may not sit comfortably with 
>>>> people.
>>>> 
>>>> A large part of this thinking is that the "new" log interface I want to 
>>>> add is focused on *operations* rather than auditing accesses or changes, 
>>>> or over looking at errors. The information of both the current 
>>>> access/audit/error would largely be melded into a single operation log, 
>>>> and then with tools like logconv, we
>>>> could parse and extract information that would behave the same way as 
>>>> access/error/audit.
>>>> 
>>>> At the same time - I can see how people *may* want a "realtime" audit of 
>>>> operations as they occur (IE access log), but this still today is already 
>>>> limited by having to "wait" for operations to complete.
>>>> 
>>>> In a crash scenario, we would be able to still view the logs that are 
>>>> queued, so I think there are not so many concerns about losing information 
>>>> in these cases (in fact we'd probably have more).
>>>> 
>>>> -- What should the operation log look like?
>>>> 
>>>> I think it should be structured, and should be whole-units of information, 
>>>> related to a single operation. IE only at the conclusion of the operation 
>>>> is it logged (thus the async!). It should support arbitrary, nested 
>>>> timers, and would *not* support log levels - it's a detailed log of the 
>>>> process each query goes through.
>>>> 
>>>> An example could be something like:
>>>> 
>>>> [timestamp] - [conn=id op=id] - start operation
>>>> [timestamp] - [conn=id op=id] - start time = time ...
>>>> [timestamp] - [conn=id op=id] - started internal search '(some=filter)'
>>>> [timestamp] - [conn=id op=id parentop=id] - start nested operation
>>>> [timestamp] - [conn=id op=id parentop=id] - start time = time ...
>>>> ...
>>>> [timestamp] - [conn=id op=id parentop=id] - end time = time...
>>>> [timestamp] - [conn=id op=id parentop=id] - duration = diff end - start
>>>> [timestamp] - [conn=id op=id parentop=id] - end nested operation - result 
>>>> -> ...
>>>> [timestamp] - [conn=id op=id] - ended internal search '(some=filter)'
>>>> ...
>>>> [timestamp] - [conn=id op=id] - end time = time
>>>> [timestamp] - [conn=id op=id] - duration = diff end - start
>>>> 
>>>> 
>>>> Due to the structured - blocked nature, there would be no interleaving of 
>>>> operation messages. therefor the log would appear as:
>>>> 
>>>> [timestamp] - [conn=00 op=00] - start operation
>>>> [timestamp] - [conn=00 op=00] - start time = time ...
>>>> [timestamp] - [conn=00 op=00] - started internal search '(some=filter)'
>>>> [timestamp] - [conn=00 op=00 parentop=01] - start nested operation
>>>> [timestamp] - [conn=00 op=00 parentop=01] - start time = time ...
>>>> ...
>>>> [timestamp] - [conn=00 op=00 parentop=01] - end time = time...
>>>> [timestamp] - [conn=00 op=00 parentop=01] - duration = diff end - start
>>>> [timestamp] - [conn=00 op=00 parentop=01] - end nested operation - result 
>>>> -> ...
>>>> [timestamp] - [conn=00 op=00] - ended internal search '(some=filter)'
>>>> ...
>>>> [timestamp] - [conn=00 op=00] - end time = time
>>>> [timestamp] - [conn=00 op=00] - duration = diff end - start
>>>> [timestamp] - [conn=22 op=00] - start operation
>>>> [timestamp] - [conn=22 op=00] - start time = time ...
>>>> [timestamp] - [conn=22 op=00] - started internal search '(some=filter)'
>>>> [timestamp] - [conn=22 op=00 parentop=01] - start nested operation
>>>> [timestamp] - [conn=22 op=00 parentop=01] - start time = time ...
>>>> ...
>>>> [timestamp] - [conn=22 op=00 parentop=01] - end time = time...
>>>> [timestamp] - [conn=22 op=00 parentop=01] - duration = diff end - start
>>>> [timestamp] - [conn=22 op=00 parentop=01] - end nested operation - result 
>>>> -> ...
>>>> [timestamp] - [conn=22 op=00] - ended internal search '(some=filter)'
>>>> ...
>>>> [timestamp] - [conn=22 op=00] - end time = time
>>>> [timestamp] - [conn=22 op=00] - duration = diff end - start
>>>> 
>>>> An alternate method for structuring could be a machine readable format 
>>>> like json:
>>>> 
>>>> {
>>>>     'timestamp': 'time',
>>>>     'duration': ....,
>>>>     'bind': 'dn of who initiated operation',
>>>>     'events': [
>>>>         'debug': 'msg',
>>>>         'internal_search': {
>>>>              'timestamp': 'time',
>>>>              'duration': ....,
>>>>         }
>>>>     ],
>>>> }
>>>> 
>>>> This would make writing tools like logconv much easier, as we wouldn't 
>>>> need regex or others to parse and understand these logs.
>>>> 
>>>> 
>>>> -- Should our log interface from slapi_log_err etc change?
>>>> 
>>>> This is a super tricky question ... I think the answer is yes - we are 
>>>> using log_err for messages that should be associated to an operation - and 
>>>> additionally, we often disable or lower the level of error messages due to 
>>>> the low performance. It would be hard to shift behaviour and expectations 
>>>> around the behaviour of this log, so a better idea would be to fix things 
>>>> oourselves by adjusting how we log in plugins. We'll probably always need 
>>>> to support log_err, but then those developers would not gain the benefit 
>>>> of our improvements.
>>>> 
>>>> We could translate log_err to emit simple text messages into the events 
>>>> section of the log, but for full benefit, using the operation log 
>>>> interfaces would be best.
>>>> 
>>>> -- Should we still improve the performance of logs as they exist today?
>>>> 
>>>> Probably yes - but I think there is a good reason to approach this 
>>>> differently.
>>>> 
>>>> I think we should be log_async.c and log_operation.c in parallel to 
>>>> existing logging, and have a similer feature gating flag in use like we 
>>>> have done for other features. We can then build this in parallel and gain 
>>>> confidence and experience with it. Once complete, we could then change the 
>>>> log_access, log_audit and log_err macros/functions to call related log_op 
>>>> calls, or rewrite those logs to have log_async.c execution backends.
>>>> 
>>>> 
>>>> 
>>>> SUMMARY:
>>>> 
>>>> * Make slapi_log_op and an async log system in parallel to existing log 
>>>> infra
>>>> * Determine a deprecation or improvement plan for other log types
>>>> * Port log types to async, or remove them.
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sincerely,
>>>> 
>>>> William
>>>> _______________________________________________
>>>> 389-devel mailing list -- [email protected]
>>>> To unsubscribe send an email to [email protected]
>>>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
>>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>>> List Archives: 
>>>> https://lists.fedoraproject.org/archives/list/[email protected]
>>> _______________________________________________
>>> 389-devel mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives: 
>>> https://lists.fedoraproject.org/archives/list/[email protected]
>> —
>> Sincerely,
>> 
>> William Brown
>> 
>> Senior Software Engineer, 389 Directory Server
>> SUSE Labs
>> _______________________________________________
>> 389-devel mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>> List Archives: 
>> https://lists.fedoraproject.org/archives/list/[email protected]
> 
> -- 
> Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, 
> Eric Shander
> _______________________________________________
> 389-devel mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: 
> https://lists.fedoraproject.org/archives/list/[email protected]

—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs
_______________________________________________
389-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/[email protected]

[389-devel] Re: Logging future direction and ideas.

Reply via email to