This sounds similar to jsmn, a JSON tokenizer I've been playing with
this week ( http://zserge.bitbucket.org/jsmn.html ). It gives you a
struct per token, and each struct contains the type of the token
(string, primitive, object or array) and offsets to the start and end
of the token in the original char array.  It's turned out to be very
fast in the tests I've been doing.
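
To make the idea concrete, here is a rough sketch of that kind of
offset-based token in C. This is not jsmn's code or API: the struct and
function names are made up for illustration, it only recognises
double-quoted strings, and it ignores backslash escapes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token record: offsets into the caller's buffer, no copies. */
struct tok {
    size_t start;   /* offset of the first character inside the quotes */
    size_t end;     /* offset one past the last character of the string */
};

/* Record the offsets of every double-quoted string in buf[0..len).
 * Backslash escapes are not handled. Returns the number of tokens stored. */
static size_t scan_strings(const char *buf, size_t len,
                           struct tok *toks, size_t max)
{
    size_t n = 0, i = 0;

    assert(buf != NULL && toks != NULL);
    while (i < len && n < max) {
        if (buf[i] != '"') {
            i++;
            continue;
        }
        size_t start = ++i;         /* first character after the quote */
        while (i < len && buf[i] != '"')
            i++;
        if (i == len)
            break;                  /* unterminated string: stop scanning */
        toks[n].start = start;
        toks[n].end = i;
        n++;
        i++;                        /* skip the closing quote */
    }
    return n;
}
```

The input buffer is never modified or copied; a caller reads a token as
the bytes from buf + start up to (not including) buf + end.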

Brian

On Wed, Oct 31, 2012 at 5:55 PM,  <[email protected]> wrote:
> On Wed, 31 Oct 2012, Rainer Gerhards wrote:
>
>> Hi all,
>>
>> There is the dangling issue that rsyslog has grown out of its current
>> queue subsystem. I am currently considering a refactoring or a complete
>> redesign. I initially wanted to write a large blog post with all details and
>> ideas, but have now opted to split this in a couple of parts - both because
>> I have trouble finding time to do the "big one" at once; and also it
>> probably is smarter to get feedback asap.
>>
>> So here is the initial part:
>>
>> http://blog.gerhards.net/2012/10/rsyslog-disk-queues-refactor-or.html
>>
>> This will get anyone interested in the queue subsystem a broad
>> understanding of how it works - and why. Please share any concerns you have
>> about the current system as well as wishes/suggestions on what should
>> improve. Deeply technical information is fine, actually appreciated.
>>
>> I intend to let the discussion run and write the other parts of the blog
>> series when "events warrant it" ;) Due to other projects, I can probably not
>> discuss 10 hours a day, but will try to be as active as possible (which
>> hopefully means "much"). The intent is to come up with a solution that will
>> be good for the next five years to come...
>
>
> Thinking a bit more about the disk format.
>
> We have two competing requirements
>
> 1. making it as fast as possible for rsyslog to read and write the data
>
> 2. making it human readable so that it can be salvaged by a person if
> something goes horribly wrong.
>
> for the former, binary data structures are desirable
>
> for the latter, you want everything in text
>
> For rsyslog, this is greatly simplified by the fact that everything we are
> processing is text, and does not have any embedded newlines.
>
>
> One approach to consider is to not store anything in the file that can be
> re-calculated (i.e. store the rawmessage, a little extra metadata and then
> run it through the parsing stack when you dequeue the message)
>
> This costs a significant amount of CPU, and runs the risk that the parsing
> may not end up being the same (processing a queue file after a restart)
>
> In addition, with version 7 and the ability to set variables and fields in
> structures, the data in a queue file tied to a specific action may have been
> manipulated significantly since the message arrived.
>
> So I don't think this is the way we want to go.
>
>
> Another approach is to define everything as text fields (i.e. name=value\n)
> and then parse it when you read it in.
>
> This is also pretty expensive in CPU.
>
>
> One trick that we can pull to greatly speed up the processing is to play
> pointer games.
>
> If you take a line of text, you can very cheaply walk through it and record
> a pointer to the beginning of each word, replacing spaces with null
> characters. This is FAR faster than copying the data to new memory locations
> and then lets you treat the resulting strings as standard C strings.
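
A minimal sketch of the pointer trick described above, assuming the line
sits in a writable buffer and words are separated by plain spaces only.
The function name split_words is made up for illustration; it is not
rsyslog code.

```c
#include <assert.h>
#include <stddef.h>

/* Split `line` in place: spaces are overwritten with '\0' and a pointer
 * to the start of each word is stored in `words`. No bytes are copied.
 * Returns the number of words stored (at most `max`). */
static size_t split_words(char *line, char **words, size_t max)
{
    size_t n = 0;
    char *p = line;

    assert(line != NULL && words != NULL);
    while (*p != '\0') {
        while (*p == ' ')
            *p++ = '\0';            /* terminate the previous word */
        if (*p == '\0' || n == max)
            break;
        words[n++] = p;             /* remember where this word starts */
        while (*p != '\0' && *p != ' ')
            p++;
    }
    return n;
}
```

After the call, each words[i] is an ordinary C string that still lives
inside the original buffer, so the buffer has to outlive the pointers.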
>
> I would suggest a variation of this.
>
> store everything as name=value<null> (doing a 'strings file' will return
> name=value\n)
>
> add a header to each message, something along the lines of:
>
> RSYSLOG_HEADER Size=###### <base64 encoded data><null>
>
> where size is the size of the base64 encoded stuff
>
> The binary data encoded in the base64 blob would be along the lines of:
> <offset to rawmessage><offset to timestamp><offset to received time>...
> for the standard properties, followed by
> <offset to name><offset to value> for all the dynamically generated data
>
> where the offset is to the start of the value field in each name=value<null>
> 'line' for the standard properties.
>
> This would allow rsyslog to _very_ quickly know where everything is, and
> copy the standard properties into the queue record memory structure. For
> dynamic properties it would be a smidge slower, as you can't just strcpy
> the values from a 'known' location to the location in the message
> structure. You would have to read the name, which is
> (<offset to value> - <offset to name> - 1) bytes of text, and then
> set up the location for it before copying the value into place. But this
> should still be faster than parsing arbitrary text (and if not, just have
> the dynamically generated data fields be parsed when they are read; so far
> they aren't that common, so this cost won't dominate).
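
A sketch of how a reader might use such offsets, assuming the base64
blob has already been decoded into plain integers. The layout, the
struct and the function names here are hypothetical; nothing like this
exists in rsyslog today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded header entry for one dynamic property. */
struct dyn_off {
    uint32_t name;      /* offset of the first byte of the name */
    uint32_t value;     /* offset of the first byte of the value */
};

/* Standard property: the offset points straight at a NUL-terminated
 * value, so no parsing is needed. Returns NULL if the offset is bad. */
static const char *std_value(const char *rec, size_t rec_len, uint32_t off)
{
    return off < rec_len ? rec + off : NULL;
}

/* Dynamic property: the name is not NUL-terminated (it ends at '='),
 * so its length is value - name - 1 bytes. */
static int dyn_name_is(const char *rec, const struct dyn_off *d,
                       const char *name)
{
    assert(d->value > d->name);
    size_t len = (size_t)(d->value - d->name - 1);
    return strlen(name) == len && memcmp(rec + d->name, name, len) == 0;
}
```

With a record such as rawmsg=...<null>timestamp=...<null>, std_value()
is a single bounds check and a pointer addition, while a dynamic
property costs one length computation and a memcmp on the name.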
>
>
> For data recovery purposes (where a person needs to manually tweak the file
> to recover from problems), do
>
> strings queuefile | sed 's/RSYSLOG_HEADER Size=.*$/RSYSLOG_HEADER Size=0/'
>
> to clear the header, and rsyslog can have a fallback mode (or external
> repair tool) that does the slow parsing of everything (if it's an external
> tool, it can create a new header line for each record)
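
The slow path could look roughly like the sketch below: walk the
name=value<null> entries of one record and recompute the value offsets
that a header would have carried. This assumes names never contain '='
and that entries are NUL-separated; the function name rebuild_offsets
is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan rec[0..len) as a sequence of name=value entries separated by
 * '\0' and store the offset of each value in value_off. Entries that
 * contain no '=' are skipped. Returns the number of offsets stored. */
static size_t rebuild_offsets(const char *rec, size_t len,
                              uint32_t *value_off, size_t max)
{
    size_t n = 0, pos = 0;

    assert(rec != NULL && value_off != NULL);
    while (pos < len && n < max) {
        const char *entry = rec + pos;
        const char *end = memchr(entry, '\0', len - pos);
        size_t elen = end ? (size_t)(end - entry) : len - pos;
        const char *eq = memchr(entry, '=', elen);

        if (eq != NULL)
            value_off[n++] = (uint32_t)(eq + 1 - rec);
        pos += elen + 1;            /* step over the entry and its NUL */
    }
    return n;
}
```

An external repair tool could run something like this over each record
and then write a fresh header from the recomputed offsets.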
>
> David Lang
>
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com/professional-services/
> What's up with rsyslog? Follow https://twitter.com/rgerhards
> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
> sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T
> LIKE THAT.