[graylog2] Re: RegEx Trouble

Jarred Masterson Fri, 09 May 2014 08:41:14 -0700

Success!  I have at last sorted this out.  As it turns out, the issue was 
non-obvious because of the way browsers collapse consecutive whitespace 
characters when rendering HTML.

I came upon the answer when I decided to take another run at writing a 
working regex for these messages.  I brought up the extractor creation page 
and pasted in the ((?<=\d\.\d\s)\d+(?=\.)) regex as I knew it was very 
close to working and had confirmed that it did in fact work on other 
testing platforms.  To my shock, the "Try!" immediately produced the 
correct result!  I was overjoyed, but a little perplexed since I had tried 
that exact regex before.  I filled out the remaining fields and saved the 
extractor.  As messages began processing I immediately noticed that only 
about 10% of the messages that should have been matched were producing 
results.  To add to the confusion, in some instances the extractor was 
pulling a number from further into the string, passing up the set that I 
wanted.

I began to analyze what the successfully extracted messages had in common 
that the failed ones lacked.  I noticed that every extraction that 
succeeded had three or more digits in the resulting field, even though the 
displayed messages showed no discernible difference in the preceding 
characters that the lookbehind was matching against.  I 
began pondering the structure of the messages themselves and decided to 
look at the output of the command that produced them directly.  The 
messages actually come from a FreeBSD system cron job that runs "iostat -x" 
and then pipes the output to logger.  When I ran the command directly on 
the FreeBSD console I realized where the problem lay.  The command output 
is formatted for human readability and the component statistics are 
displayed so that the decimal places always align on the screen. 
Basically, if a number has fewer than three digits before the decimal 
point, extra spaces are placed in front of it.
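
A minimal Java sketch of the effect described above (Graylog2 extractors use 
Java regular expressions).  The sample line is hypothetical: it mimics the 
aligned iostat columns rather than reproducing real output, and the 
collapse() helper is only an assumption standing in for what a browser does 
to runs of whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleSpaceLookbehind {
    // The original extractor regex: exactly one whitespace character
    // is allowed between the previous number and the wanted digits.
    static final Pattern SINGLE =
            Pattern.compile("((?<=\\d\\.\\d\\s)\\d+(?=\\.))");

    // Returns the first captured match, or null if nothing matches.
    static String extract(String message) {
        Matcher m = SINGLE.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    // Rough stand-in for browser rendering: runs of whitespace become one space.
    static String collapse(String message) {
        return message.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Hypothetical message with column padding preserved.
        String raw = "root: da2   75.6  49.7 4743.9 3183.8    6   1.3   6";

        // Two spaces precede "49", so the lookbehind fails there and the
        // first match is found further along the line.
        System.out.println(extract(raw));            // prints 4743

        // With the padding collapsed, the same regex finds the wanted field.
        System.out.println(extract(collapse(raw)));  // prints 49
    }
}
```

The same regex gives different answers for the stored message and for the 
message as it appears on screen, which is why testing against the displayed 
text was misleading.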

I altered my regex like so:  ((?<=\d\.\d\s{1,3})\d+(?=\.))   Adding the 
{1,3} in the lookbehind accounts for possible multiple whitespace 
characters.  (Thankfully that doesn't raise the lookbehind arbitrary-length 
exception from Java; that only applies to open-ended quantifiers such as 
'+'.)  Like magic, the extractor is now working beautifully.  As I look back 
on this, I have noticed that if I do a CSV extraction of the data the 
whitespace characters are preserved, and if I view the page source on a 
Graylog2 search HTML page, all of the whitespace characters are there.  So 
the confusion here was caused by the fact that when HTML is rendered by the 
browser, consecutive spaces are collapsed into one unless they are 
explicitly coded not to be, for example with a non-breaking space (&nbsp;).
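
A minimal Java sketch of the corrected regex, run against one, two and 
three spaces of padding.  The three sample lines are hypothetical and only 
vary the width of the second number; they are not real iostat output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehind {
    // The corrected regex: the lookbehind accepts one to three whitespace
    // characters, so its maximum length is still bounded.
    static final Pattern BOUNDED =
            Pattern.compile("((?<=\\d\\.\\d\\s{1,3})\\d+(?=\\.))");

    // Returns the first captured match, or null if nothing matches.
    static String extract(String message) {
        Matcher m = BOUNDED.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical messages with padding preserved.
        String[] samples = {
            "root: da2   75.6   9.7 4743.9 3183.8    6   1.3   6", // three spaces
            "root: da2   75.6  49.7 4743.9 3183.8    6   1.3   6", // two spaces
            "root: da2   75.6 149.7 4743.9 3183.8    6   1.3   6"  // one space
        };
        for (String sample : samples) {
            System.out.println(extract(sample)); // prints 9, then 49, then 149
        }
    }
}
```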

Lennart,  I don't know whether it is worth it, and I'm aware that this may 
be an edge case, but it might help some users if the extractor creation 
page showed some sort of UI indication when a message contains consecutive 
whitespace characters.

On Wednesday, May 7, 2014 10:03:51 AM UTC-6, Jarred Masterson wrote:
>
> To confess upfront, I am a noob with RegEx but I've made some decent 
> progress in the past few days.  I have a couple of extractors working well 
> but I'm running into an issue with one that seems like it should work.
>
> First here is an example line that I am matching against:
> root: da2 75.6 49.7 4743.9 3183.8 6 1.3 6
>
> This is output from FreeBSD iostat -x and I have working extractors for 
> the device name and the first numbered field which is read operations.  I'm 
> on 0.20.1 and I had to pull the digits prior to the decimal place due to 
> the number converter not dealing with floating point numbers. I see from 
> the GitHub commits that this has been fixed in 0.20.2!
>
> I am trying now to pull the second metric which is the write operations 
> per second and in this case is 75.6.
>
> It seems like this should work:
> (?<=\d+\b)\d+(?=\.)
>
> I've also tried moving the \b around, such as (?<=\d+)\b\d+(?=\.)  I am 
> also a little confused about whether I need to enclose the whole 
> thing in parentheses. My working extractors are enclosed in () but I get 
> errors when trying that with the above example.
>
