Success! I have finally sorted this out. As it turns out, the issue was
non-obvious because of the way browsers render consecutive whitespace
characters in HTML.
I stumbled upon the answer when I decided to take another run at writing a
working regex for these messages. I brought up the extractor creation page
and pasted in the ((?<=\d\.\d\s)\d+(?=\.)) regex as I knew it was very
close to working and had confirmed that it did in fact work on other
testing platforms. To my shock, the "Try!" immediately produced the
correct result! I was overjoyed, but a little perplexed since I had tried
that exact regex before. I filled out the remaining fields and saved the
extractor. As messages began processing I immediately noticed that only
about 10% of the messages that should have been matched were producing
results. To add to the confusion, in some instances the extractor was
pulling a number from further into the string, skipping past the field that
I wanted.
I began to analyze what the successfully extracted messages had in common
that the failed ones lacked. I noticed that every extraction that
succeeded had three or more digits in the resulting field, even though in
the displayed message there was no discernible difference in the
preceding characters that the lookbehind was matching against. I
began pondering the structure of the messages themselves and decided to
look at the output of the command that produced them directly. The
messages actually come from a FreeBSD system cron job that runs "iostat -x"
and then pipes the output to logger. When I ran the command directly on
the FreeBSD console I realized where the problem lay. The command output
is formatted for human readability and the component statistics are
displayed so that the decimal places always align on the screen.
Basically, if a value has fewer than three digits before the decimal point,
it is padded with extra spaces in front of the number.
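To make the failure concrete, here is a minimal Java sketch (the extractor regexes are Java regexes, so plain java.util.regex should behave the same way). The sample lines are hypothetical: I made up the values and the padding to mimic aligned iostat -x output, and the class and method names are just for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleSpaceLookbehindDemo {
    // The first pattern from this post: exactly one whitespace character
    // is allowed between the previous field and the digits we want.
    static final Pattern SINGLE_SPACE =
            Pattern.compile("((?<=\\d\\.\\d\\s)\\d+(?=\\.))");

    // Returns the first match in the line, or empty if nothing matches.
    static Optional<String> firstMatch(String line) {
        Matcher m = SINGLE_SPACE.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical lines; values and padding are invented for illustration.
        String threeDigits = "root: da2 175.6 149.7 14743.9  3183.8    6   1.3   6";
        String twoDigits   = "root: da2  75.6  49.7 14743.9  3183.8    6   1.3   6";
        String noMatch     = "root: da2  75.6  49.7  4743.9  3183.8    6   1.3   6";

        System.out.println(firstMatch(threeDigits)); // Optional[149]: the wanted field
        System.out.println(firstMatch(twoDigits));   // Optional[14743]: a later field
        System.out.println(firstMatch(noMatch));     // Optional.empty: nothing at all
    }
}
```

The second line shows the "number from further into the string" behaviour: the two spaces in front of 49 defeat the single \s, so the first place the lookbehind succeeds is the next field that happens to have only one space in front of it.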
I altered my regex like so: ((?<=\d\.\d\s{1,3})\d+(?=\.)), adding the
{1,3} in the lookbehind to account for up to three whitespace characters.
(Thankfully that doesn't raise Java's exception about lookbehinds without
an obvious maximum length; that only applies to open-ended quantifiers
such as '+'.) Like magic, the extractor is now working beautifully. As I
look back on this, I have
noticed that if I do a CSV extraction of the data, the whitespace characters
are preserved, and if I view the page source of a graylog2 search HTML
page, all of the whitespace characters are there. So the confusion here
was caused by the fact that when a browser renders HTML, consecutive
spaces are collapsed into one unless they are explicitly coded not to be,
with something like a nonbreaking space (i.e. &nbsp;).
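Here is a small Java sketch of the bounded lookbehind, together with a rough imitation of what the browser does to the message. The raw line is a hypothetical reconstruction (I chose the padding myself), and asRendered only approximates HTML whitespace collapsing; the class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PaddedLookbehindDemo {
    // The adjusted pattern from this post: one to three whitespace
    // characters are allowed between the previous field and the digits.
    static final Pattern PADDED =
            Pattern.compile("((?<=\\d\\.\\d\\s{1,3})\\d+(?=\\.))");

    // Returns the first match in the line, or null if nothing matches.
    static String extract(String line) {
        Matcher m = PADDED.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Approximates browser rendering by showing each run of
    // whitespace as a single space.
    static String asRendered(String raw) {
        return raw.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Hypothetical raw message with alignment padding preserved.
        String raw = "root: da2  75.6  49.7  4743.9  3183.8    6   1.3   6";

        System.out.println(extract(raw));    // 49
        System.out.println(asRendered(raw)); // root: da2 75.6 49.7 4743.9 3183.8 6 1.3 6
    }
}
```

The rendered form is what I was looking at in the search results, which is why the successful and failed messages looked identical in front of the field.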
Lennart, I don't know whether it is worth it, and I'm aware that this may
be an edge case, but it might help some users to show some sort of UI
indication on the extractor creation page when a message contains
consecutive whitespace characters.
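Just to illustrate what I mean, the check itself could be as simple as the following Java sketch. This is not graylog2 code; the class and method names are hypothetical, and how the result would be surfaced in the UI is left open.

```java
import java.util.regex.Pattern;

class WhitespaceRunCheck {
    // Two or more whitespace characters in a row.
    private static final Pattern RUN = Pattern.compile("\\s{2,}");

    // True if the message contains a run of whitespace that a browser
    // would display as a single space.
    static boolean hasConsecutiveWhitespace(String message) {
        return RUN.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(hasConsecutiveWhitespace("root: da2  75.6  49.7")); // true
        System.out.println(hasConsecutiveWhitespace("root: da2 75.6 49.7"));   // false
    }
}
```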
On Wednesday, May 7, 2014 10:03:51 AM UTC-6, Jarred Masterson wrote:
>
> To confess upfront, I am a noob with RegEx but I've made some decent
> progress in the past few days. I have a couple of extractors working well
> but I'm running into an issue with one that seems like it should work.
>
> First here is an example line that I am matching against:
> root: da2 75.6 49.7 4743.9 3183.8 6 1.3 6
>
> This is output from FreeBSD iostat -x and I have working extractors for
> the device name and the first numbered field which is read operations. I'm
> on 0.20.1 and I had to pull the digits prior to the decimal place due to
> the number converter not dealing with floating point numbers. I see from
> the github commits that this has been fixed in 20.2!
>
> I am trying now to pull the second metric which is the write operations
> per second and in this case is 75.6.
>
> It seems like this should work:
> (?<=\d+\b)\d+(?=\.)
>
> I've also tried moving the \b around, such as (?<=\d+)\b\d+(?=\.). I am
> also a little confused about whether I need to enclose the whole
> thing in parentheses. My working extractors are enclosed in (), but I get
> errors when trying that with the above example.
>