Hey Jarred,

thanks for your investigation, and good that it works for you now! You
are totally right, and I have opened an issue for that:
https://github.com/Graylog2/graylog2-web-interface/issues/778

Thanks again,
Lennart

On Fri, May 9, 2014 at 5:41 PM, Jarred Masterson
<[email protected]> wrote:
> Success!  I have at last sorted this out.  As it turns out, the issue was
> non-obvious because of the way browsers render consecutive whitespace
> characters in HTML.
>
> I stumbled upon the answer when I decided to take another run at writing a
> working regex for these messages.  I brought up the extractor creation page
> and pasted in the ((?<=\d\.\d\s)\d+(?=\.)) regex as I knew it was very close
> to working and had confirmed that it did in fact work on other testing
> platforms.  To my shock, the "Try!" immediately produced the correct result!
> I was overjoyed, but a little perplexed since I had tried that exact regex
> before.  I filled out the remaining fields and saved the extractor.  As
> messages began processing I immediately noticed that only about 10% of the
> messages that should have been matched were producing results.  To add to
> the confusion, in some instances the extractor was pulling a number from
> further into the string, skipping the one that I wanted.
>
> I began to analyze what the successfully extracted messages had in common
> that the failed ones lacked.  I noticed that every extraction that succeeded
> had three or more digits in the resulting field even though when looking at
> the displayed message there was no discernible difference in the preceding
> characters that were being matched against in the lookbehind.  I began
> pondering the structure of the messages themselves and decided to look at
> the output of the command that produced them directly.  The messages
> actually come from a FreeBSD system cron job that runs "iostat -x" and then
> pipes the output to logger.  When I ran the command directly on the FreeBSD
> console I realized where the problem lay.  The command output is formatted
> for human readability and the component statistics are displayed so that the
> decimal places always align on the screen.  Basically, if there are fewer
> than three digits, there are multiple spaces placed in front of the number.
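
The alignment effect described above can be reproduced with a minimal Java sketch. The class name `PaddedColumns`, the sample values, and the column width of 6 are assumptions made for illustration; they are not taken from the iostat source. The sketch shows why the single-whitespace lookbehind only catches numbers with three or more digits and otherwise skips ahead to a later field.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PaddedColumns {
    // The single-whitespace lookbehind from the first attempt.
    static final Pattern SINGLE_SPACE =
            Pattern.compile("((?<=\\d\\.\\d\\s)\\d+(?=\\.))");

    // Right-align a value in a fixed-width column so the decimal points
    // line up (width 6 is an assumption for this sketch).
    static String column(double value) {
        return String.format(Locale.ROOT, "%6.1f", value);
    }

    // Return the first match of the pattern, or null if nothing matches.
    static String firstMatch(String line) {
        Matcher m = SINGLE_SPACE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Three-digit second field: exactly one space precedes it.
        String wide = "da2" + column(75.6) + column(149.7) + column(183.8);
        // Two-digit second field: two spaces precede it.
        String narrow = "da2" + column(75.6) + column(49.7) + column(183.8);

        System.out.println("[" + wide + "] -> " + firstMatch(wide));     // 149
        System.out.println("[" + narrow + "] -> " + firstMatch(narrow)); // 183
    }
}
```

In the second line the pattern passes over 49 and returns 183 from the next column, which is the "number from further into the string" behaviour described earlier.
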
>
> I altered my regex like so:  ((?<=\d\.\d\s{1,3})\d+(?=\.))   Adding the
> {1,3} in the lookbehind accounts for possible multiple whitespace
> characters.  (Thankfully that doesn't raise Java's exception about
> lookbehinds of arbitrary length; that only applies to open-ended
> quantifiers such as '+'.)  Like magic, the extractor is now working
> beautifully.  As I look back on this, I have noticed that if I do a CSV
> extraction of the data the whitespace characters are preserved, and if I
> view the page source on a graylog2 search HTML page, all of the whitespace
> characters are there.  So the confusion here was caused by the fact that
> when the browser renders HTML, consecutive spaces are collapsed into one
> unless they are explicitly marked up not to be, for example with a
> non-breaking space (&nbsp;).
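
As a rough illustration of the fix and of the display problem, the following Java sketch applies the bounded lookbehind to a padded line and compares the raw text with a collapsed version. The class name `BoundedLookbehind`, the helper `asDisplayed`, and the sample line are hypothetical; `asDisplayed` is only a crude stand-in for how a browser shows runs of spaces, not a model of the HTML whitespace rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehind {
    // Lookbehind with a bounded repetition, as in the message above.
    static final Pattern PADDED =
            Pattern.compile("((?<=\\d\\.\\d\\s{1,3})\\d+(?=\\.))");

    // Return the first match of the pattern, or null if nothing matches.
    static String firstMatch(String line) {
        Matcher m = PADDED.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Crude stand-in for browser rendering: show each run of spaces
    // as a single space.
    static String asDisplayed(String raw) {
        return raw.replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        // Made-up sample line with alignment padding.
        String raw = "root: da2  75.6  49.7 4743.9";

        System.out.println(asDisplayed(raw)); // root: da2 75.6 49.7 4743.9
        System.out.println(firstMatch(raw));  // 49
    }
}
```

The pattern has to be written against the raw string, not the collapsed one that the web page shows.
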
>
> Lennart, I don't know if it is worth it, and I'm aware that this may be an
> edge case, but it might help some users to show some sort of UI
> indication on the extractor creation page when a message contains
> consecutive whitespace characters.
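
One possible shape for such a hint, sketched in Java under stated assumptions: the class `WhitespaceHint` and both of its methods are hypothetical and are not part of Graylog. The idea is simply to detect runs of two or more whitespace characters and to offer a view of the message in which every space is replaced by a visible marker.

```java
import java.util.regex.Pattern;

class WhitespaceHint {
    // Two or more whitespace characters in a row.
    private static final Pattern RUN = Pattern.compile("\\s{2,}");

    // True if the message contains consecutive whitespace that a browser
    // would display as a single space.
    static boolean hasConsecutiveWhitespace(String message) {
        return RUN.matcher(message).find();
    }

    // Replace every space with a middle dot (U+00B7) so padding is visible.
    static String visualize(String message) {
        return message.replace(' ', '\u00B7');
    }

    public static void main(String[] args) {
        String message = "da2  75.6  49.7"; // made-up sample with padding
        if (hasConsecutiveWhitespace(message)) {
            System.out.println("Note: message contains consecutive whitespace");
            System.out.println(visualize(message));
        }
    }
}
```
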
>
>
>
> On Wednesday, May 7, 2014 10:03:51 AM UTC-6, Jarred Masterson wrote:
>>
>> To confess upfront, I am a noob with RegEx but I've made some decent
>> progress in the past few days.  I have a couple of extractors working well
>> but I'm running into an issue with one that seems like it should work.
>>
>> First here is an example line that I am matching against:
>> root: da2 75.6 49.7 4743.9 3183.8 6 1.3 6
>>
>> This is output from FreeBSD iostat -x and I have working extractors for
>> the device name and the first numbered field which is read operations.  I'm
>> on 0.20.1 and I had to pull the digits prior to the decimal point because
>> the number converter does not handle floating-point numbers. I see from the
>> GitHub commits that this has been fixed in 0.20.2!
>>
>> I am now trying to pull the second metric, which is the write operations
>> per second and in this case is 49.7.
>>
>> It seems like this should work:
>> (?<=\d+\b)\d+(?=\.)
>>
>> I've also tried moving the \b around, such as (?<=\d+)\b\d+(?=\.)  I am
>> also a little confused about whether I need to enclose the whole
>> thing in parentheses. My working extractors are enclosed in () but I get
>> errors when trying that with the above example.
>
> --
> You received this message because you are subscribed to the Google Groups
> "graylog2" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
