Hey Jarred, thanks for your investigation, and good to hear that it works for you now! You are totally right, and I opened an issue for that: https://github.com/Graylog2/graylog2-web-interface/issues/778
Thanks again,
Lennart

On Fri, May 9, 2014 at 5:41 PM, Jarred Masterson <[email protected]> wrote:
> Success! I have at last sorted this out. As it turns out, the issue was non-obvious because of the way that HTML displays multiple whitespace characters.
>
> I came upon the answer when I decided to take another run at writing a working regex for these messages. I brought up the extractor creation page and pasted in the ((?<=\d\.\d\s)\d+(?=\.)) regex, as I knew it was very close to working and had confirmed that it did in fact work on other testing platforms. To my shock, "Try!" immediately produced the correct result! I was overjoyed, but a little perplexed, since I had tried that exact regex before. I filled out the remaining fields and saved the extractor. As messages began processing, I immediately noticed that only about 10% of the messages that should have matched were producing results. To add to the confusion, in some instances the extractor was pulling a number from further into the string, passing over the one that I wanted.
>
> I began to analyze what the successfully extracted messages had in common that the failed ones lacked. I noticed that every extraction that succeeded had three or more digits in the resulting field, even though the displayed messages showed no discernible difference in the preceding characters that the lookbehind was matching against. I began pondering the structure of the messages themselves and decided to look directly at the output of the command that produced them. The messages actually come from a FreeBSD system cron job that runs "iostat -x" and then pipes the output to logger. When I ran the command directly on the FreeBSD console, I realized where the problem lay. The command output is formatted for human readability, and the component statistics are displayed so that the decimal places always align on the screen.
> Basically, if there are fewer than three digits, there are multiple spaces placed in front of the number.
>
> I altered my regex like so: ((?<=\d\.\d\s{1,3})\d+(?=\.)), adding the {1,3} in the lookbehind to account for possible multiple whitespace characters. (Thankfully, that doesn't raise Java's exception about lookbehinds of arbitrary length; that only applies to open-ended quantifiers such as '+'.) Like magic, the extractor is now working beautifully. As I look back on this, I have noticed that if I do a CSV extraction of the data, the whitespace characters are preserved, and if I view the page source on a Graylog2 search HTML page, all of the whitespace characters are there. So the confusion here was caused by the fact that when the browser renders HTML, consecutive spaces are collapsed into one unless they are explicitly coded not to be, with something like a non-breaking space, i.e. &nbsp;.
>
> Lennart, I don't know whether it is worth it, and I'm aware that this may be an edge case, but it may be helpful to some to display some sort of UI indication on the extractor creation page when a message contains consecutive whitespace characters.
>
> On Wednesday, May 7, 2014 10:03:51 AM UTC-6, Jarred Masterson wrote:
>> To confess upfront, I am a noob with RegEx, but I've made some decent progress in the past few days. I have a couple of extractors working well, but I'm running into an issue with one that seems like it should work.
>>
>> First, here is an example line that I am matching against:
>> root: da2 75.6 49.7 4743.9 3183.8 6 1.3 6
>>
>> This is output from FreeBSD iostat -x, and I have working extractors for the device name and the first numbered field, which is read operations. I'm on 0.20.1 and I had to pull the digits prior to the decimal place because the number converter does not deal with floating point numbers. I see from the GitHub commits that this has been fixed in 0.20.2!
>>
>> I am trying now to pull the second metric, which is the write operations per second and in this case is 75.6.
>>
>> It seems like this should work:
>> (?<=\d+\b)\d+(?=\.)
>>
>> I've also tried moving the \b around, such as (?<=\d+)\b\d+(?=\.). I am also a little confused about whether I need to enclose the whole thing in parentheses. My working extractors are enclosed in (), but I get errors when trying that with the above example.
>
> --
> You received this message because you are subscribed to the Google Groups "graylog2" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
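
The whitespace effect described in this thread can be reproduced outside Graylog2. What follows is only a minimal sketch in Java, not anything taken from Graylog2 itself. The two regexes are the ones quoted in the thread. The "aligned" sample line uses made-up spacing, since the thread only shows the collapsed form of the message, and the class name, the firstMatch helper, and the hasConsecutiveWhitespace helper (a rough idea of the check a UI hint might perform) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindWhitespaceDemo {
    // Regex from the thread that expects exactly one whitespace character before the value.
    static final Pattern SINGLE_SPACE = Pattern.compile("((?<=\\d\\.\\d\\s)\\d+(?=\\.))");

    // Revised regex from the thread: a bounded {1,3} keeps the lookbehind length finite.
    static final Pattern UP_TO_THREE = Pattern.compile("((?<=\\d\\.\\d\\s{1,3})\\d+(?=\\.))");

    // Hypothetical check for runs of two or more whitespace characters in a message.
    static final Pattern CONSECUTIVE_WS = Pattern.compile("\\s{2,}");

    // Returns the first capture-group match, or null when the pattern does not match.
    static String firstMatch(Pattern pattern, String message) {
        Matcher m = pattern.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    // True when the message contains whitespace that a browser would collapse.
    static boolean hasConsecutiveWhitespace(String message) {
        return CONSECUTIVE_WS.matcher(message).find();
    }

    public static void main(String[] args) {
        // The message as the browser displays it (runs of spaces collapsed to one).
        String collapsed = "root: da2 75.6 49.7 4743.9 3183.8 6 1.3 6";
        // Assumed raw form with column padding; the exact spacing is invented for illustration.
        String aligned = "root: da2   75.6   49.7 4743.9 3183.8    6   1.3   6";

        System.out.println("single \\s on collapsed: " + firstMatch(SINGLE_SPACE, collapsed));
        System.out.println("single \\s on aligned:   " + firstMatch(SINGLE_SPACE, aligned));
        System.out.println("\\s{1,3} on aligned:     " + firstMatch(UP_TO_THREE, aligned));
        System.out.println("aligned has runs of whitespace: " + hasConsecutiveWhitespace(aligned));
    }
}
```

With this invented spacing, the single-\s pattern skips the padded value and returns 4743 from further along the line, while the {1,3} version returns 49 for both the collapsed and the padded form. That mirrors the symptom reported above, where only values wide enough to have a single leading space were extracted.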
