Hi David.
 
This idea would be quite useful for my needs. Until now I have relied on
graphics and interpreted the plots by hand, which is the work the regex
would automate for me. I don't have any useful bits of code to provide at
the moment (your first two examples captured the majority of my uses).

 
I do have a few questions though.
 
In the first example:
 
Instead of <0> being just 0 crossing, could it be extended to mean the
value for which you are searching? Kind of making it like a context
grep.
 
In the slope example, what is the purpose of the S attribute along with
the {1,3}? Is the S an operator that calculates the slope for the data
points in the {range}? Will there be a minimum and maximum to the number
of points in the {range} that can be used for the slope calculation?
Could the range operator also be used to capture a set of points around
an <X> crossing, rather than only a zero crossing?
 
In the second example:
 
What is the purpose of the + in the example for selecting the peak?
($peak, $left_of_peak, $max, $right_of_peak) = $fft_of_data =~ n/ ((
[@0.1,]+) (MM) ([@0.1,]+)) /;
I don't know standard Bracket Notation, but it seems that the "["
indicates the inclusive side of the < or > part of the range. However,
when I look at the examples I don't see the same nomenclature. Am I
missing it?
 
In the "skew example" I interpreted $peak to be a piddle, $left_of_peak
to be a single value that is 0.1 standard deviations away. But then
later you used $left_of_peak->dim(0) seemingly to capture the first data
item. So the [@s,] returns a piddle of values from the S value to the
peak - correct? I don't understand the reasoning for the dim(0)*2
portion for the skew calculation either.
 
Thanks,
Cliff Sobchuk esn 361-8169, 403-262-4010 ext: 361-8169
Fax: 403-262-4010 ext: 361-8170
Nortel Core RF Field Support: All information is Nortel confidential.
 

________________________________

From: David Mertens [mailto:[email protected]] 
Sent: August 6, 2009 1:16 PM
To: perldl
Subject: [Perldl] Numerical Regexp: Call for Ideas and Use Cases


Hello everybody -

Back around June 18 I wrote the list with an idea of creating numerical
regular expressions.  It was entitled 'Slicing with Vector Regexp'.
Hugh got back to me with some ideas, which I greatly appreciated.  Since
that time I've boned up on my perl quite a bit and I'm almost in a
position to start working seriously on the idea.  I have many ideas for
how this might work, and I'm sure that some of them are pretty good, but
I feel like the sorts of uses I have for this tool are limited to the
sorts of problems that I work on, mostly 1-d time series.  The best use
I've imagined would be identifying the neighborhoods of peaks in a FFT'd
signal.  (I'd love to get some ideas for useful regex ideas for higher
dimensional data, if you have any.)

Here are some prompts to get you thinking: Have you ever found yourself
trying to identify a slice that is conceptually easy to describe but
programmatically difficult or annoying to encode?  Have you ever had to
wade through 200 plots of data in order to find the one or two
interesting ones?  What were you looking for?  Suppose you could have
written a pattern that weeded out all but 20 of those.  What would that
pattern look like?

I would appreciate any feedback, from a plain description of your
concept to a commented sample code of how you would like to use such
regular expressions.  I've included a few examples to show you how I
think and hopefully get you thinking in your problem domain.  Use
whatever syntax or symbols seem to make sense and be sure to explain
what they mean.  Feel free to define a set of symbols to denote
numerical ranges or short-hand symbols for useful numbers (like the mean
of the data set, or the standard deviation).  If you want to get some
more ideas for notation, check out the mail archives from around Jun 18.

My next steps are (1) assembling a number of ideas and use cases, (2)
hammering out something of a specification, (3) writing up tests, and
(4) implementing it.

Thanks.
David


### Notation to use to get things going ###
A numerical regex will be denoted n// and will function similarly to
m//.  Flags I've thought of could include:


*       /g for globally find all matches, identical to m//g (including
progressive matching and list-context sensitivity) 
*       /n for number of successful matches (which should be optimized
compared to /g) 
*       /i to populate the $1, $2, etc with piddles containing the index
of the matches instead of the matches themselves
        

Feel free to add flags and notation as you see fit.


### Example Idea 1 ###
# periodic_data is a piddle (filled with periodic data, obviously).
# This example is pretty tame, and most of these expressions could
# be encoded with some simple for-loops in C.  The nice thing about
# these is that they're compact.  Also, the quantifiers in the last
# match would be annoying to encode in a C-loop.

# First I'll count the number of actual zero values.  I use standard
# math notation, where (1,5] means 1 < x <= 5.  (This does not
# conflict with captures because any interval must have a comma but
# captures will never have a bare comma.)  As far as I'm
# concerned, any matches for exactly zero is worth a warning, since
# the chances of any data being EXACTLY zero are vanishingly
# small:

warn("Found (at least) one zero value.  That seems suspicious, but I will continue...")
  if($periodic_data =~ n/[0,0]/n > 0);

# The data is periodic however, and I expect around 280 zero-
# crossings if it's smooth and the equipment is working.  While
# value ranges are denoted with parentheses/bracket notation,
# angle brackets denote value crossings.  They're sorta like
# zero-width assertions:

$num_of_zero_crossings = $periodic_data =~ n/<0>/n;
warn ("I encountered $num_of_zero_crossings zero-crossings, but I expected around 280.  Are you sure the data is good?")
    if ($num_of_zero_crossings < 200 or $num_of_zero_crossings > 350);
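
Until n// exists, the crossing count can be approximated with an ordinary loop. What follows is only a minimal sketch in plain Perl, using array refs instead of piddles. The helper name count_crossings and its $level argument are hypothetical; $level stands in for the X of a general <X> crossing, with 0 giving the zero-crossing case above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDL): count how often the series
# crosses $level.  A crossing is counted whenever two successive samples
# lie strictly on opposite sides of $level; samples exactly equal to
# $level are skipped.
sub count_crossings {
    my ($data, $level) = @_;
    my ($count, $prev_sign) = (0, 0);
    for my $x (@$data) {
        my $sign = $x <=> $level;      # -1 below, 0 equal, +1 above
        next if $sign == 0;
        $count++ if $prev_sign && $sign != $prev_sign;
        $prev_sign = $sign;
    }
    return $count;
}

# Made-up periodic test data.
my @periodic_data = map { sin(0.5 * $_ + 0.1) } 0 .. 99;
my $num_of_zero_crossings = count_crossings(\@periodic_data, 0);
print "zero crossings: $num_of_zero_crossings\n";
```

Passing a different $level gives the "context grep" flavour of searching for an arbitrary value rather than zero.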

# I next use a progressive numerical regexp match.  The scalar/list
# contexts affect the g flag for n// as they affect the g flag for m//.

# This goes through the data and looks for zero-crossings (denoted by <0>)
# and gathers between one and three points (standard quantifier notation)
# with positive slopes (denoted S) on either side of the crossing.

# The match is captured using the parens, making the result available in
# $1, a piddle.

while ( $periodic_data =~ n/\G.*? (S{1,3} <0> S{1,3}) /g ) {
  $pos_cross = $1;
  # process positively-sloped zero-crossing
}
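
One possible reading of S{1,3} <0> S{1,3} can be written out by hand. The sketch below is plain Perl under stated assumptions: S is taken to mean "a sample on a rising run", the quantifier is treated as greedy, and the names rising_crossings, $level and $max_pts are hypothetical. It is not a definition of how n// would behave.

```perl
use strict;
use warnings;

# Hypothetical helper: for every upward crossing of $level, return
# [\@before, \@after], holding between 1 and $max_pts rising samples
# on each side of the crossing.
sub rising_crossings {
    my ($data, $level, $max_pts) = @_;
    my @matches;
    for my $i (1 .. $#$data) {
        next unless $data->[$i - 1] < $level && $data->[$i] > $level;

        # Walk left from the crossing while the data keeps rising.
        my $lo = $i - 1;
        $lo-- while $lo > 0
                 && $i - $lo < $max_pts
                 && $data->[$lo - 1] < $data->[$lo];

        # Walk right from the crossing while the data keeps rising.
        my $hi = $i;
        $hi++ while $hi < $#$data
                 && $hi - $i + 1 < $max_pts
                 && $data->[$hi + 1] > $data->[$hi];

        push @matches, [ [ @$data[$lo .. $i - 1] ], [ @$data[$i .. $hi] ] ];
    }
    return @matches;
}

# Made-up data with two upward zero-crossings.
my @data = (-3, -2, -1, 1, 2, 0.5, -1, -0.5, 0.5);
my @found = rising_crossings(\@data, 0, 3);
for my $m (@found) {
    my ($before, $after) = @$m;
    print "before: @$before | after: @$after\n";
}
```

Under this reading {1,3} only bounds how many neighbouring points are captured; the slope condition itself is the comparison between adjacent samples.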


### Example Idea 2 ###
# I took the FFT of a real signal and computed its magnitude.
# It should have peaks near the strongest frequencies.  In this case,
# I'm going to extract the data from the highest peak.  Then I'll check
# if it's got multiple peaks or if it's just a single peak.

# I will use M to denote a local maximum, m to denote a local minimum.
# Since it is mathematically impossible to have two local minima or local
# maxima in a row, MM will denote the global maximum and mm will
# denote the global minimum of the set.

# I will use @0 to denote the mean of the set, @1 to denote mean + 1 std.
# dev., -@1 to denote mean - 1 std. dev., etc.  I will use standard bracket
# notation, where [4, 7) <-> 4 <= x < 7.  Missing numbers are replaced
# with inf or -inf, as makes sense.  Thus [@2,] means 'anything greater than
# or equal to mean + 2 std. dev.'

# extract the main peak (Whether to use @0, @0.1, @0.5, or @1 is a question
# of how sharp the peak is and how smooth your 'noise' is.)
($peak, $left_of_peak, $max, $right_of_peak) =
    $fft_of_data =~ n/ (([@0.1,]+) (MM) ([@0.1,]+)) /;

# Check if the peak is skewed:
if(    $left_of_peak->dim(0) * 2 < $right_of_peak->dim(0)
    or $right_of_peak->dim(0) * 2 < $left_of_peak->dim(0)) {
  warn "Skewed primary peak\n";
}
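
The peak extraction and skew test can be mimicked without n//. This is a rough sketch in plain Perl, assuming [@0.1,]+ means "a contiguous run of samples at or above mean + 0.1 std. dev." and using the population standard deviation. The helper name main_peak and the $n_std argument are hypothetical. Unlike the + quantifier, which needs at least one point on each side, this sketch simply returns an empty side when none qualifies. The element counts of the two sides play the role of dim(0) in the skew check.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: return (\@left, $max, \@right) for the global
# maximum, where left/right are the contiguous runs of samples next to
# the maximum that stay at or above mean + $n_std * std. dev.
sub main_peak {
    my ($data, $n_std) = @_;
    my $n    = @$data;
    my $mean = sum(@$data) / $n;
    my $std  = sqrt( sum( map { ($_ - $mean) ** 2 } @$data ) / $n );
    my $threshold = $mean + $n_std * $std;

    # Index of the global maximum (first occurrence on ties).
    my $imax = 0;
    for my $i (1 .. $n - 1) {
        $imax = $i if $data->[$i] > $data->[$imax];
    }

    # Grow the window outwards while neighbours stay above the threshold.
    my ($lo, $hi) = ($imax, $imax);
    $lo-- while $lo > 0      && $data->[$lo - 1] >= $threshold;
    $hi++ while $hi < $n - 1 && $data->[$hi + 1] >= $threshold;

    return ( [ @$data[$lo .. $imax - 1] ],
             $data->[$imax],
             [ @$data[$imax + 1 .. $hi] ] );
}

# Made-up magnitude spectrum with one lopsided peak.
my @fft_mag = (1, 1, 2, 1, 6, 9, 8, 7, 6, 5, 1, 2, 1);
my ($left_of_peak, $max, $right_of_peak) = main_peak(\@fft_mag, 0.1);

# Skewed if one side holds more than twice as many points as the other.
my $skewed = (@$left_of_peak * 2 < @$right_of_peak)
          || (@$right_of_peak * 2 < @$left_of_peak);
print "Skewed primary peak\n" if $skewed;
```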

# Check for monotonicity:
if ($left_of_peak =~ n/Mm/) {
  warn "Left side of the primary peak is not monotonic\n";
}
if ($right_of_peak =~ n/mM/) {
  warn "Right side of the primary peak is not monotonic\n";
}
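
For comparison, the same monotonicity test written as an explicit loop. This is a sketch only, in plain Perl with array refs; is_monotonic and its $dir argument are hypothetical names, and flat stretches are treated as acceptable, which the Mm/mM patterns may or may not do.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the list never decreases ($dir = +1)
# or never increases ($dir = -1) from one sample to the next.
sub is_monotonic {
    my ($vals, $dir) = @_;
    for my $i (1 .. $#$vals) {
        return 0 if ($vals->[$i] - $vals->[$i - 1]) * $dir < 0;
    }
    return 1;
}

# Made-up slices on either side of a peak.
my @left  = (2, 3, 5, 4, 8);   # rises, dips, rises again
my @right = (8, 7, 6, 5);      # falls steadily

print "Left side of the primary peak is not monotonic\n"
    unless is_monotonic(\@left, +1);
print "Right side of the primary peak is not monotonic\n"
    unless is_monotonic(\@right, -1);
```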

# Check for double peak; ^ and $ have usual regex meanings.
# The left-hand data should be increasing up to the peak
# and the right-hand data should be decreasing just after the
# peak.  This means that these slices' maxima should occur
# right at the edge.  If this is not the case, we have a double-peak.

if($left_of_peak !~ n/MM$/ or $right_of_peak !~ n/^MM/) {
  warn "Double peak (or worse)!\n";
}


### Example Idea 3 ###
It would be nice to have a symbol for 'bad value', perhaps b (B for 'not
bad value'), so that you could identify the neighborhood around the bad
value using something like

($neighborhood) = $data_with_bv =~ n/(B{1,10}b+B{1,10})/;
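
A hand-written equivalent might look roughly like the following. It is a sketch in plain Perl that assumes bad values are represented by undef rather than by PDL's bad-value mechanism; bad_neighborhoods and $pad are hypothetical names. It differs from the pattern in two ways: it allows zero good values on a side (B{1,10} requires at least one), and neighbouring ranges may overlap.

```perl
use strict;
use warnings;

# Hypothetical helper: for each run of bad (undef) samples, return
# [$start, $end], the index range covering the run plus up to $pad
# good samples on each side.
sub bad_neighborhoods {
    my ($data, $pad) = @_;
    my @ranges;
    my $i = 0;
    while ($i < @$data) {
        if (defined $data->[$i]) { $i++; next; }

        # Extend over the whole run of bad values (the b+ part).
        my $j = $i;
        $j++ while $j < $#$data && !defined $data->[$j + 1];

        # Add up to $pad good values on each side (the B{1,10} parts).
        my ($lo, $hi) = ($i, $j);
        $lo-- while $lo > 0
                 && $i - $lo < $pad
                 && defined $data->[$lo - 1];
        $hi++ while $hi < $#$data
                 && $hi - $j < $pad
                 && defined $data->[$hi + 1];

        push @ranges, [$lo, $hi];
        $i = $j + 1;
    }
    return @ranges;
}

# Made-up data with two runs of bad values.
my @data_with_bv = (1, 2, 3, undef, undef, 4, 5, 6, 7, undef, 8);
my @neighborhoods = bad_neighborhoods(\@data_with_bv, 2);
for my $r (@neighborhoods) {
    my ($lo, $hi) = @$r;
    print "neighborhood spans indices $lo..$hi\n";
}
```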

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
