Hello everybody -
Back around June 18 I wrote the list with an idea of creating numerical
regular expressions. It was entitled 'Slicing with Vector Regexp'. Hugh
got back to me with some ideas, which I greatly appreciated. Since that
time I've boned up on my Perl quite a bit and I'm almost in a position to
start working seriously on the idea. I have many ideas for how this might
work, and I'm sure that some of them are pretty good, but I feel like the
sorts of uses I have for this tool are limited to the sorts of problems that
I work on, mostly 1-d time series. The best use I've imagined would be
identifying the neighborhoods of peaks in an FFT'd signal. (I'd love to get
some ideas for useful regexes on higher-dimensional data, if you have
any.)
Here are some prompts to get you thinking: Have you ever found yourself
trying to identify a slice that is conceptually easy to describe but
programmatically difficult or annoying to encode? Have you ever had to wade
through 200 plots of data in order to find the one or two interesting ones?
What were you looking for? Suppose you could have written a pattern that
weeded out all but 20 of those. What would that pattern look like?
I would appreciate any feedback, from a plain description of your concept to
a commented sample code of how you would like to use such regular
expressions. I've included a few examples to show you how I think and
hopefully get you thinking in your problem domain. Use whatever syntax or
symbols seem to make sense and be sure to explain what they mean. Feel free
to define a set of symbols to denote numerical ranges or short-hand symbols
for useful numbers (like the mean of the data set, or the standard
deviation). If you want to get some more ideas for notation, check out the
mail archives from around Jun 18.
My next steps are (1) assembling a number of ideas and use cases, (2)
hammering out something of a specification, (3) writing up tests, and (4)
implementing it.
Thanks.
David
*### Notation to use to get things going ###*
A numerical regex will be denoted n// and will function similarly to m//.
Flags I've thought of could include:
- /g for *g*lobally find all matches, identical to m//g (including
progressive matching and list-context sensitivity)
- /n for *n*umber of successful matches (which should be optimized
compared to /g)
- /i to populate the $1, $2, etc with piddles containing the *i*ndex of
the matches instead of the matches themselves
Feel free to add flags and notation as you see fit.
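
Since the proposed flags lean on how m//g already behaves, here is a short reminder of that behavior using ordinary string regexes. Nothing in it is n// itself, which does not exist yet. The comparisons to /n and /i in the comments are only my reading of the proposal.

```perl
use strict;
use warnings;

my $text = "a1b22c333";

# List context: m//g returns every match at once.
my @all = $text =~ /\d+/g;

# Counting matches, roughly what the proposed /n flag would return.
my $count = () = $text =~ /\d+/g;

# Scalar context: m//g matches progressively, resuming where the
# previous match ended; pos() reports that position.
my @ends;
while ($text =~ /\d+/g) {
    push @ends, pos($text);
}

# The proposed /i flag would hand back indices instead of values.
# For strings, the built-in @- array already gives start offsets.
my @starts;
while ($text =~ /\d+/g) {
    push @starts, $-[0];
}

print "matches: @all\n";
print "count:   $count\n";
print "ends:    @ends\n";
print "starts:  @starts\n";
```

A numerical regex would presumably return piddles (or piddles of indices) where the string version returns substrings and offsets.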
*### Example Idea 1 ###*
# periodic_data is a piddle (filled with periodic data, obviously).
# This example is pretty tame, and most of these expressions could
# be encoded with some simple for-loops in C. The nice thing about
# these is that they're compact. Also, the quantifiers in the last match
# would be annoying to encode in a C-loop.
# First I'll count the number of actual zero values. I use standard
# math notation, where (1,5] means 1 < x <= 5. (This does not
# conflict with captures because any interval must have a comma but
# captures will never have a bare comma.) As far as I'm
# concerned, any match for exactly zero is worth a warning, since
# the chances of any data being EXACTLY zero are vanishingly
# small:
warn("Found (at least) one zero value. That seems suspicious, but I will "
    . "continue...")
    if($periodic_data =~ n/[0,0]/n > 0);
# The data is periodic however, and I expect around 280 zero-
# crossings if it's smooth and the equipment is working. While
# value ranges are denoted with parenthesis/bracket notation,
# angle brackets denote value crossings. They're sorta like
# zero-width assertions:
$num_of_zero_crossings = $periodic_data =~ n/<0>/n;
warn ("I encountered $num_of_zero_crossings zero-crossings, but I expected "
    . "around 280. Are you sure the data is good?")
    if ($num_of_zero_crossings < 200 or $num_of_zero_crossings > 350);
# I next use a progressive numerical regexp match. The scalar/list
# contexts affect the g flag for n// as they affect the g flag for m//.
# This goes through the data and looks for zero-crossings (denoted by <0>)
# and gathers between one and three points (standard quantifier notation)
# with positive slopes (denoted S) on either side of the crossing.
# The match is captured using the parens, making the result available in $1,
# a piddle.
while ( $periodic_data =~ n/\G.*? (S{1,3} <0> S{1,3}) /g ) {
    $pos_cross = $1;
    # process positively-sloped zero-crossing
}
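
For comparison, here is roughly what the patterns in Example Idea 1 would have to do, written as a plain-Perl sketch on an ordinary array rather than a piddle. The helper names (zero_crossings, rising_crossings) and the sample data are made up for illustration. I'm also assuming that a crossing means a strict sign change between neighboring points, and that S{1,3} counts points on a rising stretch, which is only one possible reading.

```perl
use strict;
use warnings;

# Hypothetical helper: indices i where the sign flips between
# $data->[i] and $data->[i+1]. Exact zeros are not counted as crossings.
sub zero_crossings {
    my ($data) = @_;
    return grep { $data->[$_] * $data->[$_ + 1] < 0 } 0 .. $#$data - 1;
}

# Hypothetical helper: for every upward crossing, return [first, last]
# indices of a window holding up to $max rising points on each side.
sub rising_crossings {
    my ($data, $max) = @_;
    my @windows;
    for my $i (zero_crossings($data)) {
        next unless $data->[$i] < 0;            # upward crossings only
        my ($lo, $hi) = ($i, $i + 1);
        # Grow left while the data is still rising into the crossing.
        $lo-- while $lo > 0
            && $i - $lo + 1 < $max
            && $data->[$lo - 1] < $data->[$lo];
        # Grow right while the data keeps rising after the crossing.
        $hi++ while $hi < $#$data
            && $hi - $i < $max
            && $data->[$hi + 1] > $data->[$hi];
        push @windows, [$lo, $hi];
    }
    return @windows;
}

my @data      = (-3, -2, -1, 1, 2, 3, 1, -1, -3, -1, 2);
my $zeros     = grep { $_ == 0 } @data;         # like n/[0,0]/n
my @crossings = zero_crossings(\@data);         # like n/<0>/n
my @windows   = rising_crossings(\@data, 3);

print "exact zeros: $zeros\n";
print "crossings after indices: @crossings\n";
print "rising window: $_->[0]..$_->[1]\n" for @windows;
```

Even in this small form the quantifier logic takes two hand-written loops per crossing, which is the annoyance the compact notation is meant to remove.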
*### Example Idea 2 ###*
# I took the FFT of a real signal and computed its magnitude.
# It should have peaks near the strongest frequencies. In this case,
# I'm going to extract the data from the highest peak. Then I'll check
# if it's got multiple peaks or if it's just a single peak.
# I will use M to denote a local maximum, m to denote a local minimum.
# Since it is mathematically impossible to have two local minima or local
# maxima in a row, MM will denote the global maximum and mm will
# denote the global minimum of the set.
# I will use @0 to denote the mean of the set, @1 to denote mean + 1 std.
# dev., -...@1 to denote mean - 1 std. dev., etc. I will use standard bracket
# notation, where [4, 7) <-> 4 <= x < 7. Missing numbers are replaced
# with inf or -inf, as makes sense. Thus [...@2,] means 'anything greater than
# or equal to mean + 2 std. dev.'
# extract the main peak (Whether to use @0, @0.1, @0.5, or @1 is a question
# of how sharp the peak is and how smooth your 'noise' is.)
($peak, $left_of_peak, $max, $right_of_peak) = $fft_of_data =~ n/ ((
[[email protected],]+) (MM) ([[email protected],]+)) /;
# Check if the peak is skewed:
if( $left_of_peak->dim(0) * 2 < $right_of_peak->dim(0)
or $right_of_peak->dim(0) * 2 < $left_of_peak->dim(0)) {
warn "Skewed primary peak\n";
}
# Check for monotonicity:
if ($left_of_peak =~ n/Mm/) {
warn "Left side of the primary peak is not monotonic\n";
}
if ($right_of_peak =~ n/mM/) {
warn "Right side of the primary peak is not monotonic\n";
}
# Check for double peak; ^ and $ have usual regex meanings.
# The left-hand data should be increasing up to the peak
# and the right-hand data should be decreasing just after the
# peak. This means that these slices' maxima should occur
# right at the edge. If this is not the case, we have a double-peak.
if($left_of_peak !~ n/MM$/ or $right_of_peak !~ n/^MM/) {
warn "Double peak (or worse)!\n"
}
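
To make the intent of Example Idea 2 concrete, here is a plain-Perl sketch of the same checks on an ordinary array. The function names (threshold, main_peak, is_monotonic) and the sample spectrum are invented for illustration. I'm assuming the standard deviation is the population one and that the peak is the contiguous run around the global maximum that stays at or above the cutoff. Unlike the pattern, this sketch does not insist on at least one point on each side of the maximum.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Cutoff for the "mean + k std. dev." notation (population std. dev. assumed).
sub threshold {
    my ($data, $k) = @_;
    my $n    = @$data;
    my $mean = sum(@$data) / $n;
    my $var  = sum(map { ($_ - $mean) ** 2 } @$data) / $n;
    return $mean + $k * sqrt($var);
}

# Returns (first index, index of global max, last index) of the run of
# points around the global maximum that are >= the cutoff.
sub main_peak {
    my ($data, $k) = @_;
    my $cut = threshold($data, $k);
    my $max = 0;
    for my $i (1 .. $#$data) {
        $max = $i if $data->[$i] > $data->[$max];
    }
    my ($lo, $hi) = ($max, $max);
    $lo-- while $lo > 0       && $data->[$lo - 1] >= $cut;
    $hi++ while $hi < $#$data && $data->[$hi + 1] >= $cut;
    return ($lo, $max, $hi);
}

# True if the values never fall ($dir = 1) or never rise ($dir = -1).
sub is_monotonic {
    my ($dir, @v) = @_;
    for my $i (1 .. $#v) {
        return 0 if ($v[$i] - $v[$i - 1]) * $dir < 0;
    }
    return 1;
}

my @spectrum = (1, 1, 2, 1, 3, 8, 9, 20, 7, 2, 1, 1);
my ($lo, $max, $hi) = main_peak(\@spectrum, 0);
my @left  = @spectrum[$lo .. $max - 1];
my @right = @spectrum[$max + 1 .. $hi];

warn "Skewed primary peak\n"
    if @left * 2 < @right or @right * 2 < @left;
warn "Left side of the primary peak is not monotonic\n"
    unless is_monotonic(1, @left);
warn "Right side of the primary peak is not monotonic\n"
    unless is_monotonic(-1, @right);

print "peak spans $lo..$hi with maximum at $max\n";
```

With monotonic sides the double-peak test is automatically satisfied, since each side's largest value then sits next to the maximum.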
### Example Idea 3 ###
It would be nice to have a symbol for 'bad value', perhaps b (B for 'not bad
value'), so that you could identify the neighborhood around the bad value
using something like
($neighborhood) = $data_with_bv =~ n/(B{1,10}b+B{1,10})/;
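
As a rough illustration of what that pattern would select, here is a plain-Perl sketch that stands in undef for a bad value in an ordinary array. The name bad_neighborhoods and the sample data are hypothetical. Unlike the pattern, the sketch does not require at least one good value on each side, and neighborhoods of nearby bad runs may overlap.

```perl
use strict;
use warnings;

# Hypothetical helper: for each run of bad (undef) values, return
# [first, last] indices covering the run plus up to $pad good values
# on either side.
sub bad_neighborhoods {
    my ($data, $pad) = @_;
    my @out;
    my $i = 0;
    while ($i <= $#$data) {
        if (defined $data->[$i]) { $i++; next }
        my $first = $i;
        $i++ while $i <= $#$data && !defined $data->[$i];
        my $last = $i - 1;
        my ($lo, $hi) = ($first, $last);
        $lo-- while $lo > 0
            && $first - $lo < $pad
            && defined $data->[$lo - 1];
        $hi++ while $hi < $#$data
            && $hi - $last < $pad
            && defined $data->[$hi + 1];
        push @out, [$lo, $hi];
    }
    return @out;
}

my @data_with_bv = (1, 2, 3, undef, undef, 4, 5, 6, 7, undef, 8);
my @hoods = bad_neighborhoods(\@data_with_bv, 2);
print "neighborhood: $_->[0]..$_->[1]\n" for @hoods;
```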