Re: [Perldl] Numerical Regexp: Call for Ideas and Use Cases

David Mertens Wed, 19 Aug 2009 13:59:21 -0700

Hi Cliff, thanks for your questions.  Many of the ideas I'm throwing out 
there borrow heavily from standard Perl regexp ideas (and some math 
ideas, too), so if I use one that doesn't make sense, let me know.  
Also, I've put a bit more work into the notation and changed a few 
things, particularly the notation for mean and standard deviation.


 > Instead of <0> being just 0 crossing, could it be extended to mean 
the value for which you are searching? Kind of making it like a context 
grep.

I agree completely.  <NUMBER> should match the crossing of whatever number 
is specified within the angle brackets.  In regex terms, the <> notation 
is meant to work like a zero-width assertion, much like the \b assertion 
in Perl regexes for word-boundaries.  If you're searching for something 
that crosses 2, you could use <2>.   Ideally, you would be able to use 
any scalar expression inside the angle brackets, so you could write 
<$crossing_value> if you set that variable already.  I've given some 
thought to a two-argument form for this notation, but I don't have my 
notes with me and I'm fuzzy on the details at the moment.
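
As a rough illustration of the zero-width idea, here is a plain-Perl sketch 
of what <$crossing_value> might locate.  The numerical regex syntax itself 
is only a proposal, so the helper name crossings() and the rule that a point 
sitting exactly on the level does not count are assumptions made for this 
example.

```perl
use strict;
use warnings;

# Hypothetical helper: return each index $i where the data passes through
# $level between element $i-1 and element $i.  Like \b, the "match" sits
# between two elements and consumes neither of them.
sub crossings {
    my ($level, @data) = @_;
    my @found;
    for my $i (1 .. $#data) {
        my $before = $data[$i - 1] - $level;
        my $after  = $data[$i]     - $level;
        # Assumption: a crossing needs a strict sign change, so a point
        # lying exactly on $level is not treated as a crossing.
        push @found, $i if $before * $after < 0;
    }
    return @found;
}

my @data = (0, 1, 3, 4, 1, -2, 5);
my $crossing_value = 2;
my @at = crossings($crossing_value, @data);
print "crosses $crossing_value just before index: @at\n";
```

Because the level is an ordinary argument, any scalar expression can stand 
in for it, which is the behavior described for the angle brackets.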

 > In the slope example, what is the purpose of the S attribute along 
with the {1,3}. Is the S an operator such that it is calculating the 
slope for the data points in the {range}? Will there be a minimum and 
maximum to the number of points in the {range} that can be used for the 
slope calculation? Could the range operator also be used to capture a 
set of points around the <X> crossing - versus zero crossing.

S is not an operator or attribute.  S is a metacharacter (ALL characters 
are metacharacters in numerical regexes -- no need to escape them) and 
it stands for 'positively sloped number'.  So for example, S+ means 
'match one or more positively sloped numbers', and S* means 'match zero 
or more positively sloped numbers'.  Thus, S{1,3} means 'match between 
one and three positively sloped numbers'.  However, I am certainly open 
to different ideas.  If you think S{1,3} should mean "match a number 
whose slope is between 1 and 3", I'll consider it.  I've thought about 
that kind of thing before and come up with a solution of my own, which 
is to allow a regex to be applied to multiple datasets simultaneously.  
I'll explain that in more detail if you like.  Admittedly, I did escape 
one of the metacharacters in one of the examples, \G, but only to keep 
the notation consistent.  Sorry if this muddied the waters even 
more.
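
One way to picture S as a metacharacter is to label every data point and 
run an ordinary Perl regex over the labels.  This is only a sketch in core 
Perl, not the proposed implementation: the helper slope_string() is 
hypothetical, and it assumes a number is "positively sloped" when it is 
larger than the number before it.

```perl
use strict;
use warnings;

# Assumption: a number is "positively sloped" when it exceeds its
# predecessor.  The first element has no predecessor and is marked '.'.
sub slope_string {
    my @data = @_;
    my $encoded = @data ? '.' : '';
    for my $i (1 .. $#data) {
        $encoded .= $data[$i] > $data[$i - 1] ? 'S' : '.';
    }
    return $encoded;
}

my @data    = (5, 1, 2, 3, 2, 4, 5, 6, 7, 0);
my $encoded = slope_string(@data);    # one character per data point

# An ordinary regex on the label string plays the role of S{1,3}:
# each match covers between one and three positively sloped numbers.
my @runs;
while ($encoded =~ /S{1,3}/g) {
    my ($start, $end) = ($-[0], $+[0] - 1);   # offsets of this match
    push @runs, [ @data[$start .. $end] ];
}
print join(' ', @$_), "\n" for @runs;
```

The run of four rising values is split into a match of three followed by a 
match of one, since the quantifier is greedy but capped at three.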

 > What is the purpose of the + in the example for selecting the peak - 
($peak, $left_of_peak, $max, $right_of_peak) = $fft_of_data =~ n/ 
(([[email protected],]+) (MM) ([[email protected],]+)) /;

The + is a quantifier, just like in standard regexes: it means 'match one 
or more of the preceding item'.

 > I don't know standard Bracket Notation, but it seems that the "[" 
indicates the inclusive side of the < or > part of the range. However, 
when I look at the examples I don't see the same nomenclature. Am I 
missing it?

You're right about the interpretation of square brackets in this 
example.  I've decided to have two different uses for both parentheses 
and brackets, depending on context, and both based on standard usages in 
their context's fields.  First, the Bracket Notation from math uses 
parentheses and brackets to specify ranges, so that x is in [3,5) if 3 
<= x < 5.  If you change the opening bracket to a parenthesis, you 
replace the less-than-or-equal-to with just less-than, so (3,5) means 3 
< x < 5.  However, regexes use (matched) parentheses to indicate 
captures and matched brackets to indicate 'character classes'; both of 
these concepts have sensible analogs in numerical regexes.  The key 
disambiguation is the presence (or lack) of a comma.  Ranges must always 
have a comma in them.  Matched parentheses without a comma, such as the 
OUTER parentheses in ([[email protected],]+), are captures.
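
The comma rule can be sketched in core Perl as follows.  The helper 
range_test() is hypothetical, handles only plain numeric bounds, and 
treats an empty bound as "unbounded on that side"; all of that is an 
assumption for illustration rather than part of the proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: if $spec is a range such as '[3,5)', return a code
# reference that tests membership; if there is no comma, return undef
# because the group is then not a range.
sub range_test {
    my ($spec) = @_;
    my ($open, $low, $high, $close) =
        $spec =~ /^([\[(])\s*([^,\s]*)\s*,\s*([^,\s]*)\s*([\])])$/
        or return;    # no comma, so this is a capture or class instead

    return sub {
        my ($x) = @_;
        # A square bracket includes the endpoint; a parenthesis excludes it.
        return 0 if $low  ne '' && ($open  eq '[' ? $x <  $low  : $x <= $low);
        return 0 if $high ne '' && ($close eq ']' ? $x >  $high : $x >= $high);
        return 1;
    };
}

my $half_open = range_test('[3,5)');
my @kept = grep { $half_open->($_) } (2, 3, 4, 5);
print "in [3,5): @kept\n";    # 3 and 4

my $is_range = defined(range_test('(MM)')) ? 'a range' : 'not a range';
print "(MM) is $is_range\n";
```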

 > In the "skew example" I interpreted $peak to be a piddle, 
$left_of_peak to be a single value that is 0.1 standard deviations away. 
But then later you used $left_of_peak->dim(0) seemingly to capture the 
first data item. So the [...@s,] returns a piddle of values from the S 
value to the peak - correct? I don't understand the reasoning for the 
dim(0)*2 portion for the skew calculation either.

I guess the confusion here arises from the behavior of the captures.  
I'm going to assume you're familiar with captures in standard Perl 
regexes, so I won't explain them.  In numerical regexes as I envision 
them, all captures return piddles (just like all captures in Perl return 
strings).  They may be single-element piddles, but they are piddles.  
Thus ([[email protected],]+), which occurs twice, captures one or more numbers (the + 
is a quantifier) whose values are all greater than (mean + 0.1 * std 
dev.) and stores the result in $2 or $4 depending on which one you're 
talking about.  These are later stored in $left_of_peak and 
$right_of_peak, respectively.  The expression (MM) captures the global 
maximum and stores it in the single-element piddle $3, and later, $max.  
The operation is nearly identical to this string-parsing regex:
$name = "My name is David C. Mertens";
($full_name, $first_name, $MI, $last_name) = $name =~ 
/((\w+)\s(\w)\.\s(\w+))/;

As for the dim(0)*2 business, dim(0) is the number of elements in a 
capture, so this checks whether the peak is lopsided: whether the right 
side is more than twice as long as the left, or the left side is more 
than twice as long as the right.  The distribution you get from Planck's 
law is skewed in this way.
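
The peak split and the length comparison can be approximated in core Perl 
with plain arrays in place of piddles.  This is a sketch under stated 
assumptions: peak_split() is a hypothetical helper, the standard deviation 
is the population form (divide by N), and, unlike the + quantifier, an 
empty side is allowed.

```perl
use strict;
use warnings;
use List::Util qw(sum max first);

# Hypothetical helper: split the data around its global maximum, keeping
# the contiguous neighbors on each side that exceed mean + 0.1 * std dev.
sub peak_split {
    my @data = @_;
    my $mean = sum(@data) / @data;
    # Assumption: population standard deviation (divide by N, not N-1).
    my $std  = sqrt( sum(map { ($_ - $mean) ** 2 } @data) / @data );
    my $threshold = $mean + 0.1 * $std;

    my $top  = max(@data);
    my $peak = first { $data[$_] == $top } 0 .. $#data;

    # Walk outward from the maximum while values stay above the threshold.
    my ($lo, $hi) = ($peak, $peak);
    $lo-- while $lo > 0      && $data[$lo - 1] > $threshold;
    $hi++ while $hi < $#data && $data[$hi + 1] > $threshold;

    return ( [ @data[$lo .. $peak - 1] ], $top, [ @data[$peak + 1 .. $hi] ] );
}

my @data = (1, 1, 2, 6, 9, 7, 6, 5, 4, 3, 1, 1);
my ($left, $max, $right) = peak_split(@data);

# Array length stands in for dim(0): flag the peak as skewed when one
# side is more than twice as long as the other.
my $skewed = @$right > 2 * @$left || @$left > 2 * @$right;

print "left: @$left | max: $max | right: @$right\n";
print $skewed ? "skewed\n" : "roughly symmetric\n";
```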

David
