ID: 30618
User updated by: webmaster at unitedscripters dot com
Reported By: webmaster at unitedscripters dot com
Status: Open
Bug Type: Regexps related
Operating System: Windows XPP
PHP Version: 5.0.2
New Comment:
Setting the PREG_OFFSET_CAPTURE flag in preg_match_all does exactly that:
each match is returned together with its offset in the input string.
I apologize for wrongly reporting this as a missing feature.
Luckily, this is the only wrong submission I have sent. I'll be more
careful in the future when I report something at around 5am Italian
time after 15 hours of coding!
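For anyone landing on this report, here is a minimal sketch of what the flag gives you, using the input string and pattern from the original description below. The helper name match_offsets is made up for illustration; only preg_match_all and PREG_OFFSET_CAPTURE are real PHP.

```php
<?php
// Hypothetical helper: returns the byte offset of every full-pattern match.
function match_offsets($regexp, $string)
{
    $offsets = array();
    // With PREG_OFFSET_CAPTURE each entry is array(matched text, byte offset).
    preg_match_all($regexp, $string, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as $match) {
        $offsets[] = $match[1];
    }
    return $offsets;
}

$in = "A thesaurus for the pupil";
print_r(match_offsets("/the\\b/", $in)); // the article "the" starts at offset 16
```

Note that the offsets are counted in bytes, so they may not line up with character positions in multibyte strings.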
Previous Comments:
------------------------------------------------------------------------
[2004-10-29 23:39:40] webmaster at unitedscripters dot com
Description:
------------
Object: FINDING INDEX POSITIONS OF A REGULAR EXPRESSION MATCH IS
APPARENTLY A NON-AVAILABLE FEATURE
I might be wrong, but PHP apparently lacks a way to find not only the
matches but also their _index_ positions within a string.
At first I thought that, once preg_match_all had found the matches, all
one had to do to get their index positions in the input string was to
iterate over the returned array of matches and repeatedly locate each
match in the string with strpos, removing the already inspected
substring.
Though this may seem an obvious idea, it may not work.
The position found by a string-oriented function is not necessarily
the same position found by a regular-expression-oriented function.
Consider this example. The input string is:
"A thesaurus for the pupil"
whereas the regular expression searches for:
"/the\\b/"
which is obviously the string "the" followed by a word boundary (\\b).
preg_match_all would report, correctly, only the isolated article
"the", because only that occurrence is followed by a word boundary.
But attempting to retrieve the index position of that match with strpos
would report the index position of THEsaurus.
So do _not_ use strpos in combination with preg_match_all to retrieve
the index positions of the matches: it won't work the expected way.
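A minimal sketch of the mismatch, assuming the same input string as above: strpos is given only the matched text, so it has no notion of the word boundary and stops at the first "the" it meets.

```php
<?php
$in = "A thesaurus for the pupil";

// strpos() knows nothing about \b, so it finds the first "the" anywhere.
$pos = strpos($in, "the");

echo $pos, "\n";                 // 2
echo substr($in, $pos, 9), "\n"; // thesaurus
```

The position reported lies inside "thesaurus", not at the isolated article that the regular expression matched.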
Reproduce code:
---------------
function foo($string, $regexp){
    $found = 0;            /* offset at which the next search starts */
    $indexes = array();
    preg_match_all($regexp, $string, $matches);
    print("<strong>".$matches[0][0]."</strong>");
    $matchSize = sizeof($matches[0]);
    for($m = 0; $m < $matchSize; $m++){
        preg_match($regexp, $string, $specificMatch, PREG_OFFSET_CAPTURE, $found);
        /* shortcoming: strpos ignores the \b, so this is not a real index */
        $indexes[$m] = $found + strpos(substr($string, $found), $specificMatch[0][0]);
        /* continue after the text just located; $matches[0][$m] is the m-th match */
        $found = $indexes[$m] + strlen($matches[0][$m]);
    }
    return $indexes;
}
$in="A thesaurus for the pupil";
print "In string <strong>$in</strong>, match is: ";
$out=foo($in, "/the\\b/");
print "<br>Wrong Index reported: ";
print_r($out);
Expected result:
----------------
The result is correct; what we lack, and _apparently_ cannot even
implement ourselves, is a way to grab the correct index of a
regular expression match.
Whatever the case, the feature is needed. JavaScript has it: the
regular-expression-oriented method named search() reports the index of
the first match, and thus can be used repeatedly on gradually shrinking
substrings of the input string to retrieve the positions of all the
matches.
If there is a way and I was not aware of it, I apologize. Yet the list
of Perl-compatible regexp functions clearly lacks one for retrieving the
indexes.
------------------------------------------------------------------------