http://www.realsoftware.com/feedback/viewreport.php?reportid=hfjoqjuf
No matter what the encoding of the input is, the string returned by
regex.replace has no text encoding. This causes lots of hard-to-
reproduce problems, and may in fact be the root cause of many of the
other regex bug reports.
In my case, I had a little routine that extracted tokens from a
string, processed them, and then used regex.replace to eliminate
the token from the string, leaving a remainder, which was then
rechecked to see if it had another token. The routine then added a
line for the player to a listBox if the player's name was not
already there.
Because regex.replace killed the encoding, the comparison routine for
the listbox was *sometimes* silently failing, and duplicate players
were *sometimes* being added. What I think is going on is that the
string comparison routines can compare a NO encoding string with
UTF-8, but they silently fail when comparing a NO encoding string
with UTF-16.
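
One way the asymmetry described above could arise, sketched in Python rather than REALbasic. This assumes (the report does not confirm it) that a string with no encoding is compared byte-for-byte against the other string's raw bytes. For a pure-ASCII name such as NEPTUNE, the UTF-8 bytes are identical to the ASCII bytes, while the UTF-16 bytes are not. The helper name `naive_equal` is hypothetical.

```python
def naive_equal(untagged: bytes, text: str, encoding: str) -> bool:
    """Hypothetical byte-wise comparison: no transcoding is attempted."""
    return untagged == text.encode(encoding)

# Assumption: the untagged result holds plain single-byte ASCII data.
untagged_name = b"NEPTUNE"

matches_utf8 = naive_equal(untagged_name, "NEPTUNE", "utf-8")
matches_utf16 = naive_equal(untagged_name, "NEPTUNE", "utf-16-be")

print(matches_utf8)   # True
print(matches_utf16)  # False
```

If the comparison works anything like this, it would explain why only the UTF-16 rows in the trace below fail to match.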
Here is some output tracing from my app that illustrates the problem;
comments refer to the lines after the comment.
// so far, so good. The string I'm parsing is UTF-16
FMOVED2 line [(F148[NEPTUNE]-->W169 F66[NEPTUNE]-->W140)] has encoding UTF-16
// I extract the elements, and compute the remainder using regex.replace
FMoved2 [66],[NEPTUNE],[140]
Remainder [(F148[NEPTUNE]-->W169 )]
// I look for the player in a routine in the listbox. It finds him immediately,
// and returns. In case you're wondering, this is an app that supports a
// venerable play-by-mail game called StarWeb
playerRow(NEPTUNE) UTF-16
0:(NEPTUNE) UTF-16
// the next time through the loop, the string returned by regex.replace has no
// encoding.
FMOVED2 line [(F148[NEPTUNE]-->W169 )] has NO TextEncoding!
// regex still works, and gives me the token elements
FMoved2 [148],[NEPTUNE],[169]
Remainder [( )]
// but now, when checking to see if I already have the guy, the match does NOT
// succeed, and a duplicate player gets added!
playerRow(NEPTUNE) NO ENCODING
0:(NEPTUNE) UTF-16
1:(REGULUS) UTF-8
2:(PROCYON) UTF-8
3:(MENTOR) UTF-8
4:(QUASAR) UTF-16
5:(ORION) UTF-16
ADDPLAYER [NEPTUNE],[UNKNOWN],[],[5 ]
// However, compare this with a situation where the encoding of the data in
// the listbox we are looking for is UTF-8
FMOVED2 line [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 F118[PROCYON]-->W218)] has encoding UTF-16
FMoved2 [118],[PROCYON],[218]
Remainder [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 )]
playerRow(PROCYON) UTF-16
0:(NEPTUNE) UTF-16
1:(REGULUS) UTF-8
2:(PROCYON) UTF-8
FMOVED2 line [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 )] has NO TextEncoding!
FMoved2 [140],[PROCYON],[114]
Remainder [(F199[PROCYON]-->W218 )]
// here again, we have no encoding, but we find the guy, since he's UTF-8
// and not UTF-16
playerRow(PROCYON) NO ENCODING
0:(NEPTUNE) UTF-16
1:(REGULUS) UTF-8
2:(PROCYON) UTF-8
FMOVED2 line [(F199[PROCYON]-->W218 )] has NO TextEncoding!
FMoved2 [199],[PROCYON],[218]
Remainder [( )]
playerRow(PROCYON) NO ENCODING
0:(NEPTUNE) UTF-16
1:(REGULUS) UTF-8
2:(PROCYON) UTF-8
Summary: regex.replace returns strings without encoding. Strings
without encoding cannot be compared to UTF-16 strings.
Workaround: record what the textEncoding was before the
regex.replace, and then use defineEncoding (not convertEncoding!) to
reset it. I wonder if this doesn't have some gotchas, however.
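
The difference between defining and converting an encoding can be sketched in Python. This is an analogy only, not REALbasic code, and the helper names `define_encoding` and `convert_encoding` are hypothetical stand-ins. Defining keeps the bytes and declares what they are; converting rewrites the bytes. The last line shows one possible gotcha: relabeling is only right if the bytes really are in the encoding you name, which this report does not establish for the strings regex.replace returns.

```python
def define_encoding(raw: bytes, encoding: str) -> str:
    """Hypothetical stand-in: keep the bytes, just declare what they are."""
    return raw.decode(encoding)

def convert_encoding(raw: bytes, source: str, target: str) -> bytes:
    """Hypothetical stand-in: rewrite the bytes from one encoding to another."""
    return raw.decode(source).encode(target)

remainder = "(F148[NEPTUNE]-->W169 )"
utf16_bytes = remainder.encode("utf-16-be")

# Relabeling bytes that really are UTF-16 recovers the text unchanged.
relabeled = define_encoding(utf16_bytes, "utf-16-be")

# Converting changes the byte sequence itself.
as_utf8 = convert_encoding(utf16_bytes, "utf-16-be", "utf-8")

# Possible gotcha: if the untagged bytes are actually single-byte data,
# declaring them UTF-16 misreads them as different characters.
mislabeled = define_encoding(b"NEPTUNE!", "utf-16-be")
```

So before relying on the workaround, it may be worth checking the byte length of the returned string against the encoding recorded beforehand.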