http://www.realsoftware.com/feedback/viewreport.php?reportid=hfjoqjuf

No matter what the encoding of the input is, the string returned by regex.replace has no text encoding. This causes lots of hard-to-reproduce problems, and may in fact be the root cause of many of the other regex bug reports.

In my case, I had a little routine that extracted tokens from a string, processed them, and then used regex.replace to eliminate each token from the string, leaving a remainder, which was then rechecked to see if it had another token. For each token, it added a row to a listBox of players if the name of the player was not already there.

Because regex.replace killed the encoding, the comparison routine for the listbox was *sometimes* silently failing, and duplicate players were *sometimes* being added. What I think is going on is that the string comparison routines can compare a string with no encoding against UTF-8, but they silently fail when comparing a string with no encoding against UTF-16.
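To illustrate why a pattern like this could arise, here is a small sketch in Python (an analogy only, not REALbasic's actual comparison code). It assumes the untagged string holds one byte per ASCII character and that the comparison falls back to comparing raw bytes; both are assumptions, and the helper name bytewise_equal is hypothetical. Under those assumptions, a plain ASCII name matches its UTF-8 form byte for byte but not its UTF-16 form.

```python
# Analogy only: an "untagged" string is modelled as raw bytes,
# one byte per ASCII character (an assumption, not a known fact
# about what regex.replace returns).
untagged = b"NEPTUNE"

# The same name as it would be stored in listbox rows with known encodings.
utf8_row = "NEPTUNE".encode("utf-8")
utf16_row = "NEPTUNE".encode("utf-16-be")


def bytewise_equal(a: bytes, b: bytes) -> bool:
    """Hypothetical fallback comparison that ignores encodings entirely."""
    return a == b


print(bytewise_equal(untagged, utf8_row))   # True: ASCII bytes are identical in UTF-8
print(bytewise_equal(untagged, utf16_row))  # False: UTF-16 uses two bytes per character
```

If something like this is what happens, it would explain why the failure only shows up for rows that happen to be UTF-16.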

Here is some output tracing from my app that illustrates the problem; comments refer to the lines after the comment.

// so far, so good.  The string I'm parsing is UTF-16

FMOVED2 line [(F148[NEPTUNE]-->W169 F66[NEPTUNE]-->W140)] has encoding UTF-16

// I extract the elements, and compute the remainder using regex.replace

    FMoved2 [66],[NEPTUNE],[140]
    Remainder [(F148[NEPTUNE]-->W169 )]

// I look for the player in a routine in the listbox. It finds him immediately,
// and returns. In case you're wondering, this is an app that supports a
// venerable play-by-mail game called StarWeb.

playerRow(NEPTUNE) UTF-16
  0:(NEPTUNE) UTF-16

// the next time through the loop, the string returned by regex.replace has no
// encoding.

FMOVED2 line [(F148[NEPTUNE]-->W169 )] has NO TextEncoding!

// regex still works, and gives me the token elements

    FMoved2 [148],[NEPTUNE],[169]
    Remainder [( )]

// but now, when checking to see if I already have the guy, the match does NOT
// succeed, and a duplicate player gets added!

playerRow(NEPTUNE) NO ENCODING
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8
  3:(MENTOR) UTF-8
  4:(QUASAR) UTF-16
  5:(ORION) UTF-16
ADDPLAYER [NEPTUNE],[UNKNOWN],[],[5]


// However, compare this with a situation where the listbox entry we are
// looking for is encoded as UTF-8

FMOVED2 line [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 F118 [PROCYON]-->W218)] has encoding UTF-16
    FMoved2 [118],[PROCYON],[218]
    Remainder [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 )]
playerRow(PROCYON) UTF-16
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8
FMOVED2 line [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 )] has NO TextEncoding!
    FMoved2 [140],[PROCYON],[114]
    Remainder [(F199[PROCYON]-->W218  )]

// here again, we have no encoding, but we find the guy, since he's UTF-8
// and not UTF-16

playerRow(PROCYON) NO ENCODING
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8
FMOVED2 line [(F199[PROCYON]-->W218  )] has NO TextEncoding!
    FMoved2 [199],[PROCYON],[218]
    Remainder [(  )]
playerRow(PROCYON) NO ENCODING
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8

Summary: regex.replace returns strings without encoding. Strings without encoding do not compare equal to UTF-16 strings containing the same text, although they do compare equal to UTF-8 ones.

Workaround: record what the textEncoding was before the regex.replace, and then use defineEncoding (not convertEncoding!) to reset it on the result. I wonder whether this has some gotchas, however.
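The difference between the two calls, and one possible gotcha, can be sketched in Python (again an analogy, not REALbasic code). The functions define_encoding and convert_encoding below are hypothetical stand-ins: the first only re-labels existing bytes, the second produces new bytes for the same characters. The gotcha shown is an assumption about what could go wrong: if the returned bytes are not actually in the encoding that was recorded, re-labelling them yields the wrong text.

```python
def define_encoding(raw: bytes, encoding: str) -> str:
    """Re-label: interpret the existing bytes as `encoding`; the bytes are not changed."""
    return raw.decode(encoding)


def convert_encoding(text: str, encoding: str) -> bytes:
    """Transcode: produce new bytes that represent the same characters."""
    return text.encode(encoding)


# Record the encoding before the call that loses it.
recorded = "utf-16-be"
original = "F148[NEPTUNE]-->W169".encode(recorded)

# Pretend a library call handed back the same bytes with the label lost.
untagged = bytes(original)

# Re-labelling with the recorded encoding recovers the text.
restored = define_encoding(untagged, recorded)
print(restored)  # F148[NEPTUNE]-->W169

# Possible gotcha: if the bytes are really single-byte ASCII,
# re-labelling them as UTF-16 pairs the bytes up into unrelated characters.
mislabelled = define_encoding(b"QUASAR", recorded)
print(mislabelled == "QUASAR")  # False
```

So the workaround seems safe only as long as the bytes that come back are still in the encoding that went in.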