Regex.replace does not preserve the text encoding

Robert Woodhead Sat, 28 Oct 2006 12:25:50 -0700

http://www.realsoftware.com/feedback/viewreport.php?reportid=hfjoqjuf

No matter what the encoding of the input is, the string returned by regex.replace has no text encoding. This causes lots of hard-to-reproduce problems, and may in fact be the root cause of many of the other regex bug reports.

In my case, I had a little routine that extracted tokens from a string, processed them, and then used regex.replace to eliminate the token from the string, leaving a remainder, which was then rechecked to see if it had another token. It then added a line to a listBox (player) if the name of the player was not already there.

Because regex.replace killed the encoding, the comparison routine for the listbox was *sometimes* silently failing, and duplicate players were *sometimes* being added. What I think is going on is that the string comparison routines can compare a NO encoding string with UTF-8, but they silently fail when comparing a NO encoding string with UTF-16.
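
One way to picture that guess, as a rough Python sketch rather than anything taken from REALbasic's internals: if a comparison falls back to raw bytes when one side carries no encoding, an ASCII name matches its UTF-8 form byte for byte but never its UTF-16 form. The helper name raw_bytes_equal, and the assumption that the unlabeled string holds single-byte data, are hypothetical and not confirmed by the report.

```python
# Hypothetical model: a string with no declared encoding is just raw bytes.
def raw_bytes_equal(untagged: bytes, text: str, encoding: str) -> bool:
    """Compare untagged bytes against text as stored in the given encoding."""
    return untagged == text.encode(encoding)


# Assumed: what the replace step hands back once the encoding label is lost.
untagged_name = b"NEPTUNE"

matches_utf8 = raw_bytes_equal(untagged_name, "NEPTUNE", "utf-8")
matches_utf16 = raw_bytes_equal(untagged_name, "NEPTUNE", "utf-16-le")

print(matches_utf8)   # True: ASCII text has identical bytes in UTF-8
print(matches_utf16)  # False: UTF-16 stores two bytes per character
```

If this model is close to what happens, it would explain why only the UTF-16 rows in the listbox fail to match.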

Here is some output tracing from my app that illustrates the problem; comments refer to the lines after the comment.


// so far, so good.  The string I'm parsing is UTF-16

FMOVED2 line [(F148[NEPTUNE]-->W169 F66[NEPTUNE]-->W140)] has encoding UTF-16


// I extract the elements, and compute the remainder using regex.replace

    FMoved2 [66],[NEPTUNE],[140]
    Remainder [(F148[NEPTUNE]-->W169 )]

// I look for the player in a routine in the listbox. It finds him immediately
// and returns. In case you're wondering, this is an app that supports a
// venerable play-by-mail game called StarWeb

playerRow(NEPTUNE) UTF-16
  0:(NEPTUNE) UTF-16

// the next time through the loop, the string returned by regex.replace has no
// encoding.

FMOVED2 line [(F148[NEPTUNE]-->W169 )] has NO TextEncoding!

// regex still works, and gives me the token elements

    FMoved2 [148],[NEPTUNE],[169]
    Remainder [( )]

// but now, when checking to see if I already have the guy, the match does NOT
// succeed, and a duplicate player gets added!

playerRow(NEPTUNE) NO ENCODING
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8
  3:(MENTOR) UTF-8
  4:(QUASAR) UTF-16
  5:(ORION) UTF-16
ADDPLAYER [NEPTUNE],[UNKNOWN],[],[5]

// However, compare this with a situation where the encoding of the data in
// the listbox we are looking for is UTF-8

FMOVED2 line [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 F118[PROCYON]-->W218)] has encoding UTF-16

    FMoved2 [118],[PROCYON],[218]
    Remainder [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 )]
playerRow(PROCYON) UTF-16
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8

FMOVED2 line [(F199[PROCYON]-->W218 F140[PROCYON]-->W114 )] has NO TextEncoding!

    FMoved2 [140],[PROCYON],[114]
    Remainder [(F199[PROCYON]-->W218  )]

// here again, we have no encoding, but we find the guy, since he's UTF-8
// and not UTF-16

playerRow(PROCYON) NO ENCODING
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8
FMOVED2 line [(F199[PROCYON]-->W218  )] has NO TextEncoding!
    FMoved2 [199],[PROCYON],[218]
    Remainder [(  )]
playerRow(PROCYON) NO ENCODING
  0:(NEPTUNE) UTF-16
  1:(REGULUS) UTF-8
  2:(PROCYON) UTF-8

Summary: regex.replace returns strings without encoding. Strings without encoding cannot be compared to UTF-16 strings.

Workaround: record what the textEncoding was before the regex.replace, and then use defineEncoding (not convertEncoding!) to reset it. I wonder if this doesn't have some gotchas, however.
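
The difference between "define" and "convert" can be sketched in Python as an analogy (this is not REALbasic code): defining an encoding only reattaches a label to bytes that stay untouched, while converting would rewrite the bytes. The sketch assumes the replace step returns the original bytes with the label stripped, which the report does not confirm. The last step shows one possible gotcha: if the returned bytes were actually in a different encoding, reattaching the recorded label would garble the text.

```python
# Python analogy only: bytes.decode(label) plays the role of "define encoding".
original = "F148[NEPTUNE]-->W169"
recorded_encoding = "utf-16-le"  # noted before the replace step

# Assumed: the replace step keeps the bytes but drops the encoding label.
unlabeled = original.encode(recorded_encoding)

# "Define": interpret the same bytes under the recorded label. Nothing is rewritten.
redefined = unlabeled.decode(recorded_encoding)

# Possible gotcha: if the step had really returned UTF-8 bytes, reattaching the
# UTF-16 label would yield different text instead of raising an error.
wrong_guess = original.encode("utf-8").decode(recorded_encoding)

print(redefined == original)    # True
print(wrong_guess == original)  # False
```

So the workaround is only safe if the bytes coming back really are in the encoding that was recorded beforehand.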
