[power-pro] Re: Unicode bugs? Bruce: re ++)

Sheri Tue, 11 Aug 2009 12:02:44 -0700

--- In [email protected], "entropyreduction" 
<alancampbelllists+ya...@...> wrote:
>
> --- In [email protected], "Sheri" <sherip99@> wrote:
> >
> > Unicode.from_num is giving an error that 2 or more arguments are required.
> > 
> > e.g.,
> > local patternu=unicode.from_num(0x2153)
> 
> 
> fixed see unicodePlugin0.73_090811.zip
> in 
> http://tech.groups.yahoo.com/group/power-pro/files/0_TEMP_/AlansPluginProvisional/
>


yes, above works in new unicode edition.

> > 
> > My thought was perhaps user could use unicode to set and unset a global 
> > variable that regex could read. If regex sets the variable, it doesn't mean 
> > unicode is loaded.
> 
> Ok, how about this: your worry seems to be that regex plugin always drags in 
> unicode plugin, even though it's seldom needed.
> 
> So I've compiled the few lines of code from unicode source that are needed to 
> recognise a unicode handle.  Unicode plugin isn't loaded unless a unicode 
> handle is identified within regex.  Just-in-time loading of unicode dll.  
> That do?
> 
> If so no need for regex.allow_unicode_handles, I'll take it out.

As far as I can tell, it isn't currently in.

In addition to my worry that regex was unnecessarily loading the unicode 
plugin, it concerns me that regex is doing extra work. It sounds like it will 
now be checking for unicode handles on every call (even though it will never 
find any). So I would still like a way to opt out.
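
To make the two concerns concrete (loading only on demand, plus a switch that 
skips the handle check entirely), here is a minimal sketch in Python rather 
than plugin code. Every name in it (LazyUnicodeSupport, allow_handles, the 
"U:" handle prefix) is hypothetical and only illustrates the idea; it is not 
how either plugin is implemented.

```python
# Hypothetical sketch: none of these names come from the regex or unicode plugins.
class LazyUnicodeSupport:
    def __init__(self, loader, is_handle, allow_handles=True):
        self._loader = loader               # called at most once, on first real need
        self._is_handle = is_handle         # cheap check compiled into the caller
        self.allow_handles = allow_handles  # the opt-out switch
        self._plugin = None
        self.load_count = 0                 # how many times the helper was loaded
        self.checks = 0                     # how many handle checks were performed

    def resolve(self, arg):
        """Return plain text for arg, loading the helper only for real handles."""
        if not self.allow_handles:
            return arg                      # opted out: no handle check at all
        self.checks += 1
        if not self._is_handle(arg):
            return arg                      # ordinary string: helper stays unloaded
        if self._plugin is None:
            self._plugin = self._loader()   # just-in-time load
            self.load_count += 1
        return self._plugin(arg)


def demo_loader():
    # Stand-in for loading a DLL: returns a function that decodes a fake handle.
    return lambda handle: handle[len("U:"):]


def looks_like_handle(arg):
    # Assumed handle format for the demo only.
    return isinstance(arg, str) and arg.startswith("U:")


support = LazyUnicodeSupport(demo_loader, looks_like_handle)
print(support.resolve("plain text"), support.load_count)
print(support.resolve("U:from a handle"), support.load_count)
```

With allow_handles set to False, resolve returns its argument untouched and 
the checks counter stays at zero, which is the opt-out behaviour asked for 
above.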

Format/replacement string doesn't work any more. For example:

win.debug(regex.version)
;regex.allow_unicode_handles(0) ;;doesn't work
local subjectu=unicode.from_nums(0x00BC,0x0020,0x2153,0x00A0,0x2154)
win.debug(?"2/3 char as utf8 from backref",regex.pcrematchall(?"\x{2154}", unicode.to_utf8(subjectu), "$0", "utf8"))
win.debug(?"2/3 char as utf8 from backref",regex.pcrematchall(?"\x{2154}", unicode.to_utf8(subjectu), "$0", "utf8"))
win.debug(?"2/3 char as utf8 from unicode", unicode.from_num(0x2154).to_utf8)

Output using regex 206

2060 2009-01-20
2/3 char as utf8 from backref â"
2/3 char as utf8 from backref â"
2/3 char as utf8 from unicode â"

Output using regex 207

2070 2009-08-11
2/3 char as utf8 from backref â"
2/3 char as utf8 from backref $0
2/3 char as utf8 from unicode â"

I got the above output only once (i.e., the backref worked the first time 
after a boot); since then it's been:

2070 2009-08-11
2/3 char as utf8 from backref $0
2/3 char as utf8 from backref $0
2/3 char as utf8 from unicode â"

Also, I just tried regexplugintest, and it looks like the format string is 
currently broken everywhere, not just with the utf8 option. There are over 50 
conflicts when comparing output logs. It looks like all the pcrereplace and 
pcrematchall tests failed.
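
For reference, the behaviour the tests expect from a "$0" format string can 
be sketched as follows. This is Python's re module, not the regex plugin: 
match_all is a hypothetical stand-in for pcrematchall, and Python writes the 
code point as \u2154 where PCRE uses \x{2154}.

```python
import re


def match_all(pattern, subject, fmt="$0", sep="\n"):
    """Format every match of pattern, expanding $0..$9 in fmt (hypothetical helper)."""
    def expand(m):
        def group(ref):
            n = int(ref.group(1))
            # Unknown or unmatched groups expand to an empty string.
            return (m.group(n) or "") if n <= m.re.groups else ""
        return re.sub(r"\$(\d)", group, fmt)
    return sep.join(expand(m) for m in re.finditer(pattern, subject))


# Same code points as the subject in the test script above.
subject = "".join(chr(n) for n in (0x00BC, 0x0020, 0x2153, 0x00A0, 0x2154))

result = match_all("\u2154", subject, "$0")
print(result.encode("utf-8"))   # the three UTF-8 bytes of U+2154
print(result == "$0")           # a literal "$0" here would be the failure seen with 207
```

A working format string yields the matched character itself; getting the 
text "$0" back means the backreference was never expanded.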

The added value of supporting unicode handles in regex services is low, so 
rather than fighting to make it work, you could just remove it, especially if 
making it work is difficult. Anybody using the utf8 option for pcre is going 
to be familiar with the utf8 services of the unicode plugin (and I will add 
some examples to the regex documentation).

> 
> BTW I seem to be parsing config ini file for
> 
> defaultmatchseparator        
> defaultutf8matchseparator  
> 
> But not using them for anything, or remembering result.  Redundant code?

Sounds like they can be safely removed, since you say they are unused anyway. 
No such ini keys are documented.

> > For test purposes I just tried putting a unicode handle into
> > pattern and subject (both worked). I also tried putting a unicode
> > handle in the replacement string, but I got the error:
> > regex.pcreReplace: PCRE exec failed Matching error -3
> > Programminng Error: PCRE_ERROR_BADOPTION
> 
> This seemed to work in regexPlugin207_090811.zip
>  

Stopped giving that error, but unfortunately also stopped giving the right 
answers.

> > The option was "utf8". Also tried "u". Worked fine as long as the 
> > replacement string was not a unicode handle (including if the replacement 
> > string was decoded from a unicode handle to a utf8 string).
> > 
> > Is it safe to chain the utf8 operations as done below?
> > 
> > local subjectstring=unicode.from_nums(0x00BC,0x0020,0x2153,0x00A0,0x2154).to_utf8
> > ;local replaceu=unicode.from_num(0x2154);; fails
> > local replaceu=unicode.new(" ")
> > unicode.default_get_set_type("numeric")
> > replaceu[0]=0x2154
> > local replacestring=unicode.to_utf8(replaceu)
> > ;local test=regex.pcrereplace(?"\x{2153}", subjectstring, replaceu, "utf8") ;;fails
> > local test=regex.pcrereplace(?"\x{2153}", subjectstring, replacestring, "utf8")
> > unicode.messagebox("OK", unicode.from_utf8(test))
> > local test=regex.pcrereplace(?"\x{2154}", test, ?"2/3", "utf8")
> > win.debug(unicode.from_utf8(test).to_ascii)
> 
> Um, that last one's weird.  unicode.from_utf8(test) returns a string.
> I'm surprised it worked (because handle syntax shouldn't work when the object 
> (left side) isn't a unicode handle).
>

Looks like a handle if you debug it. I think the table in the doc is in error. 
Elsewhere it says:

version .37: added service create_from_utf8 that returns a handle to a UTF-16 
string.
version .54: renamed create_from_utf8 to from_utf8
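
The chained replacement quoted earlier can be mirrored in Python as a sanity 
check on the expected result. This is only an analogy: str.encode and re.sub 
stand in for the unicode and regex plugin services, and mapping non-ASCII 
characters to "?" in the last step is an assumption, since to_ascii may 
behave differently.

```python
import re

# Build the subject from the same code points and take it through UTF-8.
subject = "".join(chr(n) for n in (0x00BC, 0x0020, 0x2153, 0x00A0, 0x2154))
subject_utf8 = subject.encode("utf-8")

# Replacement text decoded to a plain UTF-8 string first (the variant that worked).
replacement_utf8 = chr(0x2154).encode("utf-8")

# Step 1: replace the 1/3 character (U+2153) with the 2/3 character (U+2154).
step1 = re.sub("\u2153", replacement_utf8.decode("utf-8"), subject_utf8.decode("utf-8"))

# Step 2: feed the first result back in and replace U+2154 with the text "2/3".
step2 = re.sub("\u2154", "2/3", step1)

# Assumed ASCII conversion: anything outside ASCII becomes "?".
ascii_text = step2.encode("ascii", errors="replace").decode("ascii")
print(ascii_text)
```

Feeding the output of the first replacement into the second is safe here 
because each step takes and returns an ordinary string.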

Regards,
Sheri
