On 01.03.2010 19:42, Patrick Gundlach wrote:
>> I would expect the positions of 'b' to be 2 and 3, respectively, as
>> those are the lengths of the strings as returned by unicode.utf8.len.
>> However, unicode.utf8.find seems to have another notion of the
>> length of a string.
> It is documented. (Well, sort of: you need to download the slnunicode
> library and look into 'unittest'.)
Thanks for the pointer!
> -- NOTE: find positions are in bytes for all ctypes!
> -- use ascii.sub to cut found ranges!
Hmm, neither do I want to cut something nor do I have a range available.
I just want to count. Attached is my attempt at a UTF-8 aware find
function, based on the UTF-8 aware parts of slnunicode. Comments and
improvements are welcome!
> -- this is a) faster b) more reliable
But leaves this simple case uncovered. :/
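For reference, the byte-vs-character discrepancy discussed above can be reproduced with the standard utf8 library shipped with Lua 5.3+ (a sketch independent of slnunicode; plain string.find and # always work on bytes):

```lua
-- Lua 5.3+: the standard utf8 library counts characters,
-- while # and string.find operate on raw bytes.
local s = "böö"                       -- 'ö' occupies two bytes in UTF-8
print(utf8.len(s))                    --> 3  (characters)
print(#s)                             --> 5  (bytes)
print(string.find(s, "öö", 1, true))  --> 2  5  (byte positions)
```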
Best regards,
Stephan Hennig
function utf8_find(str, pattern, start)
  local len_pat = unicode.utf8.len(pattern)
  local s = unicode.utf8.sub(str, start)
  -- search for first occurrence of pattern
  s = unicode.utf8.match(s, "^.-" .. pattern)
  local fin = s and start + unicode.utf8.len(s) - 1
  return fin and fin - len_pat + 1, fin
end
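One pitfall in utf8_find above: len_pat counts the characters of the pattern string itself, so a pattern containing Lua pattern items (e.g. 'b+' or '%a') would make it disagree with the length of the actual match. Capturing the match and measuring that avoids the problem. A byte-based sketch using only the plain string library (the helper find_via_capture is my illustration, not from the mail):

```lua
-- Measure the captured match instead of the pattern text,
-- so Lua pattern items (e.g. 'b+') don't skew the end position.
local function find_via_capture(str, pattern, init)
  local s = string.sub(str, init)
  local prefix, match = string.match(s, "^(.-)(" .. pattern .. ")")
  if not prefix then return nil end
  local first = init + #prefix
  return first, first + #match - 1
end

print(find_via_capture("abcabc", "b+c", 1))  --> 2  3
```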
function showMatches(s, pattern)
  io.write("pattern '" .. pattern .. "' at positions")
  local start, fin = 0, 0
  while true do
    start, fin = utf8_find(s, pattern, start + 1)
    if not start then break end
    io.write(" (" .. start .. "," .. fin .. ")")
  end
  io.write("\n")
end
io.input("words.utf8")
for line in io.lines() do
  print("line = " .. line)
  print("len(line) = " .. unicode.utf8.len(line))
  showMatches(line, "ä")
  showMatches(line, "ö")
  showMatches(line, "öö")
  print()
end
#böö#bb#
ö#ä#öööbbb#ööb##ö