On 01.03.2010 19:42, Patrick Gundlach wrote:

I would expect the positions of 'b' to be 2 and 3, respectively, as those
are the lengths of the strings as returned by unicode.utf8.len.
However, unicode.utf8.find seems to have another notion of the
length of a string.

It is documented (well, sort of: you need to download the slnunicode
library and look into 'unittest').

Thanks for the pointer!


--      NOTE: find positions are in bytes for all ctypes!
--      use ascii.sub to cut found ranges!

Hmm, I neither want to cut anything nor do I have a range available. I just want to count. Attached is my attempt at a UTF-8-aware find function based on the UTF-8-aware parts of slnunicode. Comments and improvements are welcome!
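
For comparison, a byte offset can also be turned into a character position by counting the characters in front of it. The following is only a minimal sketch in plain Lua, so it runs without slnunicode. It assumes valid UTF-8 input and a literal search string (no patterns), and the helper names utf8_char_len, char_to_byte and find_plain_chars are made up for illustration; they are not part of any library. The sample string is the same text as the line "#böö#bb#", with "ö" written as byte escapes so the example does not depend on the encoding of the source file.

```lua
-- Sketch only: utf8_char_len, char_to_byte and find_plain_chars are
-- hypothetical helper names, not part of slnunicode.

-- Count characters by counting every byte that is not a UTF-8
-- continuation byte (10xxxxxx). Assumes valid UTF-8 input.
function utf8_char_len(s)
   local n = 0
   for i = 1, #s do
      local b = s:byte(i)
      if b < 0x80 or b >= 0xC0 then n = n + 1 end
   end
   return n
end

-- Byte offset at which character number char_pos starts
-- (#s + 1 if the string has fewer characters).
function char_to_byte(s, char_pos)
   local n = 0
   for i = 1, #s do
      local b = s:byte(i)
      if b < 0x80 or b >= 0xC0 then
         n = n + 1
         if n == char_pos then return i end
      end
   end
   return #s + 1
end

-- Plain-text find (no Lua patterns); start and both results are
-- character positions, not byte positions.
function find_plain_chars(str, needle, start)
   local init = char_to_byte(str, start or 1)
   local b = string.find(str, needle, init, true)
   if not b then return nil end
   local first = utf8_char_len(string.sub(str, 1, b - 1)) + 1
   return first, first + utf8_char_len(needle) - 1
end

local oe = "\195\182"                       -- "ö" as UTF-8 bytes
local line = "#b" .. oe .. oe .. "#bb#"
print(find_plain_chars(line, "b"))          -- character positions
print(find_plain_chars(line, oe .. oe))     -- character positions
print(string.find(line, oe .. oe, 1, true)) -- byte positions, for comparison
```

The last call uses the byte-based string.find from the standard library, which makes the difference between the two notions of position visible on the same input.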


--      this is a) faster b) more reliable

But leaves this simple case uncovered. :/

Best regards,
Stephan Hennig
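function utf8_find(str, pattern, start)
   start = start or 1
   local len_pat = unicode.utf8.len(pattern)
   -- drop the first start-1 characters (not bytes)
   local tail = unicode.utf8.sub(str, start)
   -- search for first occurrence of pattern
   local prefix = unicode.utf8.match(tail, "^.-" .. pattern)
   -- character position of the end of the match within str
   local fin = prefix and start + unicode.utf8.len(prefix) - 1
   -- NOTE: assumes pattern is a literal string, so that the match is
   -- exactly len_pat characters long
   return fin and fin - len_pat + 1, fin
end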


function showMatches(s, pattern)
   io.write("pattern '" .. pattern .. "' at positions")
   local start, fin = 0, 0
   while true do
      start, fin = utf8_find(s, pattern, start + 1)
      if not start then break end
      io.write(" (" .. start .. "," .. fin .. ")")
   end
   io.write("\n")
end


io.input("words.utf8")
for line in io.lines() do
   print("line = " .. line)
   print("len(line) = " .. unicode.utf8.len(line))
   showMatches(line, "ä")
   showMatches(line, "ö")
   showMatches(line, "öö")
   print()
end
#böö#bb#
ö#ä#öööbbb#ööb##ö
