Duncan Murdoch wrote:
> On 07/08/2007 6:29 PM, Herve Pages wrote:
[...]
>> Same for serialization:
>>
>> > save(string0, file="string0.rda")
>> > load("string0.rda")
>> > string0
>> [1] "ABCD"
>
> Of these, I'd say the serialization is the only case where it would be
> reasonable to fix the behaviour.  R depends on C run-time functions for
> most of the string operations, and they'll stop at a null.  So if this
> isn't documented behaviour, it should be, but it's not reasonable to
> rewrite the C run-time string functions just to handle such weird
> objects.  Functions like "grep" require thousands of lines of code, not
> written by us, and in my opinion maintaining changes to it is not
> something the R project should take on.
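For anyone following the thread, here is a minimal C sketch of what "they'll stop at a null" means in practice. It is not taken from the R sources. The function names (stored_length, c_length, equal_by_strcmp, equal_by_bytes) are made up for this illustration, and only the standard strlen(), strcmp() and memcmp() calls are real library functions.

```c
#include <stddef.h>
#include <string.h>

/* The bytes of the string0 example: 'A' 'B' 'C' 'D' nul 'F', plus the
   terminating nul that the compiler appends (7 bytes in total). */
const char string0[] = "ABCD\0F";

/* Number of data bytes actually stored, excluding the final terminator. */
size_t stored_length(void) {
    return sizeof string0 - 1;
}

/* Length as the C run-time sees it: strlen() stops at the first nul. */
size_t c_length(void) {
    return strlen(string0);
}

/* strcmp() also stops at the first nul, so the trailing 'F' is never
   compared and string0 looks identical to "ABCD". */
int equal_by_strcmp(void) {
    return strcmp(string0, "ABCD") == 0;
}

/* A comparison that respects embedded nuls has to carry the length
   separately and compare the raw bytes (hypothetical helper). */
int equal_by_bytes(const char *other, size_t other_len) {
    return stored_length() == other_len
        && memcmp(string0, other, other_len) == 0;
}
```

With these definitions, stored_length() is 6 while c_length() is 4, and equal_by_strcmp() reports a match against "ABCD" even though equal_by_bytes("ABCD", 4) does not. Any code built on the nul-terminated functions inherits the first behaviour unless it tracks lengths itself.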

I was not (of course) suggesting that all the string manipulation functions
be fixed. I'm just wondering why R would try to support embedded nuls in the
first place, given that they can only be a source of trouble.

What about this:

  > string0
  [1] "ABCD\0F"
  > string0 == "ABCD"
  [1] TRUE

string0 is obviously different from "ABCD"!

Maybe it's easier to change the semantics of rawToChar() so it doesn't return
a string with embedded nuls. More generally speaking, base functions should
always return "clean" strings.

>
> As to serialization: there's a comment in the source that embedded
> nulls are handled by it, and that's true up to R-patched, but not in
> R-devel.  Looks like someone has introduced a bug.
>
> Duncan Murdoch
>>
>> One comment about the nchar man page:
>>   'chars' The number of human-readable characters.
>>
>> "human-readable" seems to be used for "everything but a nul" here
>> which can be confusing.
>> For example one would generally think of ascii codes 1 to 31 as non
>> "human-readable" but
>> nchar() seems to disagree:
>>
>> > string1 <- rawToChar(as.raw(1:31))
>> > string1
>> [1]
>> "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
>>
>> > nchar(string1, type="chars")
>> [1] 31
>
> No, "human-readable" also has other meanings in multi-byte encodings.  If
> an e-acute is encoded in two bytes in your locale, it still only counts
> as one human-readable character.
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel