On 07/08/2007 6:29 PM, Herve Pages wrote:
> Duncan Murdoch wrote:
>> On 07/08/2007 5:06 PM, Herve Pages wrote:
>>> Hi,
>>>
>>> ?rawToChar
>>> 'rawToChar' converts raw bytes either to a single character string
>>> or a character vector of single bytes. (Note that a single
>>> character string could contain embedded nuls.)
>>>
>>> Allowing embedded nuls in a string might be an interesting experiment
>>> but it seems to cause some troubles to most of the string manipulation
>>> functions.
>>>
>>> A string with an embedded 0:
>>>
>>> raw0 <- as.raw(c(65:68, 0 , 70))
>>> string0 <- rawToChar(raw0)
>>>
>>>> string0
>>> [1] "ABCD\0F"
>>>
>>> nchar() should return 6:
>>>> nchar(string0)
>>> [1] 4
>> You don't state your R version. The default type of counting in nchar()
>> has recently changed from "bytes" (where 6 is correct) to "chars" (where
>> 4 is correct).
>
>
> Oops, sorry:
>
>> sessionInfo()
> R version 2.6.0 Under development (unstable) (2007-07-02 r42107)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] rcompgen_0.1-15
>
>
> And indeed:
> raw0 <- as.raw(c(65:68, 0 , 70))
> string0 <- rawToChar(raw0)
>
>> nchar(string0, type="chars")
> [1] 4
>> nchar(string0, type="bytes")
> [1] 6
>
>
> In addition to the string functions already mentioned before, it's worth
> noting that 'paste' doesn't seem to be "embedded nul aware" neither:
>
>> paste(string0, "G", sep="")
> [1] "ABCDG"
>
> Same for serialization:
>
>> save(string0, file="string0.rda")
>> load("string0.rda")
>> string0
> [1] "ABCD"
Of these, I'd say the serialization is the only case where it would be
reasonable to fix the behaviour. R depends on C run-time functions for
most of the string operations, and they'll stop at a null. So if this
isn't documented behaviour, it should be, but it's not reasonable to
rewrite the C run-time string functions just to handle such weird
objects. Functions like "grep" require thousands of lines of code, not
written by us, and in my opinion maintaining changes to it is not
something the R project should take on.

As to serialization: there's a comment in the source that embedded
nulls are handled by it, and that's true up to R-patched, but not in
R-devel. Looks like someone has introduced a bug.

Duncan Murdoch

>
> One comment about the nchar man page:
> 'chars' The number of human-readable characters.
>
> "human-readable" seems to be used for "everything but a nul" here which can
> be confusing.
> For example one would generally think of ascii codes 1 to 31 as non
> "human-readable" but nchar() seems to disagree:
>
>> string1 <- rawToChar(as.raw(1:31))
>> string1
> [1]
> "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
>> nchar(string1, type="chars")
> [1] 31

No, "human-readable" also has other meanings in multi-byte encodings. If
an e-acute is encoded in two bytes in your locale, it still only counts
as one human-readable character.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
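
As a footnote to the multi-byte point above, here is a minimal sketch of the difference between the two counting types. It assumes the e-acute is written with a \u escape, which R stores as UTF-8 (two bytes); the variable names are only illustrative and are not from the thread, apart from string1.

```r
# e-acute written with a \u escape; R stores this as UTF-8 (bytes c3 a9)
e_acute <- "\u00e9"

n_chars <- nchar(e_acute, type = "chars")  # counts characters
n_bytes <- nchar(e_acute, type = "bytes")  # counts storage bytes

# Control characters 1..31 from the quoted example: one byte each,
# and each one is still counted as a character by type = "chars"
string1 <- rawToChar(as.raw(1:31))
n_ctrl_chars <- nchar(string1, type = "chars")
n_ctrl_bytes <- nchar(string1, type = "bytes")

print(c(chars = n_chars, bytes = n_bytes))
print(c(chars = n_ctrl_chars, bytes = n_ctrl_bytes))
```

The two counts differ only for the e-acute, where one character takes two bytes. For the control characters they agree, which suggests "chars" is about decoding units of the encoding rather than about printability.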