On 4 Aug 2011, at 11:50PM, Simon Urbanek wrote: > > On Aug 4, 2011, at 5:26 PM, Michael Lachmann wrote: > >> Hi, >> >> I was trying to have R read files faster with readChar(). That was before I >> noticed that readChar() is not that bad! In any case, below I suggest a few >> simple changes that will make readChar slightly faster. >> >> I followed readChar(useBytes=T), and tried to identify all O(N) operations, >> where N is the size of the file. The assumption is that for LARGE files we >> want to avoid any O(N) operations, and any O(N) memory allocations. >> > > I'm not sure it's really worth bothering with such optimizations, on my > machine I get
No it isn't worth it, you're right. Though 100MB is much smaller than my average file size. But you're right, readChar is quite efficient. > >> system.time(readChar("large.file",1e8,T)) > user system elapsed > 0.295 0.048 0.343 > > so that is a fraction of a second for 100MB of data. Besides, why in the > world would you want to use character vectors to store bytes? It's much more > efficient to work with raw vectors instead... > > AFAICS the only lesson from the list below is that we could add in an > internal function for safe_mkCharLenCE that doesn't check for NULs. But that > seems a little contrived for a very special case. Everything else is either > increasing complexity at the cost of safety or could lead to internal > inconsistencies (hashing is mandatory, otherwise you can't compare CHARSXPs). You're right. Then the harder changes aren't worth it, unless you manage to save on one of the allocations. > > BTW: your code doesn't do what you think it does - you'll force a pretty ugly > buffer overflow - But here I don't agree. memset and strlen can be dropped. > perfect illustration why optimization should be done only if really needed, > otherwise you are just likely to introduce bugs... Right again. I shouldn't really have bothered... but I did. Michael > Cheers, > Simon > > > >> Here they are: >> >> 1. In readFixedString in envir.c, an N sized vector is >> allocated, and memset to 0. O(N) >> >> 2. The file is read into the buffer with con->read O(N) (but this probably >> can't be dropped) >> >> 3. mkChar is called, which calls mkCharLenCE(name, strlen(name), CE_NATIVE); >> strlen is O(N) >> >> 4. In mkCharLenCE, a loop along the string looks for 0s to tell if the >> string includes NULs (notice that because strlen was called before, that >> can't really happen) O(N) >> >> 5. A hashcode is computed for the string to see if it is already in memory. >> That is an O(N) operation. >> >> 6. A Charsxp of size N is allocated >> >> 7. The data is copied to the Charsxp - O(N). >> >> So, as far as I could tell, in addition to the reading operation, 5 O(N) >> operations are done, and double the memory of what is needed is allocated. >> >> A couple of these operations are easy to drop: >> >> 1. One could only zero the memory beyond what was read, in case not N chars >> were read. >> >> 2. We know the length of the string, so we can call mkCharLenCE directly >> from readFixedString with the right length. >> >> Others could maybe be dropped. >> >> 3. Does one really need to look for 0s? >> In readFixedString there is a comment: >> /* String may contain nuls which we now (R >= 2.8.0) assume to be >> padding and ignore silently */ >> 4. If a file was just read, is it likely that it is in the hash? Is it worth >> paying the time for those people who read in the same file twice? >> >> Finally about the allocation. >> >> Could the Charsxp be allocated to begin with, and the data read straight >> into it? >> Then we'd save one extra allocation, and a memcpy. For that one would need >> something like mkEmptyCharLen. >> >> One could also allocate a slighly bigger memory region, and then pass that >> so that instead of allocating it a new the old pointer is used (?). >> >> In any case, here is an updated readFixedString(), which would drop 2 O(N) >> operations. >> >> --- >> static SEXP >> readFixedString(Rconnection con, int len, int useBytes) >> { >> SEXP ans; >> char *buf; >> int m; >> const void *vmax = vmaxget(); >> >> if(utf8locale && !useBytes) { >> int i, clen; >> char *p, *q; >> >> p = buf = (char *) R_alloc(MB_CUR_MAX*len+1, sizeof(char)); >> memset(buf, 0, MB_CUR_MAX*len+1); >> for(i = 0; i < len; i++) { >> q = p; >> m = con->read(p, sizeof(char), 1, con); >> if(!m) { if(i == 0) return R_NilValue; else break;} >> clen = utf8clen(*p++); >> if(clen > 1) { >> m = con->read(p, sizeof(char), clen - 1, con); >> if(m < clen - 1) error(_("invalid UTF-8 input >> in readChar()")); >> p += clen - 1; >> /* NB: this only checks validity of multi-byte characters */ >> if((int)mbrtowc(NULL, q, clen, NULL) < 0) >> error(_("invalid UTF-8 input in >> readChar()")); >> } >> } >> } else { >> buf = (char *) R_alloc(len+1, sizeof(char)); >> //memset() was here >> m = con->read(buf, sizeof(char), len, con); >> if(m < len ) >> memset(buf, m+1, len+1); // changed >> if(len && !m) return R_NilValue; >> } >> /* String may contain nuls which we now (R >= 2.8.0) assume to be >> padding and ignore silently */ >> ans = mkCharLenCE(buf, len, CE_NATIVE); // changed (one could also use >> no. read bytes as size) >> vmaxset(vmax); >> return ans; >> } >> -- >> >> The other changes are also not that hard - I'd do them if people think such >> changes should be included.... >> >> >> Thanks for listening, >> >> Michael Lachmann >> >> ______________________________________________ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > > > ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel