On 4 Aug 2011, at 11:50PM, Simon Urbanek wrote:

> 
> On Aug 4, 2011, at 5:26 PM, Michael Lachmann wrote:
> 
>> Hi,
>> 
>> I was trying to have R read files faster with readChar(). That was before I 
>> noticed that readChar() is not that bad! In any case, below I suggest a few 
>> simple changes that will make readChar slightly faster.
>> 
>> I followed readChar(useBytes=T), and tried to identify all O(N) operations, 
>> where N is the size of the file. The assumption is that for LARGE files we 
>> want to avoid any O(N) operations, and any O(N) memory allocations.
>> 
> 
> I'm not sure it's really worth bothering with such optimizations, on my 
> machine I get

No it isn't worth it, you're right. Though 100MB is much smaller than my 
average file size. But you're right, readChar is quite efficient.

> 
>> system.time(readChar("large.file",1e8,T))
>   user  system elapsed 
>  0.295   0.048   0.343 
> 
> so that is a fraction of a second for 100MB of data. Besides, why in the 
> world would you want to use character vectors to store bytes? It's much more 
> efficient to work with raw vectors instead...
> 
> AFAICS the only lesson from the list below is that we could add in an 
> internal function for safe_mkCharLenCE that doesn't check for NULs. But that 
> seems a little contrived for a very special case. Everything else is either 
> increasing complexity at the cost of safety or could lead to internal 
> inconsistencies (hashing is mandatory, otherwise you can't compare CHARSXPs).

You're right. Then the harder changes aren't worth it, unless you manage to 
save on one of the allocations. 

> 
> BTW: your code doesn't do what you think it does - you'll force a pretty ugly 
> buffer overflow - 

But here I don't agree. memset and strlen can be dropped. 

> perfect illustration why optimization should be done only if really needed, 
> otherwise you are just likely to introduce bugs...


Right again. I shouldn't really have bothered... but I did.

Michael


> Cheers,
> Simon
> 
> 
> 
>> Here they are:
>> 
>> 1. In readFixedString in envir.c, an N sized vector is 
>> allocated, and memset to 0. O(N)
>> 
>> 2. The file is read into the buffer with con->read O(N) (but this probably 
>> can't be dropped)
>> 
>> 3. mkChar is called, which calls mkCharLenCE(name, strlen(name), CE_NATIVE);
>>  strlen is O(N)
>> 
>> 4. In mkCharLenCE, a loop along the string looks for 0s to tell if the 
>> string includes NULs (notice that because strlen was called before, that 
>> can't really happen) O(N)
>> 
>> 5. A hashcode is computed for the string to see if it is already in memory. 
>> That is an O(N) operation.
>> 
>> 6. A Charsxp of size N is allocated
>> 
>> 7. The data is copied to the Charsxp - O(N).
>> 
>> So, as far as I could tell, in addition to the reading operation, 5 O(N) 
>> operations are done, and double the memory of what is needed is allocated.
>> 
>> A couple of these operations are easy to drop:
>> 
>> 1. One could only zero the memory beyond what was read, in case not N chars 
>> were read.
>> 
>> 2. We know the length of the string, so we can call mkCharLenCE directly 
>> from readFixedString with the right length. 
>> 
>> Others could maybe be dropped.
>> 
>> 3. Does one really need to look for 0s?
>> In readFixedString there is a comment:
>>   /* String may contain nuls which we now (R >= 2.8.0) assume to be
>>      padding and ignore silently */
>> 4. If a file was just read, is it likely that it is in the hash? Is it worth 
>> paying the time for those people who read in the same file twice? 
>> 
>> Finally about the allocation.
>> 
>> Could the Charsxp be allocated to begin with, and the data read straight 
>> into it?
>> Then we'd save one extra allocation, and a memcpy. For that one would need 
>> something like mkEmptyCharLen.
>> 
>> One could also allocate a slighly bigger memory region, and then pass that 
>> so that instead of allocating it a new the old pointer is used (?).
>> 
>> In any case, here is an updated readFixedString(), which would drop 2 O(N) 
>> operations.
>> 
>> ---
>> static SEXP
>>      readFixedString(Rconnection con, int len, int useBytes)
>> {
>>      SEXP ans;
>>      char *buf;
>>      int  m;
>>      const void *vmax = vmaxget();
>> 
>>      if(utf8locale && !useBytes) {
>>              int i, clen;
>>              char *p, *q;
>> 
>>              p = buf = (char *) R_alloc(MB_CUR_MAX*len+1, sizeof(char));
>>              memset(buf, 0, MB_CUR_MAX*len+1);
>>              for(i = 0; i < len; i++) {
>>                      q = p;
>>                      m = con->read(p, sizeof(char), 1, con);
>>                      if(!m) { if(i == 0) return R_NilValue; else break;}
>>                      clen = utf8clen(*p++);
>>                      if(clen > 1) {
>>                              m = con->read(p, sizeof(char), clen - 1, con);
>>                              if(m < clen - 1) error(_("invalid UTF-8 input 
>> in readChar()"));
>>                              p += clen - 1;
>>              /* NB: this only checks validity of multi-byte characters */
>>                              if((int)mbrtowc(NULL, q, clen, NULL) < 0)
>>                                      error(_("invalid UTF-8 input in 
>> readChar()"));
>>                      }
>>              }
>>      } else {
>>              buf = (char *) R_alloc(len+1, sizeof(char));
>>  //memset() was here
>>              m = con->read(buf, sizeof(char), len, con);
>>              if(m < len )
>>                      memset(buf, m+1, len+1); // changed
>>              if(len && !m) return R_NilValue;
>>      }
>>      /* String may contain nuls which we now (R >= 2.8.0) assume to be
>>      padding and ignore silently */
>>      ans = mkCharLenCE(buf, len, CE_NATIVE); // changed (one could also use 
>> no. read bytes as size)
>>      vmaxset(vmax);
>>      return ans;
>> }
>> --
>> 
>> The other changes are also not that hard - I'd do them if people think such 
>> changes should be included....
>> 
>> 
>> Thanks for listening,
>> 
>> Michael Lachmann
>> 
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> 
> 
> 
> 

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to