On Oct 28, 2010, at 9:48 AM, Andrew Piskorski wrote:

> On Thu, Oct 28, 2010 at 12:15:56AM -0400, Simon Urbanek wrote:
> 
>>> Reason I ask, is I've written some R code which allocates two long
>>> lists, and then calls a C function with .Call.  My C code writes to
>>> those two pre-allocated lists,
> 
>> That's bad! All arguments are essentially read-only so you should
>> never write into them! 
> 
> I don't see how.  (So, what am I missing?)  The R docs themselves
> state that the main point of using .Call rather than .C is that .Call
> does not do any extra copying and gives one direct access to the R
> objects.  (This is indeed very useful, e.g. to reorder a large matrix
> in seconds rather than hours.)
> 

Exactly - direct access without copying, which means that you are responsible 
for not modifying anything you don't own. Again, remember that R has 
pass-by-value semantics, so functions can never modify their arguments (at 
least from the user's point of view).


> I could allocate the two lists in my C code, but so far it was more 
> convenient to do so in R.  

You don't just allocate them, you also assign them to an environment which is 
where the trouble starts. Let's look at a very simple example:

/* do NOT do this, kids!! */
#include <Rinternals.h>

SEXP foo(SEXP x) {
  REAL(x)[0] = 1;  /* writes into the caller's object in place */
  return x;
}

If R were not performing any tricks behind the scenes, the expected behavior 
would, in theory, be:

> a = 0
> .Call("foo", a)
[1] 1
> a
[1] 0

The reason is that in the S language all arguments are passed by value, so 
.Call("foo", a) really means .Call("foo", 0): you only change the "0", not a. 
However, R tries to avoid copying, so the environment holding "a" *and* the 
argument passed to .Call share the same memory.

Now, why is it a bad idea to modify arguments? This is why (this was actually 
run in R):

> a = 0
> b = a
> .Call("foo", a)
[1] 1
> a
[1] 1
> b
[1] 1

Because R assumes that you don't mess with the arguments, it also optimizes b 
to point to the same object as a, which you then modify. Therefore, the moment 
you start modifying arguments all bets are off: you cannot know which objects 
have been optimized to share the same memory, so you don't know what else 
you'll modify. (More on how you can detect it further down.)
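
The aliasing can be mimicked in plain C with a toy model. This is only a 
sketch: toy_vec and the toy_* functions are made up for illustration and have 
nothing to do with R's real SEXP layout. It contrasts writing through the 
argument with copying first (in real R code the copy would normally be made 
with R's own duplicate() rather than malloc/memcpy).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an R numeric vector: NOT R's real SEXP layout. */
typedef struct {
    double *data;
    size_t  len;
} toy_vec;

/* Bad: writes through the argument, so every alias sees the change. */
toy_vec *toy_foo_inplace(toy_vec *x) {
    assert(x != NULL && x->len > 0);
    x->data[0] = 1;
    return x;
}

/* Better: copy first, modify the copy, return the copy.
   The caller owns the result and releases it with toy_free(). */
toy_vec *toy_foo_copy(const toy_vec *x) {
    assert(x != NULL && x->len > 0);
    toy_vec *out = malloc(sizeof *out);
    if (out == NULL) return NULL;
    out->data = malloc(x->len * sizeof *out->data);
    if (out->data == NULL) { free(out); return NULL; }
    memcpy(out->data, x->data, x->len * sizeof *out->data);
    out->len = x->len;
    out->data[0] = 1;   /* only the private copy is changed */
    return out;
}

/* Release a vector returned by toy_foo_copy(). */
void toy_free(toy_vec *v) {
    if (v != NULL) { free(v->data); free(v); }
}
```

If two pointers a and b refer to the same toy_vec (the analogue of "b = a"), 
toy_foo_inplace(a) changes what b sees as well, while toy_foo_copy(a) leaves 
both untouched.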

There are also logical problems with that:
> .Call("foo", 0)
[1] 1
How can you change a "0" constant to 1 ?!?


> What possible difference in behavior can there be between the two approaches?
> 

The only way to allocate vectors on the R side that are safe to write into is 
with things like numeric(10) passed directly as an argument - you may *not* 
assign the result anywhere. That's why .C uses a construct like 
.C(numeric(10), ...) to create result space for DUP=FALSE, but the only reason 
to do so is because it has no choice. You could call .Call(numeric(10), ...) 
but that sort of defeats the purpose and is somewhat dangerous from the user's 
point of view, since your C code would assume that nothing else (like a 
variable or a constant) is passed, but a "malicious" user could pass anything...


>> R has pass-by-value(!) semantics, so semantically your code has
>> nothing to do with the result.1 and result.2 variables since only
>> their *values* are guaranteed to be passed (possibly a copy).
> 
> Clearly C code called from .Call must be allowed to construct R
> objects, as that's how much of R itself is implemented, and further
> down, it's what you recommend I should do instead.
> 
> But why does it follow that C code must never modify an object
> initially allocated by R code?  Are you saying there is some special
> magic difference in the state of an object allocated by R's C code
> vs. one allocated by R code?  If so, what is it?
> 

It's the magic of all objects - regardless of where they are allocated - and it 
is essentially the NAMED bits that decide whether an object is to be copied or 
not. The object you passed from R was not "yours" in that it was shared between 
the environment you assigned it to (using result.1 <- ..) and your function. If 
you allocate it in C you know that it's not owned by anyone else, so you can 
safely modify it.

Now, we can go more into the internals and you can actually use NAMED to detect 
the cases. I'm still not recommending it for the use you mentioned (mostly 
because it may change without notice), but it should give you the full picture. 
Let's modify the example above by adding Rprintf("NAMED=%d\n", NAMED(x));

Here are the different cases:

> .Call("foo", numeric(1))
NAMED=0
[1] 1
# numeric(1) is a direct allocation so it has no reference

> a = numeric(1)
> .Call("foo", a)
NAMED=1
[1] 1
# numeric(1) was direct allocation then assigned to a - so it has one reference

> b = a
> .Call("foo", a)
NAMED=2
[1] 1
# the numeric(1) value in both a and b has now two references
# note that it is not a real reference count - it has only the three states 
# above, so removing b doesn't help

> .Call("foo", 1)
NAMED=2
[1] 1
# constants are always flagged to duplicate because they all could share memory 
# (the real story is a bit different but that's one explanation ;))

So if you wanted to optimize you could treat the above cases differently and, 
yes, using a=numeric(1); .Call("foo",a) *should* have NAMED=1 and thus be safe 
to modify - but I would worry about any code that doesn't check that since it 
can have unwanted effects without anyone noticing.
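
The three-state behaviour shown above can be written down as a toy model in 
plain C. This is a sketch of the description in this mail only, not R's actual 
implementation; the toy_* names are invented for the example.

```c
#include <assert.h>

/* Toy model of the NAMED flag as described above: it only takes the
   values 0, 1 and 2 and never goes back down. Not R's real code. */
enum { TOY_NAMED_MAX = 2 };

/* Called when a value gets bound to one more variable: saturates at 2. */
int toy_bind(int named) {
    assert(named >= 0 && named <= TOY_NAMED_MAX);
    return named < TOY_NAMED_MAX ? named + 1 : TOY_NAMED_MAX;
}

/* Removing a binding does not lower the flag: it is not a real count. */
int toy_unbind(int named) {
    assert(named >= 0 && named <= TOY_NAMED_MAX);
    return named;
}

/* Non-zero when the value may be reachable under more than one name,
   i.e. when writing into it in place could change something else. */
int toy_maybe_shared(int named) {
    return named >= TOY_NAMED_MAX;
}
```

Starting from 0 (a fresh numeric(1)), one toy_bind gives 1 ("a = ..."), a 
second gives 2 ("b = a"), and toy_unbind (removing b) leaves it at 2, so the 
value still has to be treated as shared.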


> What is the potential problem here, that the garbage collector will suddenly 
> run while my C code is in the middle of writing to an R list? Yes, if the gc 
> is going to move the object elsewhere, that would be very bad.

GC doesn't move anything - it only releases unreferenced objects.


>  But it looks to me like that cannot happen, because lots of the R 
> implementation itself would fail badly if it did.
> 
> E.g.:  The PROTECT call is used to increment reference counts,

There are no reference counts in R; PROTECT just adds the object to the 
protection stack (which is the same as adding it to any list or vector that is 
protected).


> but I see no guarantees that it is atomic with the operations that allocate 
> objects.  I see no mutexes or other barriers in C code to prevent the gc from 
> running, thus implying that it *can't* run until the C function completes. 
> And R is single threaded, of course.  But what about signal handlers, could 
> they ever invoke R's gc?

C code cannot be interrupted, exactly for this reason. However, gc can occur in 
any call to the R API, which is why PROTECT is needed in those cases.


> Also, I was initially surprised not to find any matrix C APIs, but grepping 
> for examples (sorry, I don't remember exactly which functions) showed me that 
> the apparently accepted way to do matrix operations from C is to simply 
> assume R's column-first dense matrix order, and access the 2D matrix as a 
> flat 1D vector.  (Which is easy.)
> 

Yes, that's what most sane programs handling matrices do ;).
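
For the record, the flat indexing can be sketched in plain C as follows. This 
is a standalone illustration, not R API code: the helper names cm_index and 
row_sum are made up for the example; the only thing taken from R is the 
column-major layout, where the zero-based element (i, j) of an nrow x ncol 
matrix sits at offset i + j*nrow.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of zero-based element (i, j) in a column-major nrow x ncol
   matrix stored as one flat vector. Hypothetical helper, not R API. */
size_t cm_index(size_t i, size_t j, size_t nrow, size_t ncol) {
    assert(i < nrow && j < ncol);   /* catch out-of-range indices early */
    return i + j * nrow;
}

/* Sum of row i: consecutive elements of a row are nrow apart in memory. */
double row_sum(const double *m, size_t i, size_t nrow, size_t ncol) {
    double s = 0.0;
    for (size_t j = 0; j < ncol; j++)
        s += m[cm_index(i, j, nrow, ncol)];
    return s;
}
```

With the six values 1..6 stored as a 2x3 matrix (what matrix(1:6, nrow = 2) 
gives in R), row_sum(m, 0, 2, 3) adds 1 + 3 + 5.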


>> The fact that internally R attempts to avoid copying for performance
>> reasons is the only reason why your code may have appeared to work,
>> but it's invalid!
> 
> I will probably change my code to allocate a new list from the C code
> and return that, as you recommend.  My main reason for doing the
> allocation in R was just that it was simpler, especially given the
> very limited documentation of R's C API.
> 
> But, I didn't see anything in the "Writing R Extensions" doc saying
> that what my code is doing is "invalid", and more importantly, I don't
> see why it would or should be invalid...
> 
> I'd still like to better understand why you think doing the initial
> allocation of an object in R rather than C code is such a problem.  So
> far, I don't see any way that the R interpreter could ever tell the
> difference.
> 
> Wait, or is the only objection here that I'm using C in a way that
> makes pass-by-reference semantics visible to my R code?  Which will
> work completely correctly, but is not The Proper R Way?
> 

See above - it breaks the assumptions that R makes so you can change things you 
don't intend to. Also the internal optimizations may change in the future so I 
would not count on it.


> I don't actually need pass-by-reference behavior here at all, but I
> can imagine cases where I might want it, so I'd like to understand
> your objections better.  Is using C to implement pass-by-reference
> actually Broken, or merely Ugly?  From my reasons above, I think it
> will always work correctly and thus is not Broken.  But of course
> given R's devotion to pass-by-value, it could be considered
> unacceptably Ugly.
> 


I hope this sheds some light on it.

Cheers,
Simon

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
