On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> From the user's perspective, DT2 <- DT should either be a new copy or
> a new reference. Anything in between is confusing.

Agreed. With a picky caveat: even in base R, the copy isn't taken at this
point. It's taken later: copy-on-write. It's setkey and := that don't copy
on write, not the (earlier) <-.

> How about this - add a new argument to data.table(), say max.cols.
> max.cols defaults to a couple orders of magnitude above the initial
> number of columns. data.table allocates enough memory for max.cols
> column pointers. If you try to add more than max.cols columns, it is
> either an error, or it creates a copy and produces a warning.

Very nice idea. Over-allocating by default, so that := can add columns
fully by reference most of the time, seems good to me, since there's a
very low cost to over-allocating the vector of column pointers. Create
the (shallow) copy and issue a warning, I'm thinking, not an error.

The name "max.cols" seems a bit absolute; could it be "alloc.cols"? We
could have alloc(DT,2,ncol), or rowalloc(DT,n) and colalloc(DT,n), or
realloc(...), so users can over-allocate themselves before a loop that
adds columns or inserts rows. tables() could also report truenrow and
truencol as well as nrow and ncol.

What should alloc.cols be, by default? How about: max(100,2*ncol)

What about as.data.table.data.frame()? Should that over-allocate too, or,
for speed, just change the class attribute as it does now?

Maybe checking NAMED would work, in addition. If NAMED was 0, no need to
warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the
warning be necessary.

> On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
> <[email protected]> wrote:
> Interesting one. Adding columns is a bit different to deleting and
> modifying columns. Here's how it works. Could make changes, could
> document it, or both, what do people think?
>
> Just like data.frame there is a list vector holding pointers to the
> column vectors. A delete column op is done with a memmove to budge up
> the column pointers above the column by one place. That leaves a gap
> at the end.
> The length attribute of that vector (ncol(DT)) is then decremented
> and the spare 4 bytes (or 8 on 64bit) are left unused at the end.
>
> An add column can't be fully by reference because the list vector is
> full. A new list vector has to be allocated, one slot larger, the old
> pointers memcpy'd over, and the last spot assigned the pointer to the
> new column vector. This copying is negligible because it's a small
> list of pointers fitting well within one page. [Unless there are many
> 1000's of columns, which is why it's done as efficiently as possible
> using memcpy].
>
> Aside : There is a little-known (I guess) distinction between length
> and truelength in R internals. Base R doesn't use it, but we could in
> data.table. A delete column sets length but leaves truelength one
> larger. When the next add column comes along, it could just do the
> budge up and insert the column. That may not be so advantageous for
> (a small number of) columns, but the same logic could work for
> insert() and delete()ing rows. Of course, this would mean whether a
> visible copy is taken or not depends on what happened previously,
> rather than the syntax. That's something we've disliked before, in the
> same way we dislike drop=TRUE behaviour and so dropped drop. One way
> to approach this might be to advise ":= add *may* not copy. Best to
> assume it doesn't; use copy()". If you get in the habit of
> "DT2=copy(DT)" then that'll take a deep copy at the time and you're
> safe.
>
> To illustrate the partial copy (maybe shallow copy is a better term),
> consider the following :
>
> > DT = data.table(1:2,3:4)
> > DT2=DT
> > DT2[,y:=10L]
>      V1 V2  y
> [1,]  1  3 10
> [2,]  2  4 10
> > DT
>      V1 V2
> [1,]  1  3
> [2,]  2  4
> > DT2
>      V1 V2  y
> [1,]  1  3 10
> [2,]  2  4 10
> > DT2[1,V1:=99L]
>      V1 V2  y
> [1,] 99  3 10
> [2,]  2  4 10
> > DT
>      V1 V2
> [1,] 99  3
> [2,]  2  4
>
> Matthew
>
> On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
> > I think this is a bug. DT.2 <- DT.1 doesn't seem to make a copy in
> > all cases.
> >
> > > DT.1 <- data.table(x=1, y=1)
> > > DT.2 <- DT.1
> > >
> > > # Both DT.1 and DT.2 are changed.
> > > DT.2[, y := NULL]
> >      x
> > [1,] 1
> > > DT.1
> >      x
> > [1,] 1
> > > DT.2
> >      x
> > [1,] 1
> > >
> > > # Only DT.2 is changed
> > > DT.2[, y := x]
> >      x y
> > [1,] 1 1
> > > DT.1
> >      x
> > [1,] 1
> > > DT.2
> >      x y
> > [1,] 1 1

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
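
The "DT2=copy(DT)" habit suggested in the thread can be sketched as
follows. This is only a minimal illustration, assuming the data.table
package is installed; it reuses the DT and DT2 names from the example in
the thread, and the stopifnot() checks are added here purely for
illustration. Because the deep copy is taken up front, the result does not
depend on whether := happens to copy.

```r
library(data.table)

DT  <- data.table(1:2, 3:4)   # columns are auto-named V1 and V2
DT2 <- copy(DT)               # deep copy taken now, not on first write

DT2[, y := 10L]               # add a column to the copy
DT2[1, V1 := 99L]             # modify a value in the copy, by reference

# The original is unaffected by either operation.
stopifnot(identical(names(DT), c("V1", "V2")))
stopifnot(identical(DT$V1, 1:2))
```

With plain assignment (DT2 = DT) in place of copy(), the thread shows that
the second operation can also change DT.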
