On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:
> On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle <[email protected]> wrote:
> > On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> > > From the user's perspective, DT2 <- DT should either be a new copy
> > > or a new reference. Anything in between is confusing.
> >
> > Agreed. With picky caveat: even in base it's not at this point the
> > copy is taken. It's later: copy-on-write. It's setkey and := that
> > don't copy on write, not the (earlier) <-.
>
> Hmm, I would prefer for these to have the same behavior.

Not sure I follow, please expand.

> > > How about this - add a new argument to data.table(), say max.cols.
> > > max.cols defaults to a couple orders of magnitude above the initial
> > > number of columns. data.table allocates enough memory for max.cols
> > > column pointers. If you try to add more than max.cols columns, it
> > > is either an error, or it creates a copy and produces a warning.
> >
> > Very nice idea. To over allocate by default so that := can add
> > columns fully by reference most of the time seems good to me since
> > there's a very low cost to over allocating the vector of column
> > pointers. Create the (shallow copy) and issue a warning, I'm
> > thinking, not error. The "max.cols" name seems a bit absolute,
> > could it be "alloc.cols"? We could have alloc(DT,2,ncol) or
> > rowalloc(DT,n) and colalloc(DT,n), or realloc(...) so users can
> > over alloc themselves before a loop that adds columns or inserts
> > rows. tables() could also report truenrow, and truencol as well as
> > nrow and ncol. What should alloc.cols be, by default? How about:
> > max(100,2*ncol)
>
> Fine with me.
>
> > What about as.data.table.data.frame()? Should that over-allocate,
> > too, or for speed just change the class attribute as it does now.
>
> Yeah, I think any method of creating a data table should
> over-allocate. If people want the speed gains, they can explicitly
> set alloc.cols.
>
> > Maybe checking NAMED would work, in addition. If NAMED was 0, no
> > need to warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED)
> > - would the warning be necessary.
> >
> > > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle <[email protected]> wrote:
> > > > Interesting one. Adding columns is a bit different to deleting and
> > > > modifying columns. Here's how it works. Could make changes, could
> > > > document it, or both, what do people think?
> > > >
> > > > Just like data.frame there is a list vector holding pointers to the
> > > > column vectors.
> > > > A delete column op is done with a memmove to budge up the column
> > > > pointers above the column by one place. That leaves a gap at the
> > > > end. The length attribute of that vector (ncol(DT)) is then
> > > > decremented and the spare 4 bytes (or 8 on 64bit) are left unused
> > > > at the end.
> > > >
> > > > An add column can't be fully by reference because the list vector
> > > > is full. A new list vector has to be allocated, one slot larger,
> > > > the old pointers memcpy'd over, and the last spot assigned the
> > > > pointer to the new column vector. This copying is negligible
> > > > because it's a small list of pointers fitting well within one
> > > > page. [Unless, there are many 1000's of columns, which is why it's
> > > > done as efficiently as possible using memcpy].
> > > >
> > > > Aside : There is a little known (I guess) distinction between
> > > > length and truelength in R internals. Base R doesn't use it, but
> > > > we could in data.table. A delete column sets length but leaves
> > > > truelength one larger. When the next add column comes along, it
> > > > could just do the budge up and insert the column. That may not be
> > > > so advantageous for (a small number) of columns, but the same
> > > > logic could work for insert() and delete()ing rows. Of course,
> > > > this would mean whether a visible copy or not is taken depends on
> > > > what happened previously, rather than the syntax. That's something
> > > > we've disliked before, in the same way we dislike drop=TRUE
> > > > behaviour and so dropped drop. One way to approach this might be
> > > > to advise ":= add *may* not copy. Best to assume it doesn't; use
> > > > copy()". If you get in the habit of "DT2=copy(DT)" then that'll
> > > > take a deep copy at the time and you're safe.
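
The length versus truelength bookkeeping described above can be sketched in plain R. This is only a toy model, not data.table's C code: an environment stands in for by-reference behaviour, and the names new_slots, add_col and del_col are hypothetical. The default capacity borrows the max(100, 2*ncol) suggestion from earlier in the thread.

```r
# Toy model of an over-allocated column-pointer vector (base R only).
# All function and field names are made up for illustration.

new_slots <- function(cols, alloc = max(100, 2 * length(cols))) {
  s <- new.env()                      # environments are passed by reference
  s$ptr <- vector("list", alloc)      # all allocated slots ("truelength")
  s$nm  <- character(alloc)           # column names, same capacity
  s$len <- length(cols)               # slots in use ("length", i.e. ncol)
  s$ptr[seq_along(cols)] <- cols
  s$nm[seq_along(cols)]  <- names(cols)
  s
}

add_col <- function(s, name, value) {
  if (s$len == length(s$ptr)) {
    # No spare slot left: grow the vectors and tell the user
    warning("slot vector full; reallocating")
    grow  <- max(1, length(s$ptr))
    s$ptr <- c(s$ptr, vector("list", grow))
    s$nm  <- c(s$nm, character(grow))
  }
  s$len <- s$len + 1                  # use the next spare slot
  s$ptr[[s$len]] <- value
  s$nm[s$len] <- name
  invisible(s)
}

del_col <- function(s, name) {
  i <- match(name, s$nm[seq_len(s$len)])
  stopifnot(!is.na(i))
  if (i < s$len) {
    keep <- (i + 1):s$len
    s$ptr[i:(s$len - 1)] <- s$ptr[keep]   # "budge up", the memmove analogue
    s$nm[i:(s$len - 1)]  <- s$nm[keep]
  }
  s$ptr[s$len] <- list(NULL)          # clear the freed slot at the end
  s$nm[s$len]  <- ""
  s$len <- s$len - 1                  # length shrinks, capacity does not
  invisible(s)
}

s  <- new_slots(list(x = 1, y = 1))
s2 <- s                               # second name for the same object
del_col(s2, "y")
add_col(s2, "z", 2)
s$nm[seq_len(s$len)]                  # "x" "z": both names see both changes
length(s$ptr)                         # 100 slots allocated, 2 in use
```

With spare slots available, delete and add behave the same way through either name, which is the consistency asked for at the top of the thread.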
> > > >
> > > > To illustrate the partial (maybe shallow copy is better word),
> > > > consider the following :
> > > >
> > > >   > DT = data.table(1:2,3:4)
> > > >   > DT2=DT
> > > >   > DT2[,y:=10L]
> > > >        V1 V2  y
> > > >   [1,]  1  3 10
> > > >   [2,]  2  4 10
> > > >   > DT
> > > >        V1 V2
> > > >   [1,]  1  3
> > > >   [2,]  2  4
> > > >   > DT2
> > > >        V1 V2  y
> > > >   [1,]  1  3 10
> > > >   [2,]  2  4 10
> > > >   > DT2[1,V1:=99L]
> > > >        V1 V2  y
> > > >   [1,] 99  3 10
> > > >   [2,]  2  4 10
> > > >   > DT
> > > >        V1 V2
> > > >   [1,] 99  3
> > > >   [2,]  2  4
> > > >
> > > > Matthew
> > > >
> > > > On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
> > > > > I think this is a bug. DT.2 <- DT.1 doesn't seem to make a copy
> > > > > in all cases.
> > > > >
> > > > >   > DT.1 <- data.table(x=1, y=1)
> > > > >   > DT.2 <- DT.1
> > > > >   >
> > > > >   > # Both DT.1 and DT.2 are changed.
> > > > >   > DT.2[, y := NULL]
> > > > >        x
> > > > >   [1,] 1
> > > > >   > DT.1
> > > > >        x
> > > > >   [1,] 1
> > > > >   > DT.2
> > > > >        x
> > > > >   [1,] 1
> > > > >   >
> > > > >   > # Only DT.2 is changed
> > > > >   > DT.2[, y := x]
> > > > >        x y
> > > > >   [1,] 1 1
> > > > >   > DT.1
> > > > >        x
> > > > >   [1,] 1
> > > > >   > DT.2
> > > > >        x y
> > > > >   [1,] 1 1

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
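
For comparison with the transcripts quoted above, here is a minimal sketch of the "DT2=copy(DT)" habit. It assumes an installed data.table in which copy() and := behave as discussed in this thread; the column names V1 and V2 are set explicitly here instead of relying on defaults.

```r
library(data.table)

DT  <- data.table(V1 = 1:2, V2 = 3:4)
DT2 <- copy(DT)          # deep copy taken here, up front

DT2[, y := 10L]          # adds a column to DT2 only
DT2[1, V1 := 99L]        # changes DT2's own V1 vector

DT$V1                    # still 1 2
names(DT)                # still "V1" "V2"
DT2$V1                   # 99 2
```

Because the copy is taken before any := call, the result does not depend on whether a later add or delete happens to reallocate the column-pointer vector.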
