Ditto: allowing dups, but raising an error on every ambiguous operation, seems like a robust strategy.
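
To make that concrete, here is a minimal base-R sketch of the fail-fast idea. The helper `check_unique_names` is hypothetical (it is not a data.table function, and this is not how data.table behaves today); it only illustrates erroring on duplicated names before any computation, while still allowing creation and renaming.

```r
# Hypothetical helper (an assumption for illustration, not part of data.table):
# fail fast when column names are duplicated, since any by-name operation
# would then be ambiguous.
check_unique_names <- function(x) {
  nm <- names(x)
  dups <- unique(nm[duplicated(nm)])
  if (length(dups) > 0L) {
    stop("ambiguous operation: duplicated column name(s): ",
         paste(dups, collapse = ", "), call. = FALSE)
  }
  invisible(x)
}

# Creating a table with duplicated names is allowed.
DF <- data.frame(x = 1:3, x = 4:6, y = 7:9, check.names = FALSE)

# Base R silently picks the first "x" here, with no warning or error.
first_x <- DF$x

# The strict alternative: computations go through the check and error out.
res <- tryCatch(check_unique_names(DF),
                error = function(e) conditionMessage(e))

# Once the user disambiguates the names, the same check passes.
names(DF) <- make.unique(names(DF))
check_unique_names(DF)
```

The point of the design is that the user, not the library, decides which "x" is meant; the library only refuses to guess.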

On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou <[email protected]> wrote:

> Hi,
>
> I wanted to point out that I'm in Arun's camp on this one:
>
> On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan
> <[email protected]> wrote:
>
> > In my opinion, the dup-names should be allowed *only* during creation of
> > data.table, and setting names (using `setnames`, `setattr` or the bad
> > form `names(dt) <- `). Other than that, *ALL* operations should fail
> > (end up in error), and that includes subsetting operation. The
> > `setnames` gives the option for the user to set the names back before
> > writing to a file, should he choose to keep it at the end.
> >
> > I think it's much better this way (strict, but avoids confusion). For
> > example, in data.frames, doing DF$x (when x occurs twice) implicitly
> > prints only the first (no warning/error). Also, split(DF$x, DF$x) uses
> > the first column and so does split(DF, DF$x).
>
> As an opinionated footnote: I can acquiesce that since data.frames
> allow duplicated column names, I *guess* data.table should *allow*
> them, however as is clear (to me) from this long chain of
> "possibilities" that one can do, I strongly feel that computing over a
> data.table w/ duplicated columns is a fundamentally broken idea as it
> is ambiguous as to what the right behavior should be ... forget about
> even the (surely fun) book-keeping code required to make it happen.
>
> You want to import a table with duplicate names? Fine (we should warn
> on import if it was `fread` or `as.data.table`d).
>
> You want to set some names to duplicates? Fine -- warn there too.
>
> Want to do any computation inside the data.table via `j` or as a
> column in `by`? Throw an error and punt the problem to the user to
> figure out how they would like to disambiguate the first column named
> "a" from the 10th one -- I don't think we need another FAQ explaining
> what "the right" way that this should be done is, and why we picked
> it.
>
> Or if you really want to compute over a data.table with duplicate
> names, you might be better served by having the table in "long" format
> -- perhaps that's why there are duplicate column names to begin with
> (I'm guessing -- I still don't think I would ever want to have duped
> names on purpose)
>
> My two cents,
>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
