Wow ... did we just reach a consensus? :-)

-steve

On Fri, Nov 8, 2013 at 12:08 PM, Eduard Antonyan <[email protected]> wrote:
> Ditto - having dups, but spitting out an error on all ambiguous operations
> seems like a robust strategy.
>
> On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou <[email protected]> wrote:
>>
>> Hi,
>>
>> I wanted to point out that I'm in Arun's camp on this one:
>>
>> On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan
>> <[email protected]> wrote:
>>
>> > In my opinion, the dup-names should be allowed *only* during creation of
>> > data.table, and setting names (using `setnames`, `setattr` or the bad
>> > form `names(dt) <- `). Other than that, *ALL* operations should fail
>> > (end up in error), and that includes subsetting operation. The
>> > `setnames` gives the option for the user to set the names back before
>> > writing to a file, should he choose to keep it at the end.
>> >
>> > I think it's much better this way (strict, but avoids confusion). For
>> > example, in data.frames, doing DF$x (when x occurs twice) implicitly
>> > prints only the first (no warning/error). Also, split(DF$x, DF$x) uses
>> > the first column and so does split(DF, DF$x).
>>
>> As an opinionated footnote: I can acquiesce that since data.frames
>> allow duplicated column names, I *guess* data.table should *allow*
>> them, however as is clear (to me) from this long chain of
>> "possibilities" that one can do, I strongly feel that computing over a
>> data.table w/ duplicated columns is a fundamentally broken idea as it
>> is ambiguous as to what the right behavior should be ... forget about
>> even the (surely fun) book-keeping code required to make it happen.
>>
>> You want to import a table with duplicate names? Fine (we should warn
>> on import if it was `fread` or `as.data.table`d).
>>
>> You want to set some names to duplicates? Fine -- warn there too.
>>
>> Want to do any computation inside the data.table via `j` or as a
>> column in `by`? Throw an error and punt the problem to the user to
>> figure out how they would like to disambiguate the first column named
>> "a" from the 10th one -- I don't think we need another FAQ explaining
>> what "the right" way that this should be done is, and why we picked
>> it.
>>
>> Or if you really want to compute over a data.table with duplicate
>> names, you might be better served by having the table in "long" format
>> -- perhaps that's why there are duplicate column names to begin with
>> (I'm guessing -- I still don't think I would ever want to have duped
>> names on purpose)
>>
>> My two cents,
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Computational Biologist
>> Bioinformatics and Computational Biology
>> Genentech
>> _______________________________________________
>> datatable-help mailing list
>> [email protected]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
