Re: [datatable-help] Unexpected behavior in setnames()

Arunkumar Srinivasan Sat, 02 Nov 2013 08:42:36 -0700

> Now the issue you ran into is that you didn't realize that you were using 
> non-standard naming (or even wanted to, but we can't guess what you want :)). 
>


I have to disagree here. This is one of the things `data.table` is *extremely* 
good at. The error/warning messages are precise and, under most circumstances, 
provide the solution (what you ought to do) by spotting the mistake 
exactly.

> Once you understand that there is nothing wrong with duplicate names, it 
> should be clear that the appropriate warning spot is when you use them 
> potentially incorrectly, and not when you set them.


This, I believe, is also not entirely true, at least in this scenario. For 
example, an error is raised when assigning to the same column twice using 
`:=`:


> library(data.table)
> DT <- data.table(x=1:5, y=6:10)
> DT[, c("y", "y") := 1L]
Error in `[.data.table`(DT, , `:=`(c("y", "y"), 1L)) :  
  Can't assign to the same column twice in the same query (duplicates detected).


So it's only natural to expect a warning/error in other cases as well. In 
general, prevention is better: it's nicer to catch the problem early and raise 
a warning/error than to let it through and only catch it later.

Overall, I agree that keeping duplicate names may help some users. But then 
the potential side effects should be flagged distinctly with warnings/errors in 
all cases (and preferably documented). Ex: grouping/aggregating is one such 
scenario (Ricardo's bug report), where we cannot possibly know which column to 
use.
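
As a rough illustration of the kind of early check meant here, a wrapper along 
these lines would flag the problem at the point of renaming. Note that 
`setnames_checked` is a hypothetical name, not part of data.table; it is only a 
sketch that calls the existing `setnames()` and then inspects the result with 
base R's `duplicated()`.

```r
library(data.table)

# Hypothetical helper (NOT part of data.table): rename columns by reference
# via setnames(), then warn if the table now has duplicate column names.
setnames_checked <- function(x, old, new) {
  setnames(x, old, new)
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups) > 0L) {
    warning("duplicate column names after renaming: ",
            paste(dups, collapse = ", "), call. = FALSE)
  }
  invisible(x)
}

# Alexandre's example: renaming "a" to the already existing name "b"
d <- data.table(a = 1, b = 2, c = 3)
setnames_checked(d, "a", "b")   # the rename still happens, but with a warning
```

This still lets users who want duplicate names proceed, since it only warns 
and does not stop the rename.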

Arun


On Saturday, November 2, 2013 at 4:30 PM, Eduard Antonyan wrote:

> Thanks Alexandre. I added (a non-committal) FR about this - 
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5037&group_id=240&atid=978,
>  which will likely go in the direction this thread goes.
> To address your points:  
>     1. If user decides to have column with duplicate names, yes, their job 
> will become harder, but that's a user decision and everyone else who doesn't 
> use duplicate names does not lose flexibility and doesn't need to use column 
> numbers or whatnot.  
>     2. I agree that this should be documented better and appropriate warnings 
> should be added.  
> One of the cool things about data.table that's very different from data.frame 
> is that you can have arbitrary column names. Whether they include spaces, 
> crazy symbols or are duplicate - it'll all be valid. This is very useful for 
> reading and writing/presenting arbitrary data.
> This does mean though that if (and *only* if) you choose to use non standard 
> names you'll need to do more work.
> Now the issue you ran into is that you didn't realize that you were using 
> non-standard naming (or even wanted to, but we can't guess what you want :)). 
> And a warning in the right place can help you out and also let non-standard 
> users proceed.  
> Once you understand that there is nothing wrong with duplicate names, it 
> should be clear that the appropriate warning spot is when you use them 
> potentially incorrectly, and not when you set them.
> For reference there are a *lot* of different ways to get duplicate names, to 
> name a few besides setnames and creating one straight up - cbinding similarly 
> named data.tables, merging, having default named columns and grouping (e.g. 
> dt[, sum(smth), by = V1]), freading, etc.
> My 2 cents here.
>  
> There are several reasons why I don’t think, IMHO, allowing multiple columns 
> with the same name is a good idea:  
>   
> - It will force the code to use column numbers to access all the data in a 
> predictable fashion (since depending on your code you might not know which of 
> the two columns with the same name will be the first), so we’ll lose all the 
> delicious syntactic sugar painstakingly added to data.table.
>  
> - For people learning data.table and having data.frame or even the concept of 
> a relational table as a reference, this is a definite WTF and will cause 
> confusion and complicate troubleshooting. I speak from experience on this 
> matter. :)  
>  
> Even though there might be some situations where this might be a plus, I 
> imagine they are few and far between and could be worked around. I could be 
> wrong, it’s been known to happen :) - but I have never seen and can’t even 
> imagine a situation where multiple columns with the same name would be 
> essential. So in the balance I consider keeping this behavior as a bad 
> trade-off for most users.  
>  
> Having said that, this is a design decision and it's up to the data.table 
> demigods to decide. :)
>  
> BTW, is there any part of the data.table documentation that covers this? If 
> you choose to maintain this property, I strongly suggest it be documented 
> somewhere that most beginners would read.
>  
> In my personal example, I ran into this problem after a rather long 
> troubleshooting of a very esoteric problem that was happening in  my code. I 
> was renaming a column to a name that already existed, and this broke things 
> in a completely different part of my code. If ‘setnames()’ had at least 
> warned me that a duplicate column name was created, I would have been able to 
> detect the source cause much faster.  
>  
> --  
> Alexandre Sieira
> CISA, CISSP, ISO 27001 Lead Auditor
>  
> "The truth is rarely pure and never simple."
> Oscar Wilde, The Importance of Being Earnest, 1895, Act I  
>  
> On November 1, 2013 at 21:10:45, Arunkumar Srinivasan wrote:
>  
> > Hm, I've not encountered that use myself, can't comment there. Probably 
> > then it should be allowed everywhere except where deciding which column 
> > could be an issue? Ex: subsetting/aggregating/grouping/by-without-by etc.. 
> > should result in error (if one has the time, one could do this by checking 
> > if the duplicate column is in use actually or not and then issue an 
> > error/warning).  
> >  
> > At the moment, I'm not convinced that it's worth that much trouble to help 
> > data presentation.  
> >  
> > Arun  
> >  
> >  
> > On Saturday, November 2, 2013 at 12:05 AM, Eduard Antonyan wrote:
> >  
> > > Because it's very useful for e.g. data presentation purposes.
> > >  
> > >  
> > > On Fri, Nov 1, 2013 at 6:02 PM, Arunkumar Srinivasan wrote:
> > > > Yes, it chooses the first. But we won't be able to perform any 
> > > > operation as intended. So why allow duplicate names (ex: in `setnames` 
> > > > as Alexandre asks)?  
> > > >  
> > > > Arun  
> > > >  
> > > >  
> > > > On Friday, November 1, 2013 at 11:57 PM, Eduard Antonyan wrote:
> > > >  
> > > > > I think currently it chooses the first "x", but it's definitely a 
> > > > > good idea to add a warning there.
> > > > >  
> > > > >  
> > > > > On Fri, Nov 1, 2013 at 5:51 PM, Arunkumar Srinivasan wrote:
> > > > > > Ricardo added a bug report here on this topic: 
> > > > > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975
> > > > > >   
> > > > > > But I don't think having duplicate names is an easy-to-implement 
> > > > > > concept. For ex:
> > > > > >  
> > > > > > dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))  
> > > > > > dt[, print(.SD), by=y]
> > > > > >    x
> > > > > > 1: 1
> > > > > > 2: 2
> > > > > >    x
> > > > > > 1: 3
> > > > > >  
> > > > > >  
> > > > > > .SD loses the second "x". Also, some other questions become 
> > > > > > difficult to handle. Ex:   
> > > > > >  
> > > > > > dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))  
> > > > > > dt[, list(x=x/x[1], y=y), by=x]
> > > > > >  
> > > > > >  
> > > > > > Which "x" should we choose for which operation?  
> > > > > >  
> > > > > > Arun  
> > > > > >  
> > > > > >  
> > > > > > On Friday, November 1, 2013 at 10:59 PM, Eduard Antonyan wrote:
> > > > > >  
> > > > > > > Having duplicate names is allowed and not that unusual in 
> > > > > > > data.table framework, so there is no need to signal anything 
> > > > > > > here.  
> > > > > > >  
> > > > > > > A different question is whether there should be a warning here:  
> > > > > > >  
> > > > > > >   dt = data.table(a = 1, a = 2)  
> > > > > > >   dt[, a]
> > > > > > >  
> > > > > > > and I think that'd be a pretty good FR to have.  
> > > > > > >  
> > > > > > >  
> > > > > > > On Fri, Nov 1, 2013 at 4:49 PM, Alexandre Sieira wrote:
> > > > > > > > I found this behavior during a debugging session:   
> > > > > > > >  
> > > > > > > > > d = data.table(a=1, b=2, c=3)  
> > > > > > > > > setnames(d, "a", "b")
> > > > > > > > > d
> > > > > > > >    b b c
> > > > > > > > 1: 1 2 3
> > > > > > > >  
> > > > > > > > Shouldn’t setnames() check if the new column names already 
> > > > > > > > exist before renaming, and signal an error or at least a 
> > > > > > > > warning if they do?  
> > > > > > > > --   
> > > > > > > > Alexandre Sieira
> > > > > > > > CISA, CISSP, ISO 27001 Lead Auditor
> > > > > > > >  
> > > > > > > > "The truth is rarely pure and never simple."
> > > > > > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I  
> > > > > > > > _______________________________________________
> > > > > > > > datatable-help mailing list
> > > > > > > > [email protected] 
> > > > > > > > (mailto:[email protected])
> > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > >  
