> Now the issue you ran into is that you didn't realize that you were using
> non-standard naming (or even wanted to, but we can't guess what you want :)).
>
I have to disagree here. This is one of the things `data.table` is *extremely*
good at: its error/warning messages are precise and, under most circumstances,
provide the solution (what you ought to do) by spotting the mistake exactly.

> Once you understand that there is nothing wrong with duplicate names, it
> should be clear that the appropriate warning spot is when you use them
> potentially incorrectly, and not when you set them.

This, I believe, is also not entirely true, at least in this scenario. For
example, assigning to the same column twice using `:=` raises an error:

    > DT <- data.table(x=1:5, y=6:10)
    > DT[, c("y", "y") := 1L]
    Error in `[.data.table`(DT, , `:=`(c("y", "y"), 1L)) :
      Can't assign to the same column twice in the same query (duplicates detected).

So it's only natural to expect a warning/error in other cases as well. In
general, prevention is better: it's nicer to catch the problem early and emit a
warning/error than to let it through and only catch it later.
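
A minimal sketch of what such an early check could look like, assuming a
hypothetical wrapper `setnames_checked()` (not part of data.table) around
`setnames()`. It warns when a rename would collide with the name of a column
that is not being renamed, using only `names()`, `setdiff()` and `duplicated()`
from base R:

```r
library(data.table)

# Hypothetical helper (not part of data.table): rename columns, but warn
# first if any new name collides with a column that keeps its current name,
# or if the new names themselves contain duplicates.
setnames_checked <- function(dt, old, new) {
  stopifnot(length(old) == length(new))
  untouched <- setdiff(names(dt), old)   # columns that are not being renamed
  clash <- unique(c(new[new %in% untouched], new[duplicated(new)]))
  if (length(clash) > 0L) {
    warning("setnames would create duplicate column name(s): ",
            paste(clash, collapse = ", "))
  }
  setnames(dt, old, new)                 # rename by reference, as usual
  invisible(dt)
}

d <- data.table(a = 1, b = 2, c = 3)
setnames_checked(d, "a", "b")   # warns about "b" before renaming
anyDuplicated(names(d)) > 0L    # base R check for duplicate names afterwards
```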

Overall, I agree that keeping duplicate names may help some users. But then the
potential side effects should be flagged distinctly with warnings/errors in all
cases (and preferably documented). Ex: grouping/aggregating is one such
scenario (Ricardo's bug report) where we cannot possibly know which column to
use.

Arun
On Saturday, November 2, 2013 at 4:30 PM, Eduard Antonyan wrote:
> Thanks Alexandre. I added (a non-committal) FR about this -
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5037&group_id=240&atid=978,
> which will likely go in the direction this thread goes.
> To address your points:
> 1. If a user decides to have columns with duplicate names, yes, their job
> will become harder, but that's a user decision, and everyone else who doesn't
> use duplicate names does not lose flexibility and doesn't need to use column
> numbers or whatnot.
> 2. I agree that this should be documented better and appropriate warnings
> should be added.
> One of the cool things about data.table that's very different from data.frame
> is that you can have arbitrary column names. Whether they include spaces,
> crazy symbols or are duplicate - it'll all be valid. This is very useful for
> reading and writing/presenting arbitrary data.
> This does mean, though, that if (and *only* if) you choose to use
> non-standard names you'll need to do more work.
> Now the issue you ran into is that you didn't realize that you were using
> non-standard naming (or even wanted to, but we can't guess what you want :)).
> And a warning in the right place can help you out and also let non-standard
> users proceed.
> Once you understand that there is nothing wrong with duplicate names, it
> should be clear that the appropriate warning spot is when you use them
> potentially incorrectly, and not when you set them.
> For reference, there are a *lot* of different ways to get duplicate names. To
> name a few besides setnames and creating one straight up: cbinding similarly
> named data.tables, merging, having default-named columns and grouping (e.g.
> dt[, sum(smth), by = V1]), freading, etc.
> My 2 cents here.
>
> There are several reasons why I don’t think, IMHO, allowing multiple columns
> with the same name is a good idea:
>
> - It will force the code to use column numbers to access all the data in a
> predictable fashion (since depending on your code you might not know which of
> the two columns with the same name will be the first), so we’ll lose all the
> delicious syntactic sugar painstakingly added to data.table.
>
> - For people learning data.table and having data.frame or even the concept of
> a relational table as a reference, this is a definite WTF and will cause
> confusion and complicate troubleshooting. I speak from experience on this
> matter. :)
>
> Even though there might be some situations where this might be a plus, I
> imagine they are few and far between and could be worked around. I could be
> wrong, it’s been known to happen :) - but I have never seen and can’t even
> imagine a situation where multiple columns with the same name would be
> essential. So in the balance I consider keeping this behavior as a bad
> trade-off for most users.
>
> Having said that, this is a design decision and it's up to the data.table
> demigods to decide. :)
>
> BTW, is there any part of the data.table documentation that covers this? If
> you choose to maintain this property, I strongly suggest it be documented
> somewhere that most beginners would read.
>
> In my personal example, I ran into this problem after a rather long
> troubleshooting of a very esoteric problem that was happening in my code. I
> was renaming a column to a name that already existed, and this broke things
> in a completely different part of my code. If ‘setnames()’ had at least
> warned me that a duplicate column name was created, I would have been able to
> detect the source cause much faster.
>
> --
> Alexandre Sieira
> CISA, CISSP, ISO 27001 Lead Auditor
>
> "The truth is rarely pure and never simple."
> Oscar Wilde, The Importance of Being Earnest, 1895, Act I
>
> On 1 November 2013 at 21:10:45, Arunkumar Srinivasan
> ([email protected] (mailto://[email protected])) wrote:
>
> > Hm, I've not encountered that use myself, so I can't comment there. Probably
> > then it should be allowed everywhere except where deciding which column to
> > use could be an issue? Ex: subsetting/aggregating/grouping/by-without-by
> > etc. should result in an error (if one has the time, one could do this by
> > checking whether the duplicate column is actually in use and then issuing
> > an error/warning).
> >
> > At the moment, I'm not convinced that it's worth that much trouble to help
> > data presentation.
> >
> > Arun
> >
> >
> > On Saturday, November 2, 2013 at 12:05 AM, Eduard Antonyan wrote:
> >
> > > Because it's very useful for e.g. data presentation purposes.
> > >
> > >
> > > On Fri, Nov 1, 2013 at 6:02 PM, Arunkumar Srinivasan
> > > <[email protected] (mailto:[email protected])> wrote:
> > > > Yes, it chooses the first. But we won't be able to perform any
> > > > operation as intended. So why allow duplicate names (ex: in `setnames`
> > > > as Alexandre asks)?
> > > >
> > > > Arun
> > > >
> > > >
> > > > On Friday, November 1, 2013 at 11:57 PM, Eduard Antonyan wrote:
> > > >
> > > > > I think currently it chooses the first "x", but it's definitely a
> > > > > good idea to add a warning there.
> > > > >
> > > > >
> > > > > On Fri, Nov 1, 2013 at 5:51 PM, Arunkumar Srinivasan
> > > > > <[email protected] (mailto:[email protected])> wrote:
> > > > > > Ricardo added a bug report here on this topic:
> > > > > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975
> > > > > >
> > > > > > But I don't think having duplicate names is an easy-to-implement
> > > > > > concept. For ex:
> > > > > >
> > > > > > dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))
> > > > > > dt[, print(.SD), by=y]
> > > > > > x
> > > > > > 1: 1
> > > > > > 2: 2
> > > > > > x
> > > > > > 1: 3
> > > > > >
> > > > > >
> > > > > > .SD loses the second "x". Also, some other questions become
> > > > > > difficult to handle. Ex:
> > > > > >
> > > > > > dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))
> > > > > > dt[, list(x=x/x[1], y=y), by=x]
> > > > > >
> > > > > >
> > > > > > > Which "x" should be chosen for which operation?
> > > > > >
> > > > > > Arun
> > > > > >
> > > > > >
> > > > > > On Friday, November 1, 2013 at 10:59 PM, Eduard Antonyan wrote:
> > > > > >
> > > > > > > > Having duplicate names is allowed and not that unusual in the
> > > > > > > > data.table framework, so there is no need to signal anything
> > > > > > > > here.
> > > > > > >
> > > > > > > A different question is whether there should be a warning here:
> > > > > > >
> > > > > > > dt = data.table(a = 1, a = 2)
> > > > > > > dt[, a]
> > > > > > >
> > > > > > > and I think that'd be a pretty good FR to have.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Nov 1, 2013 at 4:49 PM, Alexandre Sieira
> > > > > > > <[email protected] (mailto:[email protected])>
> > > > > > > wrote:
> > > > > > > > I found this behavior during a debugging session:
> > > > > > > >
> > > > > > > > > d = data.table(a=1, b=2, c=3)
> > > > > > > > > setnames(d, "a", "b")
> > > > > > > > > d
> > > > > > > > b b c
> > > > > > > > 1: 1 2 3
> > > > > > > >
> > > > > > > > Shouldn’t setnames() check if the new column names already
> > > > > > > > exist before renaming, and signal an error or at least a
> > > > > > > > warning if they do?
> > > > > > > > --
> > > > > > > > Alexandre Sieira
> > > > > > > > CISA, CISSP, ISO 27001 Lead Auditor
> > > > > > > >
> > > > > > > > "The truth is rarely pure and never simple."
> > > > > > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I
> > > > > > > > _______________________________________________
> > > > > > > > datatable-help mailing list
> > > > > > > > [email protected]
> > > > > > > > (mailto:[email protected])
> > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > datatable-help mailing list
> > > > > > > [email protected]
> > > > > > > (mailto:[email protected])
> > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >