My 2 cents here.

There are several reasons why I don’t think, IMHO, allowing multiple columns 
with the same name is a good idea:
 
- It will force the code to use column numbers to access all the data in a 
predictable fashion (since, depending on your code, you might not know which of 
the two columns with the same name will be the first), so we’ll lose all the 
delicious syntactic sugar painstakingly added to data.table.

- For people learning data.table and having data.frame or even the concept of a 
relational table as a reference, this is a definite WTF and will cause 
confusion and complicate troubleshooting. I speak from experience on this 
matter. :)
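
To illustrate the first point, here is a minimal sketch using the same kind 
of table that appears in the quoted messages below. The variable names are 
only for illustration. It shows that access by position stays unambiguous 
while access by name cannot distinguish the two columns, and how duplicated 
names can be listed with base R.

```r
library(data.table)

# Two columns deliberately share the name "x".
dt <- data.table(x = 1:3, x = 4:6, y = c(1, 1, 2))

# Name-based extraction cannot tell the two "x" columns apart:
# it can only ever return one of them.
by_name <- dt[["x"]]

# Position-based extraction is unambiguous, but the code now
# depends on the column order.
first_x  <- dt[[1]]
second_x <- dt[[2]]

# List every column name that occurs more than once.
dup_names <- unique(names(dt)[duplicated(names(dt))])
```

Any code that needs the second "x" has to track its position, which is 
exactly the loss of syntactic sugar described above.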

Even though there might be some situations where this is a plus, I imagine 
they are few and far between and could be worked around. I could be wrong, 
it’s been known to happen :) - but I have never seen and can’t even imagine a 
situation where multiple columns with the same name would be essential. So on 
balance I consider keeping this behavior a bad trade-off for most users.

Having said that, this is a design decision and it's up to the data.table 
demigods to decide. :)

BTW, is there any part of the data.table documentation that covers this? If you 
choose to maintain this property, I strongly suggest it be documented somewhere 
that most beginners would read.

In my case, I ran into this after a rather long troubleshooting session on a 
very esoteric problem in my code. I was renaming a column to a name that 
already existed, and this broke things in a completely different part of my 
code. If ‘setnames()’ had at least warned me that a duplicate column name was 
created, I would have been able to find the root cause much faster.
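
As a possible stopgap in user code, something like the sketch below would 
have caught my problem. Note that setnames_checked() is a hypothetical 
wrapper of my own naming, not part of data.table; it just calls setnames() 
and then warns if the resulting names contain duplicates.

```r
library(data.table)

# Hypothetical helper (NOT part of data.table): rename columns by reference
# with setnames(), then warn if the table now has duplicate column names.
setnames_checked <- function(x, old, new) {
  setnames(x, old, new)
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups) > 0L) {
    warning("duplicate column name(s) after renaming: ",
            paste(dups, collapse = ", "))
  }
  invisible(x)
}

d <- data.table(a = 1, b = 2, c = 3)

# Capture the warning message instead of printing it, for demonstration.
# The rename itself has already been applied by reference at this point.
msg <- tryCatch(setnames_checked(d, "a", "b"),
                warning = function(w) conditionMessage(w))
```

The check runs after the rename, so the table is still modified; a stricter 
variant could compute the new names first and stop before touching the table.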

-- 
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I

On November 1, 2013 at 21:10:45, Arunkumar Srinivasan 
([email protected]) wrote:

Hm, I've not encountered that use myself, so I can't comment there. Then 
perhaps it should be allowed everywhere except where choosing between the 
duplicate columns is ambiguous? E.g., subsetting, aggregating, grouping, 
by-without-by, etc. should result in an error (if one has the time, one could 
do this by checking whether the duplicate column is actually in use and then 
issuing an error or warning). 

At the moment, I'm not convinced that it's worth that much trouble to help data 
presentation.
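
To make the "check if the duplicate column is actually in use" idea concrete, 
here is a rough sketch in base R. The function ambiguous_vars() is a 
hypothetical name, and this is not how data.table handles j internally; it 
only inspects the symbols in an expression and ignores by=, .SD and similar 
special cases.

```r
# Hypothetical helper: return the duplicated column names that an
# unevaluated expression refers to.
ambiguous_vars <- function(col_names, expr) {
  dups <- unique(col_names[duplicated(col_names)])
  # all.vars() lists the variable names used in the expression;
  # argument names such as the "x =" in list(x = ...) are not included.
  intersect(all.vars(expr), dups)
}

cols <- c("x", "y", "x")

# Refers to the duplicated "x", so an error or warning would be warranted.
hits_dup <- ambiguous_vars(cols, quote(list(x = x / x[1], y = y)))

# Uses only "y", so the duplicate is harmless here.
hits_none <- ambiguous_vars(cols, quote(sum(y)))
```

A non-empty result would be the trigger for the error or warning.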

Arun

On Saturday, November 2, 2013 at 12:05 AM, Eduard Antonyan wrote:

Because it's very useful for e.g. data presentation purposes.


On Fri, Nov 1, 2013 at 6:02 PM, Arunkumar Srinivasan <[email protected]> 
wrote:
Yes, it chooses the first. But we won't be able to perform any operation as 
intended. So why allow duplicate names (ex: in `setnames` as Alexandre asks)?

Arun

On Friday, November 1, 2013 at 11:57 PM, Eduard Antonyan wrote:

I think currently it chooses the first "x", but it's definitely a good idea to 
add a warning there.


On Fri, Nov 1, 2013 at 5:51 PM, Arunkumar Srinivasan <[email protected]> 
wrote:
Ricardo added a bug report here on this topic: 
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975
But I don't think having duplicate names is an easy-to-implement concept. For 
example:

dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))
dt[, print(.SD), by=y]
   x
1: 1
2: 2
   x
1: 3

.SD loses the second "x". Also, some other questions become difficult to 
handle. Ex: 

dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))
dt[, list(x=x/x[1], y=y), by=x]

Which "x" should be chosen for which operation?

Arun

On Friday, November 1, 2013 at 10:59 PM, Eduard Antonyan wrote:

Having duplicate names is allowed and not that unusual in the data.table 
framework, so there is no need to signal anything here.

A different question is whether there should be a warning here:

  dt = data.table(a = 1, a = 2)
  dt[, a]

and I think that'd be a pretty good FR to have.


On Fri, Nov 1, 2013 at 4:49 PM, Alexandre Sieira <[email protected]> 
wrote:
I found this behavior during a debugging session: 

> d = data.table(a=1, b=2, c=3)
> setnames(d, "a", "b")
> d
   b b c
1: 1 2 3

Shouldn’t setnames() check if the new column names already exist before 
renaming, and signal an error or at least a warning if they do?

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
