Re: [datatable-help] Flagging duplicate (non-unique) values based on specifications

Matt Dowle Fri, 04 Oct 2013 09:30:31 -0700

Please ask questions like this on Stack Overflow, it's more efficient:

http://stackoverflow.com/questions/tagged/data.table

You can edit the question there, and people can add or remove quick comments.

In v1.8.10 on CRAN you can pass 'by' to unique and duplicated (thanks to Steve). This would simplify the question and make it easier to answer.
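
As a rough sketch of what passing 'by' to duplicated might look like for this case: the toy table below is made up purely for illustration (it is not the real Final.Export, and it carries only a subset of the 22 columns), and the names `dt` and `cols` are hypothetical. It assumes a data.table version where duplicated() accepts 'by' (v1.8.10 or later). Only the second and later rows of each group get flagged, and the table is modified in place, so no columns are dropped or rebuilt.

```r
library(data.table)

# Hypothetical toy data standing in for Final.Export (all values invented)
Final.Export <- data.frame(
  LakeID          = c(1, 1, 2, 2),
  LagosVariableID = c(10, 10, 10, 10),
  Value           = c(0.5, 0.5, 0.7, 0.9),
  Date            = as.Date(rep("2013-01-01", 4)),
  SamplePosition  = rep("Surface", 4),
  SampleDepth     = c(1, 1, 1, 1),
  Comments        = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(Final.Export)

# Columns that together define a unique observation (taken from the question)
cols <- c("LakeID", "LagosVariableID", "Value",
          "Date", "SamplePosition", "SampleDepth")

# Start with every row flagged as "not duplicate" (NA) ...
dt[, Dup := NA_real_]

# ... then mark the second and later occurrences within each group as 1
dt[duplicated(dt, by = cols), Dup := 1]

print(dt)
```

Because the flag is assigned by reference with `:=`, there is no need to split the data into two tables and rbind them back together.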


Matt

On 04/10/13 16:57, limno.sam wrote:

Hi,

I'm working with about 60 data sets which need to have duplicate
(non-unique) values removed.

The data sets have 22 unique column names (the same for each data set):
 [1] "LakeID"                    "LakeName"                  "SourceVariableName"
 [4] "SourceVariableDescription" "SourceFlags"               "LagosVariableID"
 [7] "LagosVariableName"         "Value"                     "Units"
[10] "CensorCode"                "DetectionLimit"            "Date"
[13] "LabMethodName"             "LabMethodInfo"             "SampleType"
[16] "SamplePosition"            "SampleDepth"               "MethodInfo"
[19] "BasinType"                 "Subprogram"                "Comments"
[22] "Dup"

I am interested in flagging observations that are duplicate (replicate)
values. I define an observation as NOT duplicate when its combination of
"LakeID", "LagosVariableID", "Value", "Date", "SamplePosition" and
"SampleDepth" is unique across rows.

Note that the "Dup" column is where I want to flag whether or not an
observation is duplicate (NA= not duplicate, 1= duplicate)

I have tried the following code, where Final.Export is the data set with the
22 columns listed above:

library(data.table)
# flag the unique (non-duplicate) values as NA
data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data1=data1[unique(data1[,key(data1),with=FALSE]),mult='first']
data1$Dup=NA
# flag the duplicate values as "1"
data2=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data2=data2[duplicated(data2[,key(data2),with=FALSE]),mult='first']
data2$Dup=1
# check that the two pieces add up to the total
(length(data1$Value))+((length(data2$Value)))
length(data2$Value)
length(Final.Export$Value) # adds up to total
# bind the tables
Final.Export1=rbind(data1,data2,use.names=TRUE)

The code works for flagging the duplicate observations. However, the values
for several of the variables in the original data frame "Final.Export" are
converted to NA in "Final.Export1".

Any ideas how to prevent that from happening?



--
View this message in context: 
http://r.789695.n4.nabble.com/Flagging-duplicate-non-unique-values-based-on-specifications-tp4677610.html
Sent from the datatable-help mailing list archive at Nabble.com.
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
