Hello,
I recently began working with R, so I am not very experienced.
I cannot find a clean way to process my data frame, which is a fairly large
one.
It looks similar to this:
> name=c("A","B","C","B","C","C","C","B","C")
> nicknames=c("A1","B1","C1","B2","C2","C3","C4","B3","C5")
> value=c(4,5,9,2,7,6,3,6,7)
> table=data.frame(cbind(name,nicknames,value))
> table
  name nicknames value
1    A        A1     4
2    B        B1     5
3    C        C1     9
4    B        B2     2
5    C        C2     7
6    C        C3     6
7    C        C4     3
8    B        B3     6
9    C        C5     7
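A side note on the construction above, offered as a hedged aside: cbind() on vectors of mixed type produces a character matrix, so after data.frame(cbind(...)) the value column is no longer numeric (it is character, or a factor in R versions before 4.0.0). A minimal sketch of the difference follows; the names via_cbind and direct are only for illustration, and stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
name      <- c("A","B","C","B","C","C","C","B","C")
nicknames <- c("A1","B1","C1","B2","C2","C3","C4","B3","C5")
value     <- c(4,5,9,2,7,6,3,6,7)

# Via cbind(): all three vectors are coerced to character first
via_cbind <- data.frame(cbind(name, nicknames, value), stringsAsFactors = FALSE)
class(via_cbind$value)   # "character"

# Directly: each column keeps its own type
direct <- data.frame(name, nicknames, value, stringsAsFactors = FALSE)
class(direct$value)      # "numeric"

# With a numeric column, mean() can be applied without any conversion
mean(direct$value[direct$name == "B"])
```

If value ends up as a factor, as.numeric() on it returns the internal level codes rather than the original numbers, so any mean computed that way would be wrong.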
I need to rearrange it as follows:
- the first column should contain each name only once (I have already done
this), so it will look like
1 A
2 B
3 C
- the second column should contain all the 'nicknames' that correspond to
each single A, B or C
  name nickname       value
1 A    A1
2 B    B1,B2,B3
3 C    C1,C2,C3,C4,C5
- the third column should contain the mean of the values that correspond to
the same A, B or C
1 A    A1             mean(c(4))
2 B    B1,B2,B3       mean(c(5,2,6))
3 C    C1,C2,C3,C4,C5 mean(c(9,7,6,3,7))
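To make the target concrete, the three steps above can be sketched with tapply() from base R. This is only one possible way to express it, not necessarily the best one; the names tab, nick_by_name, mean_by_name and wanted are my own, and the data frame is rebuilt here with a numeric value column so the sketch stands on its own.

```r
# Toy data from the example above
tab <- data.frame(
  name      = c("A","B","C","B","C","C","C","B","C"),
  nicknames = c("A1","B1","C1","B2","C2","C3","C4","B3","C5"),
  value     = c(4,5,9,2,7,6,3,6,7),
  stringsAsFactors = FALSE
)

# One entry per distinct name: nicknames collapsed with commas, values averaged
nick_by_name <- tapply(tab$nicknames, tab$name, paste, collapse = ",")
mean_by_name <- tapply(tab$value, tab$name, mean)

# Both results are ordered by the sorted distinct names, so they line up
wanted <- data.frame(
  name     = names(nick_by_name),
  nickname = as.vector(nick_by_name),
  value    = as.vector(mean_by_name),
  stringsAsFactors = FALSE
)
wanted
```

Note that tapply() orders the groups by sorted name rather than by first appearance; for this toy data the two orders coincide.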
I did this using a 'for' loop.
To be clear, I created three data frames, one for each of the columns, and
combined them at the end:
> ulist=which(!duplicated(table$name)) # positions of the first occurrence of each name
> name1=data.frame(table$name[ulist]) # the list of unique names
> nicknames1=data.frame(row.names(1:length(ulist))) # empty data frame, filled in the loop
> value1=data.frame(row.names(1:length(ulist))) # empty data frame, filled in the loop
> for(i in 1:length(ulist)) {
position=which(as.character(name1[i,1])==table$name)
nicknames1[i,1]=toString(table$nicknames[position])
# as.character() first: 'value' is not numeric after cbind(), and
# as.numeric() on a factor would return level codes, not the numbers
value1[i,1]=mean(as.numeric(as.character(table$value[position])))
}
> fin=cbind(name1,nicknames1,value1)
> colnames(fin)=c("NAME","NICKNAME","VALUE")
> fin
  NAME           NICKNAME    VALUE
1    A                 A1 4.000000
2    B         B1, B2, B3 4.333333
3    C C1, C2, C3, C4, C5 6.400000
It works. But in general I work with much larger data frames
(tens of thousands of rows or more),
and my loop is too slow for them (for example, a data frame of 20000 rows and
3 columns is processed in about 10 minutes).
I intend to put this into a function, and I expect the running time to become
even longer.
If someone can suggest how to improve what I have done, or a better way to do
it, please send me a message.
Kind regards to everyone who develops and maintains R for such dummies as
me,
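One direction that might avoid searching the whole name column once per unique name is to group the row positions a single time with split() and then summarise each group. The following is a minimal sketch under my own assumptions, not a tested solution for the real data: the function name collapse_by_name and its arguments are hypothetical, and the 20000-row data frame is synthetic, made up only to match the size mentioned above.

```r
# Hypothetical helper (name and arguments are illustrative, not from any package)
collapse_by_name <- function(df, key = "name", label = "nicknames", num = "value") {
  # Coerce defensively so factor or character input is not read as level codes
  keys   <- as.character(df[[key]])
  labels <- as.character(df[[label]])
  nums   <- as.numeric(as.character(df[[num]]))

  # Group row positions once, keeping the order of first appearance of each key
  groups <- split(seq_along(keys), factor(keys, levels = unique(keys)))

  data.frame(
    NAME     = names(groups),
    NICKNAME = unname(vapply(groups, function(i) toString(labels[i]), character(1))),
    VALUE    = unname(vapply(groups, function(i) mean(nums[i]), numeric(1))),
    stringsAsFactors = FALSE
  )
}

# Synthetic data of roughly the size mentioned above (made up for illustration)
set.seed(1)
n   <- 20000
big <- data.frame(
  name      = sample(paste0("N", 1:2000), n, replace = TRUE),
  nicknames = paste0("nick", seq_len(n)),
  value     = runif(n),
  stringsAsFactors = FALSE
)

# Timing depends on the machine, so no expected figure is given here
print(system.time(res <- collapse_by_name(big)))
head(res, 3)
```

Called on the small example as collapse_by_name(table), it should give the same NAME, NICKNAME and VALUE columns as the loop, since it uses toString() and mean() in the same way.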
Alex Levitchi
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.