On Sat, 2005-10-15 at 23:54 +0800, ronggui wrote:
> It seems my last post was not sent successfully, so I am posting again.
>
> -------------
> The data file has this structure:
>
> 1992 6245 49 . . 20 1
> 0 0 8.739536 0 . . .
> . . . . . "alabama"
> . 0 .
> 1993 7677 58 . . 15 1
> 0 0 8.945984 1 . 0 .2064476
> -5 0 . 0 8.739536 "alabama"
> 9 0 0
> 1992 13327 57 36 58 16 0
> 0 0 9.497547 0 47 . .
> . . . 0 . "arizona"
> . 0 .
> 1993 19860 57 36 58 16 1
> 1 0 9.896463 1 47 0 .3989162
> 0 1 0 1 9.497547 "arizona"
> 0 1 1
> 1992 10422 37 28 58 20 0
> 0 0 9.251675 0 43 . .
> . . . -1 . "arizona state"
> . 0 .
>
> ------snip-----
>
> The data description is:
>
> variable names:
>
> year apps top25 ver500 mth500 stufac bowl btitle
>
> finfour lapps d93 avg500 cfinfour clapps cstufac cbowl
>
> cavg500 cbtitle lapps_1 school ctop25 bball cbball
>
> Obs: 118
>
> 1. year 1992 or 1993
> 2. apps # applics for admission
> 3. top25 perc frosh class in 25th high sch percen
> 4. ver500 perc frosh >= 500 on verbal SAT
> 5. mth500 perc frosh >= 500 on math SAT
> 6. stufac student-faculty ratio
> 7. bowl = 1 if bowl game in prev year
> 8. btitle = 1 if men's cnf chmps prev year
> 9. finfour = 1 if men's final 4 prev year
> 10. lapps log(apps)
> 11. d93 =1 if year = 1993
> 12. avg500 (ver500+mth500)/2
> 13. cfinfour change in finfour
> 14. clapps change in lapps
> 15. cstufac change in stufac
> 16. cbowl change in bowl
> 17. cavg500 change in avg500
> 18. cbtitle change in btitle
> 19. lapps_1 lapps lagged
> 20. school university name
> 21. ctop25 change in top25
> 22. bball =1 if btitle or finfour
> 23. cbball change in bball
>
>
> So each group of four lines represents one case, and some variables are
> numeric while others are character.
> I thought scan() could read it in, but it seems somewhat tricky because of
> the mixed variable types. Any suggestions?

There may be an easier way, but here is one possible approach.

First, use scan() to read in the data. Set the 'what' argument to a list
of atomic data types, based upon your specs above. Also, set the
'na.strings' argument to ".".

This will read the multiple lines for each record into a single
record, based upon there being 23 elements per record. That count comes
from 'length(what)'. Note also the 'multi.line' argument in scan().

data <- scan("data.txt",
             what = c(rep(list(numeric(0)), 19),
                      list(character(0)),
                      rep(list(numeric(0)), 3)),
             na.strings = ".")
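
For anyone who wants to try this without the full file, here is a minimal self-contained sketch. It assumes only the first two records shown in the original post, and the tempfile() path is just a stand-in for "data.txt".

```r
# Two records from the post, four physical lines each (7 + 7 + 6 + 3 = 23 fields).
lines <- c(
  '1992 6245 49 . . 20 1',
  '0 0 8.739536 0 . . .',
  '. . . . . "alabama"',
  '. 0 .',
  '1993 7677 58 . . 15 1',
  '0 0 8.945984 1 . 0 .2064476',
  '-5 0 . 0 8.739536 "alabama"',
  '9 0 0'
)
tmp <- tempfile(fileext = ".txt")   # assumption: stands in for "data.txt"
writeLines(lines, tmp)

# 19 numeric fields, one character field (school), then 3 more numeric fields.
what <- c(rep(list(numeric(0)), 19),
          list(character(0)),
          rep(list(numeric(0)), 3))

# multi.line = TRUE is the default, so one record may span several lines.
rec <- scan(tmp, what = what, na.strings = ".", quiet = TRUE)

length(rec)   # number of fields per record
rec[[20]]     # the school names
```

Only a field that is exactly "." is treated as missing, so a value such as .2064476 is still read as a number.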

'data' is now a list of values, where each list element is a proper
column from your original data file. Now use as.data.frame(), which will
take each list element and turn it into a column in a data frame,
preserving the data types.

data <- as.data.frame(data)
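
One caveat on the types, shown as a small sketch on a hand-built list (an assumption, used here in place of the real scan() result): in R versions before 4.0.0, as.data.frame() turns character columns into factors by default, so passing stringsAsFactors = FALSE keeps 'school' as character regardless of version.

```r
# A hand-built list shaped like scan() output: one numeric, one character column.
rec <- list(c(1992, 1993), c("alabama", "alabama"))

# stringsAsFactors = FALSE keeps the character column as character in every
# R version; before R 4.0.0 the default converted it to a factor.
df <- as.data.frame(rec, stringsAsFactors = FALSE)
names(df) <- c("year", "school")

# str() shows the type of each column.
str(df)
```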

Now, read in the column names for the data frame from a text file
containing your field names above, and set the data frame column names
to these.

Names <- scan("names.txt", what = character(0))
names(data) <- Names
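
If a separate names file is not convenient, one alternative sketch is to name the elements of 'what' directly, since scan() returns a list with the same names as a list 'what'. The variable names are taken from the original post; the one-record sample and the temporary file are assumptions for illustration only.

```r
vars <- c("year", "apps", "top25", "ver500", "mth500", "stufac", "bowl",
          "btitle", "finfour", "lapps", "d93", "avg500", "cfinfour",
          "clapps", "cstufac", "cbowl", "cavg500", "cbtitle", "lapps_1",
          "school", "ctop25", "bball", "cbball")

# Every field is numeric except 'school'.
what <- rep(list(numeric(0)), length(vars))
names(what) <- vars
what["school"] <- list(character(0))

# One sample record from the post, written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c('1993 7677 58 . . 15 1',
             '0 0 8.945984 1 . 0 .2064476',
             '-5 0 . 0 8.739536 "alabama"',
             '9 0 0'), tmp)

# The names of 'what' carry over to the result, so no names file is needed.
named <- as.data.frame(scan(tmp, what = what, na.strings = ".", quiet = TRUE),
                       stringsAsFactors = FALSE)
names(named)
```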

Now review the structure of 'data':

> data
  year  apps top25 ver500 mth500 stufac bowl btitle finfour    lapps
1 1992  6245    49     NA     NA     20    1      0       0 8.739536
2 1993  7677    58     NA     NA     15    1      0       0 8.945984
3 1992 13327    57     36     58     16    0      0       0 9.497547
4 1993 19860    57     36     58     16    1      1       0 9.896463
5 1992 10422    37     28     58     20    0      0       0 9.251675
  d93 avg500 cfinfour    clapps cstufac cbowl cavg500 cbtitle  lapps_1
1   0     NA       NA        NA      NA    NA      NA      NA       NA
2   1     NA        0 0.2064476      -5     0      NA       0 8.739536
3   0     47       NA        NA      NA    NA      NA       0       NA
4   1     47        0 0.3989162       0     1       0       1 9.497547
5   0     43       NA        NA      NA    NA      NA      -1       NA
         school ctop25 bball cbball
1       alabama     NA     0     NA
2       alabama      9     0      0
3       arizona     NA     0     NA
4       arizona      0     1      1
5 arizona state     NA     0     NA

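
As a quick sanity check that the fields landed in the right columns, one could compare a derived variable against its description. This is only a sketch: the small data frame below is typed in by hand from the five rows printed above rather than read from the file, and 'chk' is a hypothetical name.

```r
# apps and lapps for the five rows printed above, typed in by hand.
chk <- data.frame(
  apps  = c(6245, 7677, 13327, 19860, 10422),
  lapps = c(8.739536, 8.945984, 9.497547, 9.896463, 9.251675)
)

# The description says lapps is log(apps); the printed values are rounded,
# so compare with a small tolerance instead of testing exact equality.
max_diff <- max(abs(log(chk$apps) - chk$lapps))
max_diff < 1e-5
```

If the fields had been shifted by one position during reading, a check like this would fail right away.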
HTH,
Marc Schwartz