There is no column separator for what I am referring to as "flat files" which is the challenge. My apologies if my terminology is off. By “flat file” I mean that every field has a constant width for every row. So, a file with 3 variables might have an patient id character field that is 8 wide, followed by an age in a numeric field 3 wide, followed by a gender as an integer that is 1 wide. It would look like this. Sometimes there are spaces if fields are empty or if the variables are smaller than the space allotted. pat000010381 pat000020292 pat000030571
which needs to be separated into Patid age gender pat00001 038 1 pat00002 029 2 pat00003 057 1 We have Stat/Transfer and it doesn’t do flat files as far as I can tell. As I said, we can do it quickly and easily in SAS. Most datasets ship with a SAS program for conversion and Adam Damico has written the R package SAScii to parse SAS load files into an R script that goes to read.fwf(). But I am trying to see if it is possible to allow fread to take a vector to split the file. For the above it would be something like passing sep=(8,3,1). All Medicare claims data and most health related national surveys come this way. For an example, the bottom of this file shows how such a file is loaded https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core .SAS This is the first 20% of the list: *** Read data elements from the ASCII file ***; INPUT @1 AGE N3PF. @4 AGEDAY N3PF. @7 AMONTH N2PF. @9 ASOURCE N2PF. @11 ASOURCEUB92 $CHAR1. @12 ASOURCE_X $CHAR3. @15 ATYPE N2PF. @17 AWEEKEND N2PF. @19 DIED N2PF. @21 DISCWT N11P7F. On 4/20/14, 7:36 AM, "Michael Smith" <[email protected]> wrote: >Not sure exactly what you mean by "flat file." I previously assumed you >mean fixed width formatted data, but now you say they are concatenated >and there are no spaces. So what's the column separator? Tab, comma, ...? > >If everything else fails, try Stat/Transfer. > >M > >On 04/20/2014 03:04 PM, Mark Danese wrote: >> Thanks Michael. The flat file format doesn’t have spaces between >>fields. >> They are all concatenated. It may be possible to use sed with a vector >>of >> widths, but I am not a command-line person (yet). >> >> It just may be one of those things that isn’t easy to implement in >>fread. >> In healthcare in the US there are still a lot of flat files out there. >>We >> usually use SAS but I am trying to get away from that. And R can read >> flat files(read.fwf), but it is pretty slow. From what I understand, >> read.fwf actually does insert commas and then reads the file. So, it >> might be possible to hack read.fwf and fread together somehow. >> >> My first experience with fread was to read in a 1.6 GB file in 30 >>seconds. >> That was pretty impressive. >> >> >> On 4/19/14, 5:07 AM, "Michael Smith" <[email protected]> wrote: >> >>> Probably you could do this from the Linux command line using `sed`, >>>i.e. >>> to replace several spaces with a comma. >>> >>> https://www.google.com/search?q=sed+replace+space+with+comma >>> >>> If you're on Windows, you probably can do the same using Cygwin. >>> >>> M >>> >>> >>> On 04/19/2014 12:37 AM, Mark Danese wrote: >>>> Is it possible to pass a vector of column widths to have fread read >>>>in a >>>> flat file? I saw that someone suggested using csvkit to add commas >>>>and >>>> then use data table, but that is beyond my skill set. >>>> >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> [email protected] >>>> >>>> >>>>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable- >>>>he >>>> lp >>>> >> _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
