1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9 from R-Forge. The sample csv I was using does indeed work with it: fread does the best it can with the mucked-up header. Maybe this is sufficient and a skip is not needed, but the fact remains that there is no facility to skip over the bad header had I wanted to.
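For comparison, here is a minimal sketch of the kind of forced skip I have in mind, using the skip= argument that read.table/read.csv already provide in base R. The sample lines and the column names are made up for illustration; a real file would be passed as a file name rather than through text=.

```r
# Hypothetical sample: a mangled header line followed by well-formed data rows.
csv_lines <- c(
  "id,,,",
  "1,a,2.5,x",
  "2,b,3.5,y"
)

# Skip the bad header explicitly and tell the reader there is no header row.
dat <- read.csv(text = csv_lines, skip = 1, header = FALSE,
                stringsAsFactors = FALSE)

# Supply column names by hand, since the header could not be trusted.
# These names are assumptions made up for this example.
names(dat) <- c("id", "grp", "value", "flag")

print(dat)
```

This is base R only, so it says nothing about how fread behaves; it just shows the explicit override that is currently missing from fread.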
On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle <[email protected]> wrote:
> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>
>> Not with the csv I tried. The header is messed up (most of the header
>> fields are missing) and it misconstrues it as data.
>
> That was fixed a while ago in v1.8.9, from NEWS :
>
> " [fread] If some column names are blank they are now given default names
> rather than causing the header row to be read as a data row "
>
>> The automation is great but some way to force its behavior when you
>> know what it should do seems essential since heuristics can't be
>> expected to work in all cases.
>
> I suspect the heuristics in v1.8.9 work on all your examples so far, but ok
> point taken.
>
> fread allows control of 'autostart' already. This is a line number (default
> 30) within the regular data block used to detect the separator and search
> upwards from to find the first data row and/or column names.
>
> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off
> the search upwards part. Line skip+1 will be used to detect the separator
> when sep="auto" and used as column names according to
> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
> autostart and skip in the same call. If that sounds ok?
>
> Matthew
>
>>
>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>> <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Does the auto skip feature of fread cover both of those? From ?fread :
>>>
>>> " Once the separator is found on line autostart, the number of columns is
>>> determined. Then the file is searched backwards from autostart until a row
>>> is found that doesn't have that number of columns, or the start of file is
>>> reached. Thus, the first data row is found and any human readable banners
>>> are automatically skipped. This feature can be particularly useful for
>>> loading a set of files which may not all have consistently sized banners.
>>> "
>>>
>>> There were also some issue with header=FALSE in the first release (1.8.8)
>>> which have since been fixed in 1.8.9.
>>>
>>> Matthew
>>>
>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>
>>>> I would find it useful if fread had a skip= argument as in read.table
>>>> since I have files from time to time that have garbage at the top.
>>>> Another situation I find from time to time is that the header is
>>>> messed up but one can still read the file if one can skip over the
>>>> header and specify header = FALSE.
>>>>
>>>> An extra feature that would be nice but less important would be if one
>>>> could specify skip = "string" and have it skip all lines until it
>>>> found one with "string": in it and then start reading from the matched
>>>> row onward. Normally the string would be chosen to be a string found
>>>> in the header and not likely found prior to the header. read.xls in
>>>> gdata has a similar feature and I find it quite handy at times.
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> [email protected]
>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
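
P.S. To make the backwards search quoted from ?fread concrete, here is a rough sketch of the idea in plain R. It is not fread's actual code. The function name find_first_row is hypothetical, the separator is assumed to be known rather than detected, and clamping autostart to the length of the file is my own assumption.

```r
# Illustrative only: count the fields on each line, then walk upwards from
# `autostart` while the lines above keep the same number of fields.
find_first_row <- function(lines, autostart = 30L, sep = ",") {
  # Assumption: clamp autostart so that short inputs still work.
  autostart <- min(autostart, length(lines))

  # Number of fields = number of separators on the line + 1.
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  n_fields <- n_sep + 1L

  target <- n_fields[autostart]
  first <- autostart
  while (first > 1L && n_fields[first - 1L] == target) {
    first <- first - 1L
  }
  first
}

# Hypothetical file contents: a two-line banner, column names, two data rows.
lines <- c(
  "Report banner",
  "generated 2013",
  "id,grp,value",
  "1,a,2.5",
  "2,b,3.5"
)
first_row <- find_first_row(lines)
print(first_row)
```

On this made-up input the search stops at line 3, the column-names row, because line 2 has a different number of fields. A header with missing fields would stop the search one line too late, which is the case an explicit skip is meant to cover.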
