Looks really nice.

On Sun, May 12, 2013 at 6:01 PM, Matthew Dowle <[email protected]> wrote:
>
> And skip="string" is also now added and gdata credited (nice idea!)
>
> > input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
> > cat(input)
> some,bad,data
>
> some,cols
> 1,2
> 3,4
>
>
> real data:
> A,B,C
> 1,3,5
> 2,4,6
> > fread(input, skip="B,C")
>    A B C
> 1: 1 3 5
> 2: 2 4 6
> > fread(input)  # autostart handles this case already (since the "real data:" line doesn't contain 2 * sep)
>    A B C
> 1: 1 3 5
> 2: 2 4 6
> > fread(input, skip="some,cols")  # using skip="string" to get the middle table
>    some cols
> 1:    1    2
> 2:    3    4
> Warning message:
> In fread(input, skip = "some,cols") :
>   Stopped reading at empty line, 2 lines after the 'skip' string was found, but text exists afterwards (discarded): real data:
>
> Further example:
>
> > input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 4\n"
> > cat(input)
> some,bad,data
>
> some,cols
> 1,2
> 3,4
>
> real data:
> A B
> 1 3
> 2 4
> > fread(input)  # with space as separator autostart can't distinguish the "real data:" line. header wouldn't help here.
>    real data:
> 1:    A     B
> 2:    1     3
> 3:    2     4
> > fread(input, skip="B")  # skip="string" needed (skip=n onerous). Nice!
>    A B
> 1: 1 3
> 2: 2 4
>
> Matthew
>
> On 12.05.2013 18:33, Matthew Dowle wrote:
>>
>> Since I'm in the fread code at the moment I added 'skip' (rev 864).
>> 4 tests added:
>>
>> > input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"
>> > fread(input)
>>    some bad data
>> 1:    A   B    C
>> 2:    1   3    5
>> 3:    2   4    6
>> > fread(input, skip=1)
>>    A B C
>> 1: 1 3 5
>> 2: 2 4 6
>> > fread(input, skip=2)
>>    V1 V2 V3
>> 1:  1  3  5
>> 2:  2  4  6
>> > fread(input, skip=2, header=TRUE)
>>    1 3 5
>> 1: 2 4 6
>>
>> On 12.05.2013 14:24, Gabor Grothendieck wrote:
>>>
>>> Sorry, I did indeed miss the portion of the reply at the very bottom.
>>> Yes, that seems good.
>>>
>>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I suspect you may not have scrolled further down in my reply where I wrote more?
>>>>
>>>> Matthew
>>>>
>>>> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>>>>
>>>>> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9 from R-Forge, and the sample csv I was using does indeed work, attempting to do the best it can with the mucked-up header. Maybe this is sufficient and a skip is not needed, but the fact is that there is no facility to skip over the bad header had I wanted to.
>>>>>
>>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle <[email protected]> wrote:
>>>>>>
>>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>>>>>
>>>>>>> Not with the csv I tried. The header is messed up (most of the header fields are missing) and it misconstrues it as data.
>>>>>>
>>>>>> That was fixed a while ago in v1.8.9, from NEWS:
>>>>>>
>>>>>> " [fread] If some column names are blank they are now given default names rather than causing the header row to be read as a data row "
>>>>>>
>>>>>>> The automation is great but some way to force its behavior when you know what it should do seems essential since heuristics can't be expected to work in all cases.
>>>>>>
>>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but ok, point taken.
>>>>>>
>>>>>> fread allows control of 'autostart' already. This is a line number (default 30) within the regular data block. It is used to detect the separator, and fread searches upwards from it to find the first data row and/or column names.
>>>>>>
>>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off the search upwards part.
>>>>>> Line skip+1 will be used to detect the separator when sep="auto" and used as column names according to header="auto"|TRUE|FALSE as usual. It'll be an error to specify both autostart and skip in the same call. If that sounds ok?
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Does the auto skip feature of fread cover both of those? From ?fread:
>>>>>>>>
>>>>>>>> " Once the separator is found on line autostart, the number of columns is determined. Then the file is searched backwards from autostart until a row is found that doesn't have that number of columns, or the start of file is reached. Thus, the first data row is found and any human readable banners are automatically skipped. This feature can be particularly useful for loading a set of files which may not all have consistently sized banners. "
>>>>>>>>
>>>>>>>> There were also some issues with header=FALSE in the first release (1.8.8) which have since been fixed in 1.8.9.
>>>>>>>>
>>>>>>>> Matthew
>>>>>>>>
>>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>>>>>
>>>>>>>>> I would find it useful if fread had a skip= argument as in read.table since I have files from time to time that have garbage at the top. Another situation I find from time to time is that the header is messed up but one can still read the file if one can skip over the header and specify header = FALSE.
>>>>>>>>>
>>>>>>>>> An extra feature that would be nice but less important would be if one could specify skip = "string" and have it skip all lines until it found one with "string" in it and then start reading from the matched row onward. Normally the string would be chosen to be a string found in the header and not likely found prior to the header. read.xls in gdata has a similar feature and I find it quite handy at times.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Statistics & Software Consulting
>>>>>>>>> GKX Group, GKX Associates Inc.
>>>>>>>>> tel: 1-877-GKX-GROUP
>>>>>>>>> email: ggrothendieck at gmail.com
>>>>>>>>> _______________________________________________
>>>>>>>>> datatable-help mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
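
P.S. For anyone reading along without the R-Forge build, the skip="string" idea discussed in this thread can be approximated in base R. This is only a minimal sketch under my own assumptions, not how fread implements it: `read_from_match` is a hypothetical helper name (it exists in neither data.table nor gdata), it does a fixed-string search for the first matching line, and it stops at the first empty line after the match, which mirrors the "Stopped reading at empty line" warning quoted in the thread.

```r
# Hypothetical helper (not part of data.table or gdata): a base-R
# approximation of the skip = "string" behaviour discussed in this thread.
read_from_match <- function(input, pattern, sep = ",") {
  # Split the in-memory text into lines; a trailing newline adds no empty element.
  lines <- strsplit(input, "\n", fixed = TRUE)[[1]]

  # First line containing the literal string; that line becomes the header row.
  start <- grep(pattern, lines, fixed = TRUE)[1]
  if (is.na(start)) stop("pattern not found: ", pattern)

  rest <- lines[start:length(lines)]

  # Assumption: the table ends at the first empty line after the match,
  # and anything after that empty line is silently discarded.
  blank <- which(rest == "")
  if (length(blank) > 0) rest <- rest[seq_len(blank[1] - 1)]

  read.table(text = paste(rest, collapse = "\n"), sep = sep, header = TRUE)
}

input <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"

last_table   <- read_from_match(input, "B,C")        # the A,B,C table
middle_table <- read_from_match(input, "some,cols")  # the middle table
print(last_table)
print(middle_table)
```

Unlike fread, the sketch returns a plain data.frame, does no separator detection (sep must be supplied), and gives no warning about the discarded text.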
