Re: [datatable-help] fread: skip

Matthew Dowle Sun, 12 May 2013 15:01:39 -0700


And skip="string" is also now added and gdata credited (nice idea!)

input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
cat(input)

some,bad,data

some,cols
1,2
3,4


real data:
A,B,C
1,3,5
2,4,6

fread(input, skip="B,C")

   A B C
1: 1 3 5
2: 2 4 6

fread(input) # autostart handles this case already (since the "real data:" line doesn't contain 2 * sep)

   A B C
1: 1 3 5
2: 2 4 6

fread(input, skip="some,cols") # using skip="string" to get the middle table

   some cols
1:    1    2
2:    3    4
Warning message:
In fread(input, skip = "some,cols") :
  Stopped reading at empty line, 2 lines after the 'skip' string was found, but text exists afterwards (discarded): real data:



Further example :

input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 4\n"
cat(input)

some,bad,data

some,cols
1,2
3,4

real data:
A B
1 3
2 4

fread(input) # with space as separator autostart can't distinguish the "real data:" line. header wouldn't help here.

   real data:
1:    A     B
2:    1     3
3:    2     4

fread(input, skip="B") # skip="string" needed (skip=n onerous). Nice!

   A B
1: 1 3
2: 2 4
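
For anyone without this revision of fread, the skip="string" behaviour
shown above can be approximated in base R. This is only a rough sketch:
read_from_match is a hypothetical helper, not part of data.table, and it
assumes a fixed-string match and that the table ends at the first empty
line after the match.

```r
# Hypothetical helper (not data.table code): emulate skip="string" in base R.
read_from_match <- function(text, pattern, sep = ",") {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # Index of the first line containing the search string (fixed, not regex).
  hit <- grep(pattern, lines, fixed = TRUE)[1]
  if (is.na(hit)) stop("search string not found: ", pattern)
  # Keep lines from the match up to, but not including, the next empty line.
  rest <- lines[hit:length(lines)]
  blank <- which(rest == "")
  if (length(blank) > 0) rest <- rest[seq_len(blank[1] - 1)]
  read.table(text = rest, sep = sep, header = TRUE)
}

input <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
middle <- read_from_match(input, "some,cols")  # the middle table
real   <- read_from_match(input, "B,C")        # the A,B,C table
```

Unlike fread, this sketch silently drops whatever follows the empty line
instead of warning about it.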


Matthew


On 12.05.2013 18:33, Matthew Dowle wrote:

Since I'm in the fread code at the moment I added 'skip' (rev 864).
4 tests added :
input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"
fread(input)
   some bad data
1:    A   B    C
2:    1   3    5
3:    2   4    6
fread(input, skip=1)
   A B C
1: 1 3 5
2: 2 4 6
fread(input, skip=2)
   V1 V2 V3
1:  1  3  5
2:  2  4  6
fread(input, skip=2, header=TRUE)
   1 3 5
1: 2 4 6
On 12.05.2013 14:24, Gabor Grothendieck wrote:
Sorry, I did indeed miss the portion of the reply at the very bottom.
Yes, that seems good.

On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle
<[email protected]> wrote:
Hi,
I suspect you may not have scrolled further down in my reply where I
wrote more?

Matthew



On 12.05.2013 13:26, Gabor Grothendieck wrote:
1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
from R-Forge, and the sample csv I was using does indeed work, with
fread doing the best it can with the mucked up header. Maybe this is
sufficient and a skip is not needed, but the fact is that there is no
facility to skip over the bad header had I wanted to.

On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
<[email protected]> wrote:
On 12.05.2013 00:47, Gabor Grothendieck wrote:
Not with the csv I tried. The header is messed up (most of the header
fields are missing) and it misconstrues it as data.

That was fixed a while ago in v1.8.9, from NEWS :
" [fread] If some column names are blank they are now given default names
   rather than causing the header row to be read as a data row "

The automation is great but some way to force its behavior when you
know what it should do seems essential since heuristics can't be
expected to work in all cases.

I suspect the heuristics in v1.8.9 work on all your examples so far, but
ok point taken.
fread allows control of 'autostart' already. This is a line number
(default 30) within the regular data block used to detect the separator
and search upwards from to find the first data row and/or column names.
Will add 'skip' then. It'll be like setting autostart=skip+1 but turning
off the search upwards part. Line skip+1 will be used to detect the
separator when sep="auto" and used as column names according to
header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
autostart and skip in the same call.  If that sounds ok?

Matthew
On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
<[email protected]> wrote:
Hi,
Does the auto skip feature of fread cover both of those? From ?fread :
" Once the separator is found on line autostart, the number of columns
is determined. Then the file is searched backwards from autostart until
a row is found that doesn't have that number of columns, or the start of
file is reached. Thus, the first data row is found and any human
readable banners are automatically skipped. This feature can be
particularly useful for loading a set of files which may not all have
consistently sized banners.
"
There were also some issues with header=FALSE in the first release
(1.8.8) which have since been fixed in 1.8.9.
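
The backwards search described in the ?fread excerpt above can be
sketched roughly as follows. This is an illustration only, not fread's
actual code: find_first_row is a hypothetical name, columns are counted
naively by counting separators (no quoting rules), and using the last
line when the input is shorter than autostart is an assumption.

```r
# Hypothetical sketch of the banner-skipping idea; not fread's implementation.
find_first_row <- function(lines, autostart = 30, sep = ",") {
  # Number of separators on each line (fields = separators + 1).
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  # Assumption: fall back to the last line if there are fewer than autostart.
  start <- min(autostart, length(lines))
  target <- n_sep[start]
  first <- start
  # Walk upwards while the previous line has the same number of columns.
  while (first > 1 && n_sep[first - 1] == target) first <- first - 1
  first
}

banner <- c("Report generated by some tool", "", "A,B,C", "1,3,5", "2,4,6")
row_banner <- find_first_row(banner)   # 3: the banner lines are skipped

tricky <- c("some,bad,data", "A,B,C", "1,3,5", "2,4,6")
row_tricky <- find_first_row(tricky)   # 1: the bad line has 3 fields too
```

The second case shows why a heuristic like this cannot skip a junk line
that happens to have the same number of separators as the data.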

Matthew



On 11.05.2013 23:16, Gabor Grothendieck wrote:
I would find it useful if fread had a skip= argument as in read.table
since I have files from time to time that have garbage at the top.
Another situation I find from time to time is that the header is
messed up but one can still read the file if one can skip over the
header and specify header = FALSE.
An extra feature that would be nice but less important would be if one
could specify skip = "string" and have it skip all lines until it
found one with "string" in it and then start reading from the matched
row onward. Normally the string would be chosen to be a string found
in the header and not likely found prior to the header. read.xls in
gdata has a similar feature and I find it quite handy at times.
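
For reference, the read.table behaviour being asked for looks roughly
like this in base R. The input string is a small made-up example with
one junk line at the top, used purely for illustration.

```r
# Made-up input: one line of garbage, then a header row, then two data rows.
input <- "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"

# Skip the garbage line and read the next line as the header.
with_header <- read.table(text = input, sep = ",", skip = 1, header = TRUE)

# Skip the garbage line and the header too; columns get default names V1..V3.
no_header <- read.table(text = input, sep = ",", skip = 2, header = FALSE)
```

In both calls skip counts lines from the top of the input, so the caller
has to know in advance how many lines of garbage there are.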
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help