Re: [datatable-help] fread: skip

Matthew Dowle Sun, 12 May 2013 08:20:50 -0700

For that I think all that needs to be done (now) is adding something very similar to these few lines (from read.table) into fread at R level after the data has been read in:


       if (colClasses[i] == "factor")
           as.factor(data[[i]])
       else if (colClasses[i] == "Date")
           as.Date(data[[i]])
       else if (colClasses[i] == "POSIXct")
           as.POSIXct(data[[i]])
       else methods::as(data[[i]], colClasses[i])

Although I don't quite see why read.table explicitly deals with factor, Date and POSIXct separately, rather than leaving them to the methods::as catch all at the end.
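As a rough illustration of that post-read step, here is a minimal sketch in R. The helper name coerce_columns is hypothetical and is not part of fread or read.table. It assumes the columns have already been read as character and that colClasses is an unnamed character vector with one entry per column (NA meaning "leave as is").

```r
# Hypothetical helper (not part of data.table): coerce columns after reading.
coerce_columns <- function(data, colClasses) {
  stopifnot(length(colClasses) == length(data))
  for (i in seq_along(colClasses)) {
    cls <- colClasses[[i]]
    if (is.na(cls)) next  # NA means "leave this column alone"
    data[[i]] <- switch(cls,
      factor  = as.factor(data[[i]]),
      Date    = as.Date(data[[i]]),
      POSIXct = as.POSIXct(data[[i]]),
      methods::as(data[[i]], cls)  # catch all for every other class
    )
  }
  data
}

raw <- data.frame(grp = c("a", "b"),
                  day = c("2013-05-11", "2013-05-12"),
                  n   = c("1", "2"),
                  stringsAsFactors = FALSE)
out <- coerce_columns(raw, c("factor", "Date", "numeric"))
str(out)
```

The special cases mirror the read.table lines quoted above; everything else goes through methods::as.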

But reading dates (for example) as character and then converting to Date at R level is going to be relatively slow due to the intermediate character vector and adding all the unique strings to R's global cache. Direct reading of dates (e.g. by using Simon U's fasttime package) could be built in at C level at a later date just for speed, without breaking syntax or output types. In the meantime it would work at least. That's the thinking, anyway.

I found some discussion in R News 4.1 about Excel dates and times, but not on colClasses or that mapping specifically. Currently in fread if a colClasses name isn't recognised as a basic type like integer|numeric|double|integer64|character, then it's read as character and (to be done) as long as there's an as.() method for it that'll take care of it. Reading numbers (such as offset from epoch) and then as() on that numeric|integer column isn't something I'd considered before (is that what you mean?)
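To make the "read a number, then convert" idea concrete, here is a small sketch in R. The function names are hypothetical and nothing like them is confirmed to exist in fread. It assumes the Excel serial numbers come from the Windows (1900) date system, for which ?as.Date gives the origin "1899-12-30"; workbooks using the 1904 date system would need a different origin.

```r
# Hypothetical helpers, assuming Excel's Windows (1900) date system.
excel_to_date <- function(x) {
  # Whole days since 1899-12-30; any fractional (time of day) part is dropped.
  as.Date(floor(x), origin = "1899-12-30")
}

excel_to_posixct <- function(x, tz = "UTC") {
  # 25569 is the Excel serial number of 1970-01-01, so (x - 25569) is the
  # number of days since the Unix epoch; multiply by 86400 to get seconds.
  as.POSIXct((x - 25569) * 86400, origin = "1970-01-01", tz = tz)
}

excel_to_date(41406)       # a date in May 2013
excel_to_posixct(41406.5)  # noon UTC on the same day
```

A colClasses entry could in principle dispatch to conversions like these after the column has been read as numeric, but that wiring is only a suggestion in this thread.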


Matthew


On 12.05.2013 15:44, Gabor Grothendieck wrote:

That looks great. It occurred to me in looking at this that one thing
that might be useful would be to provide some conversion routines that
can be specified as classes in the colClasses vector that will convert
numbers from Excel representing Dates or date/times to Date and
POSIXct class respectively. (The mapping is discussed in R News 4/1.)

On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle wrote:
Agreed too. colClasses was committed yesterday as luck would have it.
?fread now has :

   colClasses : A character vector of classes (named or unnamed), as
read.csv. Or, type list enables setting ranges of columns by numeric
position. colClasses in fread is intended for rare overrides, not for
routine use. fread will only promote a column to a higher type if
colClasses requests it. It won't downgrade a column to a lower type since
NAs would result. You have to coerce such columns afterwards yourself, if
you really require data loss.

The tests so far are as follows :

input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n'

test(952, fread(input, colClasses=c(C="character")),
data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000")))
test(953, fread(input, colClasses=c(C="character",A="numeric")),
data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(954, fread(input, colClasses=c(C="character",A="double")),
data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(955, fread(input, colClasses=list(character="C",double="A")),
data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(956, fread(input, colClasses=list(character=2:3,double="A")),
data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(957, fread(input, colClasses=list(character=1:3)),
data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
test(958, fread(input, colClasses="character"),
data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
test(959, fread(input,colClasses=c("character","double","numeric")),
data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28)))

test(960, fread(input, colClasses=c("character","double")),
error="colClasses is unnamed and length 2 but there are 3 columns. See")
test(961, fread(input, colClasses=1:3),
error="colClasses is not type list or character vector")
test(962, fread(input, colClasses=list(1:3)),
error="colClasses is type list but has no names")
test(963, fread(input, colClasses=list(character="D")),
error="Column name 'D' in colClasses not found in data")
test(964, fread(input, colClasses=c(D="character")),
error="Column name 'D' in colClasses not found in data")
test(965, fread(input, colClasses=list(character=0)),
error="Column number 0 (colClasses..1...1.) is out of range .1,ncol=3.")
test(966, fread(input, colClasses=list(character=2:4)),
error="Column number 4 (colClasses..1...3.) is out of range .1,ncol=3.")

More detailed/trace info is provided when verbose=TRUE.

On embedded quotes there are known and documented problems still to resolve. The issue there is subtle: when reading character columns, part of fread's speed comes from pointing mkCharLen() directly to the field in the memory mapped region of RAM i.e. the field isn't copied into any intermediate buffer at all. But for embedded quotes (either doubled or escaped) we do need to copy to a buffer so we can remove the doubled quote, or escape character (i.e. change the field) before calling mkCharLen(). That's not a problem per se, but just a new twist to the C code to implement. In order to not slow down, it need only copy that field to a buffer if a doubled or escaped quote was actually present in that particular field.
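The "only touch the field if it needs it" idea can be sketched at R level as follows. This is purely an illustration of the control flow: the function unescape_field is hypothetical, and the real work described above would happen in C around mkCharLen(), not in R.

```r
# Hypothetical R-level illustration of the fast path / slow path split.
# `field` is the text between the enclosing quotes of one field.
unescape_field <- function(field, style = c("doubled", "backslash")) {
  style <- match.arg(style)
  # The two-character sequence that stands for one literal quote.
  pattern <- if (style == "doubled") '""' else '\\"'
  # Fast path: no embedded quote, so return the field untouched (no rewrite).
  if (!grepl(pattern, field, fixed = TRUE)) return(field)
  # Slow path: build a modified copy with each sequence collapsed to one quote.
  gsub(pattern, '"', field, fixed = TRUE)
}

unescape_field("plain text")
unescape_field('say ""hi""')
unescape_field('say \\"hi\\"', style = "backslash")
```

The check before the rewrite is the point: most fields take the first return and are never modified.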

Matthew



On 12.05.2013 14:24, Gabor Grothendieck wrote:
Sorry, I did indeed miss the portion of the reply at the very bottom.
Yes, that seems good.
What about colClasses too? I would think that there would be cases
where an automatic approach might not give the result wanted.  For
example, order numbers might all be numeric but you would want to
store them as character in case there are leading zeros.  In other
cases similar fields might validly have leading zeros but you would
want them regarded as numeric so there is no way to distinguish the
two cases except by having the user indicate their intention.
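The leading-zero case can be shown with base R's read.csv, which accepts a named colClasses vector. The column names and values below are made up for illustration.

```r
# Made-up data: an order number column with leading zeros.
csv <- "order,qty\n00123,1\n00456,2\n"

# Automatic type detection treats `order` as a number and drops the zeros.
auto <- read.csv(text = csv)

# Stating the intention keeps the field exactly as written.
kept <- read.csv(text = csv, colClasses = c(order = "character"))

auto$order
kept$order
```

Both results are valid readings of the same file, which is why the user has to indicate which one is wanted.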

Also, there exist cases where
- fields are unquoted,
- fields are quoted and doubled quotes are used to indicate an
actual quote and
- fields are quoted but a backslash quote is used to denote an
actual quote.
Ideally all these situations could be handled through some combination
of automatic and specified arguments. R's read.table cannot handle the
backslashed quote case but handles the others mentioned.
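For reference, the first two cases can be tried with base R's read.csv. The sample data is made up; the backslash case is left out here since, as noted above, read.table does not handle it.

```r
# Made-up data: row 1 has an unquoted field, row 2 a quoted field in which
# a doubled quote stands for one literal quote.
csv <- 'id,note\n1,plain\n2,"say ""hi"""\n'

d <- read.csv(text = csv, stringsAsFactors = FALSE)
d$note
```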


On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle wrote:
Hi,
I suspect you may not have scrolled further down in my reply where I
wrote more?

Matthew



On 12.05.2013 13:26, Gabor Grothendieck wrote:
1.8.8 is the most recent version on CRAN so I have now installed 1.8.9
from R-Forge and the sample csv I was using does indeed work,
attempting to do the best it can with the mucked up header. Maybe
this is sufficient and a skip is not needed but the fact is that there
is no facility to skip over the bad header had I wanted to.

On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle wrote:
On 12.05.2013 00:47, Gabor Grothendieck wrote:
Not with the csv I tried. The header is messed up (most of the header
fields are missing) and it misconstrues it as data.

That was fixed a while ago in v1.8.9, from NEWS :
" [fread] If some column names are blank they are now given default names
   rather than causing the header row to be read as a data row "

The automation is great but some way to force its behavior when you
know what it should do seems essential since heuristics can't be
expected to work in all cases.

I suspect the heuristics in v1.8.9 work on all your examples so far,
but ok point taken.
fread allows control of 'autostart' already. This is a line number
(default 30) within the regular data block used to detect the separator
and search upwards from to find the first data row and/or column names.

Will add 'skip' then. It'll be like setting autostart=skip+1 but
turning off the search upwards part. Line skip+1 will be used to detect
the separator when sep="auto" and used as column names according to
header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
autostart and skip in the same call.  If that sounds ok?
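Since skip was only a proposal for fread at this point in the thread, the intended behaviour is illustrated here with base R's read.csv, which already has a skip argument. The sample text is made up.

```r
# Made-up file contents: two banner lines, then a header and two data rows.
txt <- "Exported by some tool\nDo not edit\nA,B\n1,2\n3,4\n"

# skip = 2: reading starts at line 3, which is used as the column names.
with_header <- read.csv(text = txt, skip = 2, header = TRUE)

# skip = 3: the header line is skipped as well, so default names are used.
no_header <- read.csv(text = txt, skip = 3, header = FALSE)

with_header
no_header
```

The second call covers the "header is messed up" situation: skip past it and read with header = FALSE.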

Matthew
On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle wrote:
Hi,
Does the auto skip feature of fread cover both of those? From ?fread :
" Once the separator is found on line autostart, the number of columns
is determined. Then the file is searched backwards from autostart until
a row is found that doesn't have that number of columns, or the start of
file is reached. Thus, the first data row is found and any human readable
banners are automatically skipped. This feature can be particularly
useful for loading a set of files which may not all have consistently
sized banners.
"
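The backwards search described in that help text can be sketched in a few lines of R. This is a hypothetical illustration, not fread's actual C code: the function name find_first_row is made up, the separator is taken as given rather than detected, quoted fields are ignored, and strsplit() drops a trailing empty field.

```r
# Hypothetical sketch of the search upwards from `autostart`.
find_first_row <- function(lines, autostart = 30L, sep = ",") {
  n_fields <- function(line) length(strsplit(line, sep, fixed = TRUE)[[1L]])
  # Start inside the regular data block (or at the last line of a short file).
  row <- min(autostart, length(lines))
  expected <- n_fields(lines[row])
  # Walk upwards while the previous line has the same number of fields.
  while (row > 1L && n_fields(lines[row - 1L]) == expected) {
    row <- row - 1L
  }
  row
}

lines <- c("Report generated 2013-05-11",
           "Internal use only",
           "A,B,C",
           "1,2,3",
           "4,5,6")
find_first_row(lines)  # 3: the line holding the column names
```

The two banner lines have one field each, so the walk stops at line 3 regardless of how many banner lines precede it.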
There were also some issues with header=FALSE in the first release
(1.8.8) which have since been fixed in 1.8.9.

Matthew



On 11.05.2013 23:16, Gabor Grothendieck wrote:
I would find it useful if fread had a skip= argument as in read.table
since I have files from time to time that have garbage at the top.
Another situation I find from time to time is that the header is
messed up but one can still read the file if one can skip over the
header and specify header = FALSE.

An extra feature that would be nice but less important would be if one
could specify skip = "string" and have it skip all lines until it
found one with "string" in it and then start reading from the matched
row onward. Normally the string would be chosen to be a string found
in the header and not likely found prior to the header. read.xls in
gdata has a similar feature and I find it quite handy at times.
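The skip = "string" request can be approximated in base R as follows. The helper read_from_match is hypothetical and only illustrates the requested behaviour; it assumes the file has already been read into a character vector of lines (for example with readLines) and that the string is matched literally rather than as a regular expression.

```r
# Hypothetical helper: start reading at the first line containing `string`.
read_from_match <- function(lines, string, ...) {
  hit <- grep(string, lines, fixed = TRUE)
  if (length(hit) == 0L) stop("no line contains '", string, "'")
  # Keep the matched line and everything after it, then parse as csv.
  read.csv(text = lines[hit[1L]:length(lines)], ...)
}

lines <- c("garbage at the top",
           "more garbage",
           "id,value",
           "1,10",
           "2,20")
d <- read_from_match(lines, "id,value")
d
```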
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
_______________________________________________
datatable-help mailing list
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help