Hello,
When reading a file with very small numbers in scientific notation, fread bumps
the column type to "character":
> tmp <- fread(files[1], verbose = TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart')
... sep='\t'
Found 5 columns
First row with 5 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 188308
Subtracted 1 for last eol and any trailing empty lines, leaving 188307 data rows
Type codes: 33302 (first 5 rows)
Type codes: 33302 (+middle 5 rows)
Type codes: 33302 (+last 5 rows)
Bumping column 5 from REAL to STR on data row 361, field contains '1.46761e-313'
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.020s ( 13%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.020s ( 13%) Allocation of 188307x5 result (xMB) in RAM
0.110s ( 73%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.150s Total
Warning message:
In fread(files[1], verbose = TRUE) :
Bumped column 5 to type character on data row 361, field contains
'1.46761e-313'. Coercing previously read values in this column from integer or
numeric back to character which may not be lossless; e.g., if '00' and '000'
occurred before they will now be just '0', and there may be inconsistencies
with treatment of ',,' and ',NA,' too (if they occurred in this column before
the bump). If this matters please rerun and set 'colClasses' to 'character' for
this column. Please note that column type detection uses the first 5 rows, the
middle 5 rows and the last 5 rows, so hopefully this message should be very
rare. If reporting to datatable-help, please rerun and include the output from
verbose=TRUE.
Perhaps there is a cutoff somewhere near e-300, since the preceding value,
'3.34402e-299', is read in without a problem.
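A quick check in base R shows one way in which the two values differ. This is only a sketch: that the difference explains fread's behaviour is an assumption, not something confirmed from the fread source. The value that triggers the bump lies below the smallest normalised double, .Machine$double.xmin, while the value that is read correctly lies above it. The variable names are illustrative.

```r
# Parse the two values from the report with base R's own converter.
ok   <- as.numeric("3.34402e-299")  # read as numeric by fread
tiny <- as.numeric("1.46761e-313")  # triggers the bump to character

# Smallest positive normalised double on this platform (about 2.2e-308).
.Machine$double.xmin

# 'ok' is a normal double; 'tiny' is a non-zero subnormal (denormal) one.
ok >= .Machine$double.xmin    # TRUE
tiny < .Machine$double.xmin   # TRUE
tiny > 0                      # TRUE
```

If this is the cause, the cutoff would sit near 2.2e-308 rather than at e-300.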
I can get round this by specifying the column as character using the colClasses
argument, then coercing to numeric after the data has been read in. However it
would be better if fread could read the data in as numeric in the first place,
as read.table does (though much more slowly in my example).
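The workaround described above can be sketched as follows. It recreates a small file like the one in the example that follows, so the temporary file path and the object names are illustrative. The named-vector form of colClasses is an assumption about the installed data.table version; older releases may need the classes given for every column.

```r
library(data.table)

# Small test file: 17 rows, with two very small values stored as text.
dat <- data.frame(one = LETTERS[1:17], two = as.character(1:17),
                  stringsAsFactors = FALSE)
dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313")
path <- tempfile(fileext = ".txt")
write.table(dat, file = path, quote = FALSE, row.names = FALSE)

# Read column 'two' as character, so no type bump can happen mid-file.
res <- fread(path, colClasses = c(two = "character"))

# Convert the column to numeric by reference once the data is in memory.
res[, two := as.numeric(two)]

class(res$two)
```

Because the conversion is done by as.numeric after the read, the very small values are parsed by base R rather than by fread.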
Here is a simple example where the type is detected as numeric and then bumped
to character. (Which rows are used as the middle 5? They do not seem to be rows
7-11, as I would expect, since row 9 is not caught during type detection.)
> dat <- data.frame(one = LETTERS[1:17], two = 1:17)
> ## use strings here to replicate what I have in my data file
> dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313")
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)
...
Type codes: 32 (first 5 rows)
Type codes: 32 (+middle 5 rows)
Type codes: 32 (+last 5 rows)
Bumping column 2 from REAL to STR on data row 9, field contains '1.46761e-313'
...
Another example, where the type is detected as character from the first 5 rows:
> dat$two[1:2] <- c("3.34402e-299", "1.46761e-313")
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)
...
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
...
So aside from the issue of which rows are used for type detection, it does seem
that 3.34402e-299 is detected as numeric whilst 1.46761e-313 is detected as
character. Compare vs. read.table:
> tmp <- read.table("test.txt", header = TRUE)
> lapply(tmp, class)
$one
[1] "factor"
$two
[1] "numeric"
Best wishes,
Heather
---
Package: data.table
Version: 1.8.9
Maintainer: Matthew Dowle <[email protected]>
Built: R 3.0.1; x86_64-pc-linux-gnu; 2013-06-26 21:24:22 UTC; unix
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] data.table_1.8.9
loaded via a namespace (and not attached):
[1] compiler_3.0.1 tools_3.0.1