[datatable-help] fread coercion of very small number to character

Heather Turner Mon, 02 Sep 2013 07:52:42 -0700

Hello,

When reading a file with very small numbers in scientific notation, fread bumps 
the column type to "character":


> tmp <- fread(files[1], verbose = TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t'
Found 5 columns
First row with 5 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 188308
Subtracted 1 for last eol and any trailing empty lines, leaving 188307 data rows
Type codes: 33302 (first 5 rows)
Type codes: 33302 (+middle 5 rows)
Type codes: 33302 (+last 5 rows)
Bumping column 5 from REAL to STR on data row 361, field contains '1.46761e-313'
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.020s ( 13%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.020s ( 13%) Allocation of 188307x5 result (xMB) in RAM
   0.110s ( 73%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.150s        Total
Warning message:
In fread(files[1], verbose = TRUE) :
  Bumped column 5 to type character on data row 361, field contains 
'1.46761e-313'. Coercing previously read values in this column from integer or 
numeric back to character which may not be lossless; e.g., if '00' and '000' 
occurred before they will now be just '0', and there may be inconsistencies 
with treatment of ',,' and ',NA,' too (if they occurred in this column before 
the bump). If this matters please rerun and set 'colClasses' to 'character' for 
this column. Please note that column type detection uses the first 5 rows, the 
middle 5 rows and the last 5 rows, so hopefully this message should be very 
rare. If reporting to datatable-help, please rerun and include the output from 
verbose=TRUE.

Perhaps there is some cutoff at e-300, since the preceding number 
'3.34402e-299' is read in okay.
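
For reference, the two values fall on opposite sides of the smallest normalised 
double (.Machine$double.xmin, about 2.2e-308), so the value that fails is a 
subnormal (denormal) number. Whether this is what triggers the bump in fread is 
an assumption and is not confirmed here; the base R sketch below only shows 
where each value falls.

```r
# Values copied from the verbose output above, kept as strings as in the file.
small <- c(ok = "3.34402e-299", bumped = "1.46761e-313")
vals <- as.numeric(small)

# A subnormal double is non-zero but smaller in magnitude than
# .Machine$double.xmin, the smallest normalised double.
is_subnormal <- vals != 0 & abs(vals) < .Machine$double.xmin

print(data.frame(text = small, value = vals, subnormal = is_subnormal))
```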

I can get round this by specifying the column as character using the colClasses 
argument, then coercing to numeric after the data has been read in. However it 
would be better if fread could read the data in as numeric in the first place, 
as read.table does (though much more slowly in my example).
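
A minimal sketch of that workaround on a small file like the one in the example 
further down. The temporary file path and the object names (path, dt) are 
illustrative only, and the named form of colClasses is assumed to be available 
in the installed data.table version.

```r
library(data.table)

# Build a small test file; the two tiny values are written as strings.
dat <- data.frame(one = LETTERS[1:17], two = 1:17)
dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313")
path <- tempfile(fileext = ".txt")   # illustrative path, not from the post
write.table(dat, file = path, quote = FALSE, row.names = FALSE)

# Read column "two" as character so that no type bump is needed ...
dt <- fread(path, colClasses = c(one = "character", two = "character"))

# ... then convert it to numeric afterwards, by reference.
dt[, two := as.numeric(two)]
str(dt)
```

The conversion happens after reading, so it costs an extra pass over the 
column but avoids the lossy coercion described in the warning.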

A simple example where the type is detected as numeric and then bumped to 
character (which rows are used as the middle 5? It does not seem to be rows 
7-11, as I would expect...):

> dat <- data.frame(one = LETTERS[1:17], two = 1:17)
> ## use strings here to replicate what I have in my data file
> dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313") 
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)

...
Type codes: 32 (first 5 rows)
Type codes: 32 (+middle 5 rows)
Type codes: 32 (+last 5 rows)
Bumping column 2 from REAL to STR on data row 9, field contains '1.46761e-313'
...

Another example, where the type is detected as character from the first 5 rows:

> dat$two[1:2] <- c("3.34402e-299", "1.46761e-313") 
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)

...
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
...

So, aside from the issue of which rows are used for type detection, it does 
seem that 3.34402e-299 is detected as numeric whilst 1.46761e-313 is detected 
as character. Compare with read.table:

> tmp <- read.table("test.txt", header = TRUE)
> lapply(tmp, class)
$one
[1] "factor"

$two
[1] "numeric"

Best wishes,

Heather

---
Package: data.table
 Version: 1.8.9
 Maintainer: Matthew Dowle <[email protected]>
 Built: R 3.0.1; x86_64-pc-linux-gnu; 2013-06-26 21:24:22 UTC; unix

R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] data.table_1.8.9

loaded via a namespace (and not attached):
[1] compiler_3.0.1 tools_3.0.1

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
