Re: [datatable-help] New function fread() in v1.8.7

Matthew Dowle Fri, 28 Dec 2012 14:59:50 -0800


Or, 2 new functions :


    fread.table
    fread.csv

that would be what you expected. They would call fread first and do the modifications afterwards, such as converting character columns to factors, calling make.names on the names, etc. That way we don't clutter fread's argument list with arguments/options we only need for drop-in compatibility.

If a user wanted to drop them in to be picked up by existing code, they could just mask it themselves in .GlobalEnv by executing :


    read.table = fread.table

Since a user may want data.table just for fread, and not wish to change to data.table syntax, I suppose fread.table and fread.csv should return data.frame rather than data.table, too.
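
A rough sketch of what such a wrapper might look like follows. fread.csv is hypothetical here (it is only proposed above, not an existing function), and the behaviour shown is an assumption: read with fread, make the names syntactically valid, convert character columns to factors, and return a plain data.frame.

```r
library(data.table)

# Hypothetical wrapper: NOT part of data.table, just an illustration of the idea.
fread.csv <- function(file, ..., stringsAsFactors = TRUE) {
  DT <- fread(file, sep = ",", ...)

  # Mimic read.csv: make the column names syntactically valid and unique.
  setnames(DT, make.names(names(DT), unique = TRUE))

  # Return a plain data.frame rather than a data.table.
  DF <- as.data.frame(DT)

  # Mimic read.csv's default of converting character columns to factors.
  if (stringsAsFactors) {
    for (nm in names(DF)[vapply(DF, is.character, logical(1))]) {
      DF[[nm]] <- factor(DF[[nm]])
    }
  }
  DF
}
```

Existing code could then pick it up via the masking assignment shown above, e.g. read.csv = fread.csv.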


Just thinking out loud ...


On 28.12.2012 22:38, Matthew Dowle wrote:

It wasn't at the front of my mind to make it a drop in replacement.
Maybe it should be since it's not like data.table itself where a drop
in replacement for data.frame wasn't possible. If fread is supposed to
be a drop in replacement then it shouldn't output integer64 types,
shouldn't produce list columns for dual delimited files and
stringsAsFactors should be TRUE by default not FALSE, as well.

Perhaps an as.read.table=TRUE/FALSE option, then?


On 28.12.2012 22:21, Hideyoshi Maeda wrote:
No problem for the confirm…Thanks again for fixing it.

As for the file itself having "Date and Time", you are right….i just
assumed that this function was designed to replace/speed up the
read.csv function, i.e. work in exactly the same way but faster.
Thanks for letting me know about the make.names call though.
On 28 Dec 2012, at 22:06, Matthew Dowle <[email protected]> wrote:
Great. Thanks for confirm.
The file itself has "Date and Time" as the column name, doesn't it, i.e. with spaces not dots? fread retains exactly what's in the file, whereas read.csv runs the column names through base::make.names() which converts the spaces to dots to make the column names syntactically valid, iiuc. data.table's general policy is to allow spaces and other unusual characters in column names and retain them throughout (forgiving the odd bug now fixed caused by some make.names calls which should have been make.unique).
To do the same as read.csv :

   DT = fread(...)
   setnames(DT,make.names(names(DT)))

Not sure I understood correctly and I didn't test.
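
For illustration, the difference between the two base functions mentioned above can be seen on a small vector of names. This uses base R only; the example names are made up, with a duplicate added to show the uniqueness handling.

```r
# Example column names (made up): one with spaces, one duplicated.
nms <- c("Date and Time", "Open", "Open")

# make.names() replaces invalid characters (here the spaces) with dots.
valid <- make.names(nms)

# With unique = TRUE it also de-duplicates.
valid_unique <- make.names(nms, unique = TRUE)

# make.unique() only de-duplicates and leaves the spaces alone.
unique_only <- make.unique(nms)

print(valid)         # "Date.and.Time" "Open" "Open"
print(valid_unique)  # "Date.and.Time" "Open" "Open.1"
print(unique_only)   # "Date and Time" "Open" "Open.1"
```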


On 28.12.2012 21:36, Hideyoshi Maeda wrote:
The sep argument now works thank you!
But just out of curiosity… not a major problem, but by using fread(file.path,sep=",") on my csv file, the column names include "."
as shown in my original email… but the output result automatically
removes the "." in the column name… is there a way to stop it from
doing that? i.e. the first column becomes "Date and Time" when using fread, rather than the original "Date.and.Time" when using read.csv
On 26 Dec 2012, at 22:21, Matthew Dowle <[email protected]> wrote:
sep is now passed through and have added your example as a test.
Hope ok now.

Thanks,
Matthew

On 24.12.2012 14:18, Hideyoshi Maeda wrote:
using autostart=1 gives the following error

Error in fread(file.path, autostart = 1) :
' ends field 2 on line 1 when detecting types: Date and
Time,Open,High,Low,Close,Volume
2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
On 24 Dec 2012, at 13:48, Matthew Dowle <[email protected]> wrote:
Yes, autostart is the line on which it detects separators; then it searches upwards to find the first row with the same number of columns. If that row is all character then it deems that as the column name row. So if you start autostart on 1, it's already at the top and it might catch the right separator by avoiding the data rows for separator detection.
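
A toy R illustration of the heuristic just described, for what it's worth. This is not fread's C code: find_header_row and its arguments are made up, the separator is passed in rather than detected, and "all character" is approximated as "no field converts to a number".

```r
# Toy sketch only: hypothetical helper, not fread's implementation.
find_header_row <- function(lines, autostart, sep = ",") {
  count_fields <- function(x) length(strsplit(x, sep, fixed = TRUE)[[1]])

  # Number of columns on the autostart line.
  n <- count_fields(lines[autostart])

  # Search upwards for the first row with the same number of columns.
  first <- autostart
  while (first > 1 && count_fields(lines[first - 1]) == n) {
    first <- first - 1
  }

  # If no field on that row looks numeric, treat it as the column name row.
  fields <- strsplit(lines[first], sep, fixed = TRUE)[[1]]
  is_header <- all(is.na(suppressWarnings(as.numeric(fields))))

  list(first_row = first, ncol = n, is_header = is_header)
}

lines <- c("Date and Time,Open,High,Low,Close,Volume",
           "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
           "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17")

res_comma <- find_header_row(lines, autostart = 3, sep = ",")
res_slash <- find_header_row(lines, autostart = 3, sep = "/")
```

With "," the sketch finds 6 columns and takes line 1 as the header; with "/" it finds only 3 columns on the data rows, which is the kind of wrong split reported in this thread.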
On 24.12.2012 11:52, Hideyoshi Maeda wrote:
Thanks for the quick response.
I wasn't sure if I understood you correctly, but isn't the problem
the way that autostart finds separators?
and in my example, it had headers, so I think it would need to start from row 2 wouldn't it, i.e. the first row that has non-header values?
Thanks
On 24 Dec 2012, at 11:44, Matthew Dowle <[email protected]> wrote:
Hi,
Ah yes, haven't hooked up the sep override yet, apologies, will fix.
Maybe setting autostart to the row number of the header row (probably 1)
might work.

Thanks,
Matthew


On 24.12.2012 11:08, Hideyoshi Maeda wrote:
oups… forgot to add the output from the verbose part… here it is...
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Starting format detection on line 30 (the last non blank line in the first 30)
Detected sep as '/' and 3 columns
Type codes: 003
Found first row with 3 fields occuring on line 1 (either column names or first row of data)
The first data row has some non character fields. Treating as a data row and using default column names.
Count of eol after pos: 1143699
Subtracted 1 for last eol and any trailing empty lines, leaving 1143698 data rows
0.153s ( 21%) Memory map (quicker if you rerun)
0.000s (  0%) Format detection
0.095s ( 13%) Count rows (wc -l)
0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
0.480s ( 66%) Reading data
0.000s (  0%) Bumping column type midread and coercing data already read
0.002s (  0%) Changing na.strings to NA
0.731s        Total
On 24 Dec 2012, at 11:04, Hideyoshi Maeda <[email protected]> wrote:
Hi Matthew,
I am using the new `data.table` `fread()` function to read my csv files, which have the following format when read with the read.csv function
       Date.and.Time Open High  Low Close Volume
1 2007/01/01 22:51:00 5683 5683 5673  5673     64
2 2007/01/01 22:52:00 5675 5676 5674  5674     17
3 2007/01/01 22:53:00 5674 5674 5673  5674     42
The value of the first column is all of: `2007/01/01 22:53:00`, the next 5 columns are separated with commas.
But when reading the same file using fread I get the following output
   V1 V2                                             V3
1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
This is because the autodetect is using the "/" as a separator...
I tried overriding this using the `sep=","` argument but this does not seem to be used in the function anywhere.
Furthermore, when using verbose I get the following output, which suggests that I was right in thinking that "/" is used as a separator rather than ",".
Is there any way to fix this, so that it correctly reads all 6 columns separately?
Thanks

HLM
On 21 Dec 2012, at 18:28, Matthew Dowle <[email protected]> wrote:
Hi datatablers,

Feedback and bug reports much appreciated :

=====
New function fread(), a fast and friendly file reader.
* header, skip, nrows, sep and colClasses are all autodetected.
* integers>2^31 are detected and read natively as bit64::integer64.
* accepts filenames, URLs and "A,B\n1,2\n3,4" directly
* new implementation entirely in C
* with a 50MB .csv, 1 million rows x 6 columns :
    read.csv("test.csv")                                    # 30-60 sec
    read.table("test.csv",<all known tricks, known nrows>)  # 10 sec
    fread("test.csv")                                       # 3 sec
* airline data: 658MB csv (7 million rows x 29 columns)
    read.table("2008.csv",<all known tricks, known nrows>)  # 360 sec
    fread("2008.csv")                                       # 50 sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas, discussions and beta testing.
=====
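
As a small illustration of the "directly" bullet above, the literal string from that bullet can be passed straight to fread. This is only a sketch and assumes data.table is installed; the variable name DT is arbitrary.

```r
library(data.table)

# The input contains newlines, so it is read as data rather than as a filename.
DT <- fread("A,B\n1,2\n3,4")
print(DT)
```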
1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
install.packages("data.table",repos="http://R-Forge.R-project.org")
require(data.table)
?fread
fread("your biggest baddest file")
Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather than -O3 (but -O3 on Win32 ok), so speedups might not be as great on Win64 until that can be resolved on R-Forge, unless you compile yourself. -O3 has some optimizations that fread may benefit from. But interested to hear.
Season's greetings!

Matthew


_______________________________________________
datatable-help mailing list
[email protected]

https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

