Re: [datatable-help] New function fread() in v1.8.7

Matthew Dowle Mon, 24 Dec 2012 05:48:23 -0800

Yes, autostart is the line on which it detects separators; then it searches upwards to find the first row with the same number of columns. If that row is all character then it deems that to be the column name row. So if you start autostart on 1, it's already at the top and it might catch the right separator by avoiding the data rows for separator detection.
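
The upward search described above can be sketched in base R. This is only an illustration of the heuristic as worded in this message, not fread's C implementation: `find_header` is a hypothetical helper, `sep` stands in for whatever separator was detected on the autostart line, and the header text in the sample is assumed because the raw file was not posted.

```r
# Hypothetical helper for illustration; this is NOT fread's C implementation.
# `sep` stands in for the separator detected on the `autostart` line.
find_header <- function(lines, autostart, sep) {
  split_row <- function(i) strsplit(lines[i], sep, fixed = TRUE)[[1]]
  start <- min(autostart, length(lines))
  ncols <- length(split_row(start))
  # Walk upwards while the row above has the same number of fields
  # (assumption: the search stops at the first row that differs).
  top <- start
  while (top > 1L && length(split_row(top - 1L)) == ncols) top <- top - 1L
  # The top row counts as column names only if no field parses as a number.
  is_header <- all(is.na(suppressWarnings(as.numeric(split_row(top)))))
  list(ncols = ncols, top_row = top, header = is_header)
}

# Sample rows in the shape reported in this thread; the header text is assumed.
lines <- c(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17"
)

slash <- find_header(lines, autostart = 3L, sep = "/")  # the misdetected separator
comma <- find_header(lines, autostart = 3L, sep = ",")  # the intended separator
```

With "/" the sketch finds 3 columns and no header row, which is consistent with the V1, V2, V3 default names reported below; with "," it finds 6 columns and treats line 1 as the column names.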


On 24.12.2012 11:52, Hideyoshi Maeda wrote:

Thanks for the quick response.

I wasn't sure if I understood you correctly, but isn't the problem
the way that autostart finds separators?

and in my example, it had headers, so I think it would need to start
from row 2 wouldn't it, i.e. the first row that has non-header values?
Thanks
On 24 Dec 2012, at 11:44, Matthew Dowle <[email protected]> wrote:
Hi,

Ah yes, haven't hooked up the sep override yet, apologies, will fix.
Maybe setting autostart to the row number of the header row (probably 1)
might work.

Thanks,
Matthew
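
For reference, a minimal sketch of what the sep override would look like once it is hooked up. It assumes a data.table release in which `fread()` honours an explicit `sep` (per the message above, 1.8.7 does not yet), and the inline sample, including its header text, is assumed rather than taken from the real file.

```r
library(data.table)

# Inline sample mirroring the rows posted in this thread; header text is assumed.
txt <- paste(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  sep = "\n"
)

# Assumption: this data.table version honours `sep`, so "/" is never considered.
dt <- fread(txt, sep = ",")
ncol(dt)
```

Passing the separator explicitly sidesteps detection altogether, so the slashes inside the date column no longer matter.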


On 24.12.2012 11:08, Hideyoshi Maeda wrote:
Oops… forgot to add the output from the verbose part… here it is...

Detected eol as \r\n (CRLF) in that order, the Windows standard.
Starting format detection on line 30 (the last non blank line in the
first 30)
Detected sep as '/' and 3 columns
Type codes: 003
Found first row with 3 fields occuring on line 1 (either column names
or first row of data)
The first data row has some non character fields. Treating as a data
row and using default column names.
Count of eol after pos: 1143699
Subtracted 1 for last eol and any trailing empty lines, leaving
1143698 data rows
  0.153s ( 21%) Memory map (quicker if you rerun)
  0.000s (  0%) Format detection
  0.095s ( 13%) Count rows (wc -l)
  0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
  0.480s ( 66%) Reading data
  0.000s (  0%) Bumping column type midread and coercing data already read
  0.002s (  0%) Changing na.strings to NA
  0.731s        Total
On 24 Dec 2012, at 11:04, Hideyoshi Maeda <[email protected]> wrote:
Hi Matthew,
I am using the new `data.table` `fread()` function to read my csv files, which have the following format when read with the read.csv function:
          Date.and.Time Open High  Low Close Volume
  1 2007/01/01 22:51:00 5683 5683 5673  5673     64
  2 2007/01/01 22:52:00 5675 5676 5674  5674     17
  3 2007/01/01 22:53:00 5674 5674 5673  5674     42
The first column holds the whole of `2007/01/01 22:53:00`; the next 5 columns are separated by commas.
But when reading the same file using fread I get the following output:
      V1 V2                                             V3
  1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
  2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
  3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42

This is because the autodetect is using "/" as the separator...
I tried overriding this using the `sep=","` argument but this does not seem to be used in the function anywhere.
Furthermore, when using verbose I get the following output, which suggests that I was right in thinking that "/" is used as a separator rather than ",".
Is there any way to fix this, so that it correctly reads all 6 columns separately?
Thanks

HLM
On 21 Dec 2012, at 18:28, Matthew Dowle <[email protected]> wrote:
Hi datatablers,

Feedback and bug reports much appreciated :

=====
New function fread(), a fast and friendly file reader.
* header, skip, nrows, sep and colClasses are all auto detected.
* integers>2^31 are detected and read natively as bit64::integer64.
* accepts filenames, URLs and "A,B\n1,2\n3,4" directly
* new implementation entirely in C
* with a 50MB .csv, 1 million rows x 6 columns :
    read.csv("test.csv")                                     # 30-60sec
    read.table("test.csv",<all known tricks, known nrows>)   # 10sec
    fread("test.csv")                                        # 3sec
* airline data: 658MB csv (7 million rows x 29 columns)
    read.table("2008.csv",<all known tricks, known nrows>)   # 360sec
    fread("2008.csv")                                        # 50sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
discussions and beta testing.
=====
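
As a quick way to try the auto detection listed above without a file on disk, the literal-string form from the feature list can be passed straight to `fread()`. This is a minimal sketch that assumes data.table is installed and that the installed version treats an input containing "\n" as data rather than a filename.

```r
library(data.table)

# The literal input quoted in the feature list: header row A,B plus two data rows.
dt <- fread("A,B\n1,2\n3,4")
names(dt)
nrow(dt)
```

Adding `verbose = TRUE` to the call prints the detection steps, in the same style as the verbose output quoted earlier in this thread.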

1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
?fread
fread("your biggest baddest file")
Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather than -O3 (but -O3 on Win32 ok), so speedups might not be as great on Win64 until that can be resolved on R-Forge, unless you compile yourself. -O3 has some optimizations that fread may benefit from. But interested to hear.
Season's greetings!

Matthew


_______________________________________________
datatable-help mailing list
[email protected]

https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

