Using autostart=1 gives the following error:

Error in fread(file.path, autostart = 1) :
  ' ends field 2 on line 1 when detecting types: Date and Time,Open,High,Low,Close,Volume
2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64

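
To make the explanation quoted below easier to follow, here is a minimal base R sketch of the idea as described in this thread: count fields on the autostart line, walk upwards while the field count stays the same, then check whether the top row is all character. This is not fread's actual C code. The helper names (`count_fields`, `first_matching_row`, `looks_like_header`) and the list of candidate separators are assumptions made for illustration only.

```r
# Illustrative sketch only; fread's real detection is implemented in C and
# may differ. All helper names below are hypothetical.

# Number of fields a line would have under each candidate separator.
count_fields <- function(line, seps = c(",", "/", "\t", ";", " ")) {
  chars <- strsplit(line, "", fixed = TRUE)[[1]]
  vapply(seps, function(s) sum(chars == s) + 1L, integer(1))
}

# Starting at `autostart`, walk upwards while the previous line has the
# same number of fields for the given separator.
first_matching_row <- function(lines, autostart, sep) {
  n <- count_fields(lines[autostart], sep)
  i <- autostart
  while (i > 1L && count_fields(lines[i - 1L], sep) == n) {
    i <- i - 1L
  }
  i
}

# A row is treated as column names if no field parses as a number.
looks_like_header <- function(line, sep) {
  fields <- strsplit(line, sep, fixed = TRUE)[[1]]
  all(is.na(suppressWarnings(as.numeric(fields))))
}

header   <- "Date and Time,Open,High,Low,Close,Volume"
data_row <- "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64"
lines    <- c(header, data_row, data_row, data_row)

count_fields(header)                  # field counts on the header line
count_fields(data_row)                # field counts on a data line
first_matching_row(lines, 4L, ",")    # top-most row with the same comma count
looks_like_header(header, ",")        # is that row all character?
```

With these sample lines, the header has 6 fields under "," but only 1 under "/", while a data row has 3 fields under "/". That is why pointing the detection at the header line rather than at a data line could steer it away from "/".
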
On 24 Dec 2012, at 13:48, Matthew Dowle wrote:

>
> Yes, autostart is the line where it detects separators, then it searches
> upwards to find the first row with the same number of columns. If that row
> is all character then it deems that as the column name row. So if you start
> autostart on 1, it's already at the top and it might catch the right
> separator by avoiding the data rows for separator detection.
>
> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>> Thanks for the quick response.
>>
>> I wasn't sure if I understood you correctly, but isn't the problem
>> the way that autostart finds separators?
>>
>> And in my example, it had headers, so I think it would need to start
>> from row 2, wouldn't it, i.e. the first row that has non-header values?
>>
>> Thanks
>>
>> On 24 Dec 2012, at 11:44, Matthew Dowle wrote:
>>
>>>
>>> Hi,
>>>
>>> Ah yes, haven't hooked up the sep override yet, apologies, will fix.
>>> Maybe setting autostart to the row number of the header row (probably 1)
>>> might work.
>>>
>>> Thanks,
>>> Matthew
>>>
>>>
>>> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>>>> Oops… forgot to add the output from the verbose part… here it is...
>>>>
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Starting format detection on line 30 (the last non blank line in the
>>>> first 30)
>>>> Detected sep as '/' and 3 columns
>>>> Type codes: 003
>>>> Found first row with 3 fields occuring on line 1 (either column names
>>>> or first row of data)
>>>> The first data row has some non character fields. Treating as a data
>>>> row and using default column names.
>>>> Count of eol after pos: 1143699
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving
>>>> 1143698 data rows
>>>> 0.153s ( 21%) Memory map (quicker if you rerun)
>>>> 0.000s (  0%) Format detection
>>>> 0.095s ( 13%) Count rows (wc -l)
>>>> 0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
>>>> 0.480s ( 66%) Reading data
>>>> 0.000s (  0%) Bumping column type midread and coercing data already read
>>>> 0.002s (  0%) Changing na.strings to NA
>>>> 0.731s        Total
>>>>
>>>>
>>>> On 24 Dec 2012, at 11:04, Hideyoshi Maeda wrote:
>>>>
>>>>> Hi Matthew,
>>>>>
>>>>> I am using the new `data.table` `fread()` function to read my csv files,
>>>>> which have the following format when read with the read.csv function:
>>>>>
>>>>>         Date.and.Time Open High  Low Close Volume
>>>>> 1 2007/01/01 22:51:00 5683 5683 5673  5673     64
>>>>> 2 2007/01/01 22:52:00 5675 5676 5674  5674     17
>>>>> 3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>>>>>
>>>>> The value of the first column is all of: `2007/01/01 22:53:00`, the next
>>>>> 5 columns are separated with commas.
>>>>>
>>>>> But when reading the same file using fread I get the following output:
>>>>>
>>>>>     V1 V2                                            V3
>>>>> 1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>> 2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
>>>>> 3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>>>>>
>>>>> This is because the autodetect is using the "/" as a separator...
>>>>>
>>>>> I tried overriding this using the `sep=","` argument but this does not
>>>>> seem to be used in the function anywhere.
>>>>>
>>>>> Furthermore, when using verbose I get the following output, which suggests
>>>>> that I was right in thinking that "/" is used as a separator rather than
>>>>> ",".
>>>>>
>>>>> Is there any way to fix this, so that it correctly reads all 6 columns
>>>>> separately?
>>>>>
>>>>> Thanks
>>>>>
>>>>> HLM
>>>>>
>>>>> On 21 Dec 2012, at 18:28, Matthew Dowle wrote:
>>>>>
>>>>>>
>>>>>> Hi datatablers,
>>>>>>
>>>>>> Feedback and bug reports much appreciated :
>>>>>>
>>>>>> =====
>>>>>> New function fread(), a fast and friendly file reader.
>>>>>> * header, skip, nrows, sep and colClasses are all auto detected.
>>>>>> * integers>2^31 are detected and read natively as bit64::integer64.
>>>>>> * accepts filenames, URLs and "A,B\n1,2\n3,4" directly
>>>>>> * new implementation entirely in C
>>>>>> * with a 50MB .csv, 1 million rows x 6 columns :
>>>>>>     read.csv("test.csv")                                    # 30-60 sec
>>>>>>     read.table("test.csv",<all known tricks, known nrows>)  # 10 sec
>>>>>>     fread("test.csv")                                       # 3 sec
>>>>>> * airline data: 658MB csv (7 million rows x 29 columns)
>>>>>>     read.table("2008.csv",<all known tricks, known nrows>)  # 360 sec
>>>>>>     fread("2008.csv")                                       # 50 sec
>>>>>> See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
>>>>>> discussions and beta testing.
>>>>>> =====
>>>>>>
>>>>>> 1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
>>>>>>
>>>>>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>>>>>> require(data.table)
>>>>>> ?fread
>>>>>> fread("your biggest baddest file")
>>>>>>
>>>>>> Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather
>>>>>> than -O3 (but -O3 on Win32 ok), so speedups might not be as great on
>>>>>> Win64 until that can be resolved on R-Forge, unless you compile yourself.
>>>>>> -O3 has some optimizations that fread may benefit from. But interested to
>>>>>> hear.
>>>>>>
>>>>>> Season's greetings!
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>
>
