Re: [datatable-help] New function fread() in v1.8.7

Hideyoshi Maeda Mon, 24 Dec 2012 06:18:42 -0800

Using autostart=1 gives the following error:

Error in fread(file.path, autostart = 1) : 
' ends field 2 on line 1 when detecting types: Date and Time,Open,High,Low,Close,Volume
2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64


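While the sep override is not yet honoured by fread, one possible fallback is base R's read.csv, which splits on "," by default. The following is a minimal sketch, not a fix for fread itself. The inline `csv_text` is a made-up stand-in built from the sample rows quoted in this thread, and the conversion step assumes the data.table package is installed.

```r
# Sketch of a fallback, not a fix for fread itself.
# csv_text is a hypothetical stand-in for the real file, built from the
# sample rows quoted in this thread; for a file on disk, pass the file
# name to read.csv() instead of using text=.
csv_text <- paste(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  "2007/01/01 22:53:00,5674.00,5674.00,5673.00,5674.00,42",
  sep = "\n"
)

# read.csv splits on "," by default, so the "/" inside the dates is left alone.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert to a data.table only if the package is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
}

str(df)
```

This gives up fread's speed (the announcement quotes 30-60 sec for read.csv against 3 sec for fread on a 50MB file), so it is only a stopgap.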

On 24 Dec 2012, at 13:48, Matthew Dowle wrote:

> 
> Yes, autostart is the line on which it detects the separator; it then
> searches upwards to find the first row with the same number of columns. If
> that row is all character then it deems that to be the column name row. So
> if you set autostart to 1, it's already at the top, and it might catch the
> right separator because the data rows are avoided for separator detection.
> 
> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>> Thanks for the quick response.
>> 
>> I wasn't sure if I understood you correctly, but isn't the problem
>> the way that autostart finds separators?
>> 
>> Also, my example has headers, so I think it would need to start
>> from row 2, wouldn't it, i.e. the first row that has non-header values?
>> 
>> Thanks
>> 
>> On 24 Dec 2012, at 11:44, Matthew Dowle wrote:
>> 
>>> 
>>> Hi,
>>> 
>>> Ah yes, I haven't hooked up the sep override yet; apologies, will fix.
>>> In the meantime, setting autostart to the row number of the header row
>>> (probably 1) might work.
>>> 
>>> Thanks,
>>> Matthew
>>> 
>>> 
>>> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>>>> Oops, forgot to add the output from the verbose part. Here it is:
>>>> 
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Starting format detection on line 30 (the last non blank line in the
>>>> first 30)
>>>> Detected sep as '/' and 3 columns
>>>> Type codes: 003
>>>> Found first row with 3 fields occuring on line 1 (either column names
>>>> or first row of data)
>>>> The first data row has some non character fields. Treating as a data
>>>> row and using default column names.
>>>> Count of eol after pos: 1143699
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving
>>>> 1143698 data rows
>>>>  0.153s ( 21%) Memory map (quicker if you rerun)
>>>>  0.000s (  0%) Format detection
>>>>  0.095s ( 13%) Count rows (wc -l)
>>>>  0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
>>>>  0.480s ( 66%) Reading data
>>>>  0.000s (  0%) Bumping column type midread and coercing data already read
>>>>  0.002s (  0%) Changing na.strings to NA
>>>>  0.731s        Total
>>>> 
>>>> 
>>>> On 24 Dec 2012, at 11:04, Hideyoshi Maeda wrote:
>>>> 
>>>>> Hi Matthew,
>>>>> 
>>>>> I am using the new `data.table` `fread()` function to read my CSV file, 
>>>>> which has the following format when read with the read.csv function:
>>>>> 
>>>>>          Date.and.Time Open High  Low Close Volume
>>>>>  1 2007/01/01 22:51:00 5683 5683 5673  5673     64
>>>>>  2 2007/01/01 22:52:00 5675 5676 5674  5674     17
>>>>>  3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>>>>> 
>>>>> The first column holds the whole of a value such as `2007/01/01 22:53:00`; 
>>>>> the next 5 columns are separated by commas.
>>>>> 
>>>>> But when reading the same file using fread I get the following output:
>>>>> 
>>>>>      V1 V2                                             V3
>>>>>  1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>>  2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
>>>>>  3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>>>>> 
>>>>> This is because the autodetect is using the "/" as a separator...
>>>>> 
>>>>> I tried overriding this using the `sep=","` argument but this does not 
>>>>> seem to be used in the function anywhere.
>>>>> 
>>>>> Furthermore, when using verbose I get the following output, which suggests 
>>>>> that I was right in thinking that "/" is used as a separator rather than 
>>>>> ",".
>>>>> 
>>>>> Is there any way to fix this, so that it correctly reads all 6 columns 
>>>>> separately?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> HLM
>>>>> 
>>>>> On 21 Dec 2012, at 18:28, Matthew Dowle wrote:
>>>>> 
>>>>>> 
>>>>>> Hi datatablers,
>>>>>> 
>>>>>> Feedback and bug reports much appreciated :
>>>>>> 
>>>>>> =====
>>>>>> New function fread(), a fast and friendly file reader.
>>>>>> * header, skip, nrows, sep and colClasses are all auto detected.
>>>>>> * integers>2^31 are detected and read natively as bit64::integer64.
>>>>>> * accepts filenames, URLs and "A,B\n1,2\n3,4" directly
>>>>>> * new implementation entirely in C
>>>>>> * with a 50MB .csv, 1 million rows x 6 columns :
>>>>>> read.csv("test.csv")                                   # 30-60 sec
>>>>>> read.table("test.csv",<all known tricks, known nrows>) #    10 sec
>>>>>> fread("test.csv")                                      #     3 sec
>>>>>> * airline data: 658MB csv (7 million rows x 29 columns)
>>>>>> read.table("2008.csv",<all known tricks, known nrows>) #   360 sec
>>>>>> fread("2008.csv")                                      #    50 sec
>>>>>> See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
>>>>>> discussions and beta testing.
>>>>>> =====
>>>>>> 
>>>>>> 1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
>>>>>> 
>>>>>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>>>>>> require(data.table)
>>>>>> ?fread
>>>>>> fread("your biggest baddest file")
>>>>>> 
>>>>>> Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather
>>>>>> than -O3 (but -O3 on Win32 ok), so speedups might not be as great on Win64
>>>>>> until that can be resolved on R-Forge, unless you compile yourself. -O3
>>>>>> has some optimizations that fread may benefit from. But interested to hear.
>>>>>> 
>>>>>> Season's greetings!
>>>>>> 
>>>>>> Matthew
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> [email protected]
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>> 
>>> 
> 

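To see how a data row from this thread can come out as sep '/' with 3 columns, here is a toy sketch in R. It is not fread's actual C implementation. It assumes, purely for illustration, that a detector takes the first character on the line that belongs to a list of candidate separators; `first_sep`, `count_fields` and the candidate list are hypothetical names and choices, not part of data.table.

```r
# Toy illustration only; fread's real detection is written in C and may differ.
data_line   <- "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64"
header_line <- "Date and Time,Open,High,Low,Close,Volume"

# Assumption: the separator is the first candidate character met on the line.
first_sep <- function(line, candidates = c(",", "/", ";", "|", "\t")) {
  chars <- strsplit(line, "", fixed = TRUE)[[1]]
  hits  <- chars[chars %in% candidates]
  if (length(hits) == 0L) NA_character_ else hits[1L]
}

# Number of fields a line splits into for a given separator.
count_fields <- function(line, sep) {
  length(strsplit(line, sep, fixed = TRUE)[[1]])
}

first_sep(data_line)            # "/" comes before any "," on a data row
count_fields(data_line, "/")    # 3 fields, as in the verbose output
first_sep(header_line)          # "," because the header contains no "/"
count_fields(header_line, ",")  # 6 fields, the intended column count
```

Under that assumption, detecting on a data row picks up the "/" in the date, while detecting on the header row would see only commas, which is the idea behind trying autostart on the header line.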
