Re: [datatable-help] New function fread() in v1.8.7

Hideyoshi Maeda Fri, 28 Dec 2012 14:21:26 -0800

No problem confirming. Thanks again for fixing it.

As for the file itself having "Date and Time", you are right. I just assumed 
that this function was designed to replace/speed up the read.csv function, i.e. 
to work in exactly the same way but faster. Thanks for letting me know about the 
make.names call though.
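
For anyone comparing the two readers, here is a minimal sketch of the naming difference using base R only. It does not call fread, and the two-line sample text is an assumption, shortened from the file header quoted below:

```r
# Sample text mimicking the header and first row of the file (assumed, shortened)
txt <- "Date and Time,Open\n2007/01/01 22:51:00,5683.00\n"

# Default read.csv: check.names = TRUE runs the names through make.names()
names_checked <- names(read.csv(text = txt))

# check.names = FALSE keeps the header exactly as written in the file
names_raw <- names(read.csv(text = txt, check.names = FALSE))

# Applying make.names() by hand reproduces the read.csv default
names_manual <- make.names(names_raw)

print(names_checked)
print(names_raw)
```

The last step is the same idea as the setnames(DT,make.names(names(DT))) suggestion quoted below, just applied to a plain character vector.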




On 28 Dec 2012, at 22:06, Matthew Dowle <[email protected]> wrote:

> 
> Great. Thanks for confirming.
> 
> The file itself has "Date and Time" as the column name, doesn't it, i.e. with 
> spaces, not dots? fread retains exactly what's in the file, whereas read.csv 
> runs the column names through base::make.names(), which converts the spaces to 
> dots to make the column names syntactically valid, iiuc. data.table's general 
> policy is to allow spaces and other unusual characters in column names and 
> retain them throughout (forgiving the odd bug, now fixed, caused by some 
> make.names calls which should have been make.unique).
> 
> To do the same as read.csv :
> 
>    DT = fread(...)
>    setnames(DT,make.names(names(DT)))
> 
> Not sure I understood correctly and I didn't test.
> 
> 
> On 28.12.2012 21:36, Hideyoshi Maeda wrote:
>> The sep argument now works, thank you!
>> 
>> Just out of curiosity (not a major problem): with read.csv the column
>> names of my csv file include ".", as shown in my original email, but
>> fread(file.path,sep=",") automatically removes the "." from the column
>> names. Is there a way to stop it from doing that? I.e. the first column
>> becomes "Date and Time" when using fread, rather than the original
>> "Date.and.Time" when using read.csv.
>> 
>> 
>> On 26 Dec 2012, at 22:21, Matthew Dowle <[email protected]> wrote:
>> 
>>> 
>>> sep is now passed through, and I have added your example as a test.
>>> Hope it's ok now.
>>> 
>>> Thanks,
>>> Matthew
>>> 
>>> On 24.12.2012 14:18, Hideyoshi Maeda wrote:
>>>> Using autostart=1 gives the following error:
>>>> 
>>>> Error in fread(file.path, autostart = 1) :
>>>> ' ends field 2 on line 1 when detecting types: Date and
>>>> Time,Open,High,Low,Close,Volume
>>>> 2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>> 
>>>> 
>>>> On 24 Dec 2012, at 13:48, Matthew Dowle <[email protected]> wrote:
>>>> 
>>>>> 
>>>>> Yes, autostart is the line on which it detects separators; it then 
>>>>> searches upwards to find the first row with the same number of columns. 
>>>>> If that row is all character, it deems that to be the column name row. 
>>>>> So if you set autostart to 1, it's already at the top, and it might catch 
>>>>> the right separator by avoiding the data rows for separator detection.
>>>>> 
>>>>> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>>>>>> Thanks for the quick response.
>>>>>> 
>>>>>> I wasn't sure if I understood you correctly, but isn't the problem
>>>>>> the way that autostart finds separators?
>>>>>> 
>>>>>> Also, my example has headers, so I think it would need to start
>>>>>> from row 2, wouldn't it, i.e. the first row that has non-header values?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> On 24 Dec 2012, at 11:44, Matthew Dowle <[email protected]> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Ah yes, I haven't hooked up the sep override yet, apologies, will fix.
>>>>>>> Setting autostart to the row number of the header row (probably 1)
>>>>>>> might work.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Matthew
>>>>>>> 
>>>>>>> 
>>>>>>> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>>>>>>>> Oops, forgot to add the output from the verbose part. Here it is:
>>>>>>>> 
>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>> Starting format detection on line 30 (the last non blank line in the
>>>>>>>> first 30)
>>>>>>>> Detected sep as '/' and 3 columns
>>>>>>>> Type codes: 003
>>>>>>>> Found first row with 3 fields occuring on line 1 (either column names
>>>>>>>> or first row of data)
>>>>>>>> The first data row has some non character fields. Treating as a data
>>>>>>>> row and using default column names.
>>>>>>>> Count of eol after pos: 1143699
>>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving
>>>>>>>> 1143698 data rows
>>>>>>>> 0.153s ( 21%) Memory map (quicker if you rerun)
>>>>>>>> 0.000s (  0%) Format detection
>>>>>>>> 0.095s ( 13%) Count rows (wc -l)
>>>>>>>> 0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
>>>>>>>> 0.480s ( 66%) Reading data
>>>>>>>> 0.000s (  0%) Bumping column type midread and coercing data already 
>>>>>>>> read
>>>>>>>> 0.002s (  0%) Changing na.strings to NA
>>>>>>>> 0.731s        Total
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 24 Dec 2012, at 11:04, Hideyoshi Maeda <[email protected]> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Matthew,
>>>>>>>>> 
>>>>>>>>> I am using the new `data.table` `fread()` function to read my csv 
>>>>>>>>> files, which have the following format when read with the read.csv 
>>>>>>>>> function:
>>>>>>>>> 
>>>>>>>>>        Date.and.Time Open High  Low Close Volume
>>>>>>>>> 1 2007/01/01 22:51:00 5683 5683 5673  5673     64
>>>>>>>>> 2 2007/01/01 22:52:00 5675 5676 5674  5674     17
>>>>>>>>> 3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>>>>>>>>> 
>>>>>>>>> The value of the first column is all of `2007/01/01 22:53:00`; the 
>>>>>>>>> next 5 columns are separated with commas.
>>>>>>>>> 
>>>>>>>>> But when reading the same file using fread, I get the following output:
>>>>>>>>> 
>>>>>>>>>    V1 V2                                             V3
>>>>>>>>> 1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>>>>>> 2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
>>>>>>>>> 3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>>>>>>>>> 
>>>>>>>>> This is because the autodetect is using the "/" as a separator...
>>>>>>>>> 
>>>>>>>>> I tried overriding this using the `sep=","` argument but this does 
>>>>>>>>> not seem to be used in the function anywhere.
>>>>>>>>> 
>>>>>>>>> Furthermore, when using verbose I get the following output, which 
>>>>>>>>> suggests that I was right in thinking that "/" is used as a separator 
>>>>>>>>> rather than ",".
>>>>>>>>> 
>>>>>>>>> Is there any way to fix this, so that it correctly reads all 6 
>>>>>>>>> columns separately?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> HLM
>>>>>>>>> 
>>>>>>>>> On 21 Dec 2012, at 18:28, Matthew Dowle <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi datatablers,
>>>>>>>>>> 
>>>>>>>>>> Feedback and bug reports much appreciated :
>>>>>>>>>> 
>>>>>>>>>> =====
>>>>>>>>>> New function fread(), a fast and friendly file reader.
>>>>>>>>>> * header, skip, nrows, sep and colClasses are all auto detected.
>>>>>>>>>> * integers>2^31 are detected and read natively as bit64::integer64.
>>>>>>>>>> * accepts filenames, URLs and "A,B\n1,2\n3,4" directly
>>>>>>>>>> * new implementation entirely in C
>>>>>>>>>> * with a 50MB .csv, 1 million rows x 6 columns :
>>>>>>>>>> read.csv("test.csv")                                   # 30-60 sec
>>>>>>>>>> read.table("test.csv",<all known tricks, known nrows>) #    10 sec
>>>>>>>>>> fread("test.csv")                                      #     3 sec
>>>>>>>>>> * airline data: 658MB csv (7 million rows x 29 columns)
>>>>>>>>>> read.table("2008.csv",<all known tricks, known nrows>) #   360 sec
>>>>>>>>>> fread("2008.csv")                                      #    50 sec
>>>>>>>>>> See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
>>>>>>>>>> discussions and beta testing.
>>>>>>>>>> =====
>>>>>>>>>> 
>>>>>>>>>> 1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
>>>>>>>>>> 
>>>>>>>>>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>>>>>>>>>> require(data.table)
>>>>>>>>>> ?fread
>>>>>>>>>> fread("your biggest baddest file")
>>>>>>>>>> 
>>>>>>>>>> Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather
>>>>>>>>>> than -O3 (but -O3 on Win32 ok), so speedups might not be as great on Win64
>>>>>>>>>> until that can be resolved on R-Forge, unless you compile yourself. -O3
>>>>>>>>>> has some optimizations that fread may benefit from. But interested to hear.
>>>>>>>>>> 
>>>>>>>>>> Season's greetings!
>>>>>>>>>> 
>>>>>>>>>> Matthew
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> datatable-help mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>>>>> 
>>>>>>> 
>>>>> 
> 

