Paul, Vishal,

Commit 859 :

  * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to be GetFileSizeEx().

Please test and confirm ok now.

Thanks, Matthew

On 03.05.2013 14:59, Matthew Dowle wrote:

> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc.
>
> Please could you file it as a bug on the tracker. Thanks.
>
> Matthew
>
> On 03.05.2013 14:32, Paul Harding wrote:
>
>> Definitely a 64-bit machine. Here are the details:
>>
>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>> Installed memory (RAM): 128GB
>> System type: 64-bit Operating System
>> Windows edition: Server 2008 R2 Enterprise SP1
>>
>> Regards,
>> Paul
>>
>> On 3 May 2013 10:51, Matthew Dowle <[email protected] [3]> wrote:
>>
>>> Hi Paul,
>>>
>>> Thanks for all this!
>>>
>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows:
>>>
>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>>
>>> Thanks, Matthew
>>>
>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>
>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends.
>>>>
>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>
>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it!
>>>>
>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire.
>>>>
>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows:
>>>>
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>
>>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>>
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 80300000
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>>
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002000 (+middle 5 rows)
>>>> Type codes: 000002000 (+last 5 rows)
>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>>> 0.000s ( 0%) Sep and header detection
>>>> 0.000s ( 0%) Count rows (wc -l)
>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>>> 171.188s ( 65%) Reading data
>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>>> 0.000s ( 0%) Changing na.strings to NA
>>>> 0.000s Total
>>>>
>>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>>
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 18913
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>>
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002000 (+middle 5 rows)
>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>>
>>>> Regards,
>>>> Paul
>>>>
>>>> On 1 May 2013 10:28, Paul Harding <[email protected] [2]> wrote:
>>>>
>>>>> Here is the verbose output:
>>>>>
>>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>> Found 9 columns
>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>> Count of eol after first data row: 9186293
>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>>> Type codes: 000002000 (first 5 rows)
>>>>> Type codes: 000002200 (+middle 5 rows)
>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>
>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte):
>>>>>
>>>>> $ wc spd_all_fixed.csv
>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>>
>>>>> [So fread 9M, wc 168M rows].
>>>>>
>>>>> Regards
>>>>> Paul
>>>>>
>>>>> On 30 April 2013 18:52, Matthew Dowle <[email protected] [1]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>>
>>>>>> Thanks, Matthew
>>>>>>
>>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>>
>>>>>>> Problem with fread on a large file
>>>>>>>
>>>>>>> The file is 8GB, just short of 200,000 lines, produced as SQL output and modified by cygwin/perl to remove the second line.
>>>>>>>
>>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>>>>>
>>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>
>>>>>>> Looking for the offending line, with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined,
>>>>>>>
>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>>
>>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>>
>>>>>>> $ head spd_all_fixed.csv
>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>>
>>>>>>> I can't see any difference. I wonder if this is a bug?
>>>>>>> I have no problems on a small test data set run through an identical process and using the same fread command.
>>>>>>>
>>>>>>> Regards
>>>>>>> Paul

Links:
------
[1] mailto:[email protected]
[2] mailto:[email protected]
[3] mailto:[email protected]
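
For reference, a rough sketch in C of why a 32-bit file size would explain the numbers in this thread. This is not the fread source: low32_of_size() and estimated_lines() are hypothetical helpers made up for illustration, and the sketch assumes that the old code only ever saw the low-order 32 bits of the size, which is what GetFileSize() returns when its second argument (the high-order part) is not requested. The byte and newline counts are the ones from the wc output quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of fread: the size a caller ends up with
   when only the low-order 32 bits of a 64-bit file size are kept. */
uint32_t low32_of_size(uint64_t true_size)
{
    return (uint32_t)(true_size & 0xFFFFFFFFu);
}

/* Hypothetical helper: rough number of lines that fit in `seen_bytes`,
   assuming every line has the file's average length. */
uint64_t estimated_lines(uint64_t seen_bytes, uint64_t total_bytes,
                         uint64_t total_lines)
{
    assert(total_bytes > 0);
    return seen_bytes * total_lines / total_bytes;
}
```

With the wc figures, low32_of_size(9078155125) gives 488220533, since 9078155125 - 2 * 4294967296 = 488220533. At the file's average line length that is on the order of 9 million lines, in the same region as the 9186293 eols fread counted before stopping mid-line. The same wrap-around would fit the truncated files: the 4.0G file still fits in 32 bits, while the 4.1G one wraps to a very small size, which would be consistent with only 18913 rows being counted.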
_______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
