Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc.
Please could you file it as a bug on the tracker. Thanks.

Matthew

On 03.05.2013 14:32, Paul Harding wrote:

> Definitely a 64-bit machine. Here are the details:
>
> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
> Installed memory (RAM): 128GB
> System type: 64-bit Operating System
> Windows edition: Server 2008 R2 Enterprise SP1
> Regards,
> Paul
>
> On 3 May 2013 10:51, Matthew Dowle <[email protected] [3]> wrote:
>
>> Hi Paul,
>>
>> Thanks for all this!
>>
>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows:
>>
>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>
>> Thanks, Matthew
>>
>> On 02.05.2013 10:19, Paul Harding wrote:
>>
>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends.
>>>
>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>
>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it!
>>>
>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire.
>>>
>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows:
>>>
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>
>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>> All the fields on line 1 are character fields. Treating as the column names.
>>> Count of eol after first data row: 80300000
>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Type codes: 000002000 (+last 5 rows)
>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>> 0.000s ( 0%) Sep and header detection
>>> 0.000s ( 0%) Count rows (wc -l)
>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>> 171.188s ( 65%) Reading data
>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>> 0.000s ( 0%) Changing na.strings to NA
>>> 0.000s Total
>>>
>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>> All the fields on line 1 are character fields. Treating as the column names.
>>> Count of eol after first data row: 18913
>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>
>>> Regards,
>>> Paul
>>>
>>> On 1 May 2013 10:28, Paul Harding <[email protected] [2]> wrote:
>>>
>>>> Here is the verbose output:
>>>>
>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 9186293
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002200 (+middle 5 rows)
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>
>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>
>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte):
>>>>
>>>> $ wc spd_all_fixed.csv
>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>
>>>> [So fread 9M, wc 168M rows].
>>>>
>>>> Regards
>>>> Paul
>>>>
>>>> On 30 April 2013 18:52, Matthew Dowle <[email protected] [1]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>
>>>>> Thanks, Matthew
>>>>>
>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>
>>>>>> Problem with fread on a large file
>>>>>>
>>>>>> The file is 8GB, just short of 200,000 lines, produced as SQL output and modified by cygwin/perl to remove the second line.
>>>>>>
>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>>>>
>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>
>>>>>> Looking for the offending line, with line numbers in output, so I'm guessing this is line 6 of the mid-file chunk examined,
>>>>>>
>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>
>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>
>>>>>> $ head spd_all_fixed.csv
>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>
>>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command.
>>>>>>
>>>>>> Regards
>>>>>> Paul

Links:
------
[1] mailto:[email protected]
[2] mailto:[email protected]
[3] mailto:[email protected]
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
