Definitely a 64-bit machine. Here are the details:

Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
Installed memory (RAM): 128GB
System type: 64-bit Operating System
Windows edition: Server 2008 R2 Enterprise SP1

Regards,
Paul

On 3 May 2013 10:51, Matthew Dowle <[email protected]> wrote:

> Hi Paul,
>
> Thanks for all this!
>
> > The problem arises when the file reaches 4GB, in this case between
> > 8,030,000 and 8,040,000 rows:
>
> Ahah. Are you using a 32bit or 64bit Windows machine?
>
> Thanks, Matthew
>
>
> On 02.05.2013 10:19, Paul Harding wrote:
>
> Some supplementary information, here is the portion of the file (with row
> numbers, +1 for header) around where fread thinks the file ends.
>
> $ nl spd_all_fixed.csv | head -n 9186300 |tail
> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>
> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
> mid-line by the look of it!
>
> I've experimented by truncating the file. The error varies, either it
> reads too few records or gives the error I reported, presumably determined
> by whether the last perceived line is entire.
>
> The problem arises when the file reaches 4GB, in this case between
> 8,030,000 and 8,040,000 rows:
>
> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>
> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 80300000
> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002000 (+middle 5 rows)
> Type codes: 000002000 (+last 5 rows)
> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
> 0.000s ( 0%) Memory map (rerun may be quicker)
> 0.000s ( 0%) Sep and header detection
> 0.000s ( 0%) Count rows (wc -l)
> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
> 171.188s ( 65%) Reading data
> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
> 0.000s ( 0%) Changing na.strings to NA
> 0.000s Total
>
> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 18913
> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002000 (+middle 5 rows)
> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>
> Regards,
> Paul
>
>
> On 1 May 2013 10:28, Paul Harding <[email protected]> wrote:
>
>> Here is the verbose output:
>>
>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>> Found 9 columns
>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>> All the fields on line 1 are character fields. Treating as the column names.
>> Count of eol after first data row: 9186293
>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>> Type codes: 000002000 (first 5 rows)
>> Type codes: 000002200 (+middle 5 rows)
>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>
>> But here is the wc output (via cygwin; newline, word (whitespace delim
>> so each word one 'line' here), byte):
>>
>> $ wc spd_all_fixed.csv
>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>
>> [So fread 9M, wc 168M rows].
>>
>> Regards
>> Paul
>>
>>
>> On 30 April 2013 18:52, Matthew Dowle <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>> output.
>>>
>>> Thanks, Matthew
>>>
>>>
>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>
>>> Problem with fread on a large file
>>>
>>> The file is 8GB, just short of 200,000 lines, produced as SQL output and
>>> modified by cygwin/perl to remove the second line.
>>>
>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>
>>> fread("data/spd_all_fixed.csv",sep=",")
>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>
>>> Looking for the offending line, with line numbers in output so I'm
>>> guessing this is line 6 of the mid-file chunk examined,
>>>
>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>
>>> and comparing to surrounding lines and the first ten lines
>>>
>>> $ head spd_all_fixed.csv
>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>
>>> I can't see any difference. I wonder if this is a bug? I have no
>>> problems on a small test data set run through an identical process and
>>> using the same fread command.
>>>
>>> Regards
>>> Paul

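A note on the figures reported in this thread: the 4GB threshold equals 2^32 bytes, which would be consistent with the file size being held in a 32-bit value somewhere along the read path, even on a 64-bit machine. That is only a hypothesis; nothing in the thread confirms where such a truncation would happen. The sketch below, in bash as used via cygwin in the messages, takes the byte and newline counts from the wc output and the eol count from fread's verbose output for spd_all_fixed.csv, and checks whether the size modulo 2^32 gives a plausible bytes-per-line figure. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Figures reported in the thread for spd_all_fixed.csv.
size_bytes=9078155125   # byte count from `wc`
wc_lines=168997637      # newline count from `wc`
fread_eols=9186293      # "Count of eol after first data row" from fread

# Assumption (not confirmed in the thread): the file size is truncated
# to an unsigned 32-bit value, i.e. taken modulo 2^32.
wrapped_bytes=$(( size_bytes % (1 << 32) ))

# Average bytes per line, using integer division:
# whole file with wc's line count, wrapped size with fread's eol count.
avg_full=$(( size_bytes / wc_lines ))
avg_wrapped=$(( wrapped_bytes / fread_eols ))

echo "size modulo 2^32:     $wrapped_bytes bytes"
echo "bytes/line (wc):      $avg_full"
echo "bytes/line (wrapped): $avg_wrapped"
```

The size modulo 2^32 is 488220533 bytes, and both averages come out at 53 bytes per line, so fread's eol count is roughly what one would expect if it had seen only the first 488220533 bytes of the 9078155125-byte file.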
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
