Passing on winbuilder now.
.zip (rev 874) uploaded to homepage (will take an hour or two to refresh), but available now from here:

https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable

Matthew

On 13.05.2013 21:38, Matthew Dowle wrote:

> Hi Paul,
>
> Sorry for that hassle. As you've realised, I don't develop data.table on Windows. Those lines are switched in at compile time for Windows, and so I rely on (the truly impressive) winbuilder to compile and test for me. On this occasion, I did submit to winbuilder last night but it didn't reply (even with a compile error), which is extremely unusual. And R-Forge is stuck in 'building' state too (which is not unusual, sadly).
>
> I'll let you know when it's passing on winbuilder, and I'll update the Windows .zip on the homepage (since we can't rely on R-Forge) ...
>
> Matthew
>
> On 13.05.2013 16:01, Paul Harding wrote:
>
>> I'd love to test it, pulled the latest commit with svn, not sure about building from source on windows, got some compilation errors:
>>
>> > install.packages("pkg/",type="source",repos=NULL)
>> Warning in install.packages :
>> package 'pkg/' is not available (for R version 3.0.0)
>> * installing *source* package 'data.table' ...
>> ** libs
>> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o
>> fread.c: In function 'readfile':
>> fread.c:343:9: error: 'hfile' undeclared (first use in this function)
>> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in
>> fread.c:346:115: error: expected ';' before ')' token
>> fread.c:346:115: error: expected statement before ')' token
>> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration]
>> make: *** [fread.o] Error 1
>> ERROR: compilation failed for package 'data.table'
>> Regards
>> Paul
>>
>> On 11 May 2013 02:39, Matthew Dowle <[email protected] [4]> wrote:
>>
>>> Paul, Vishal,
>>>
>>> Commit 859 :
>>>
>>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files
>>> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to
>>> be GetFileSizeEx().
>>>
>>> Please test and confirm ok now.
>>>
>>> Thanks, Matthew
>>>
>>> On 03.05.2013 14:59, Matthew Dowle wrote:
>>>
>>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc.
>>>>
>>>> Please could you file it as a bug on the tracker. Thanks.
>>>>
>>>> Matthew
>>>>
>>>> On 03.05.2013 14:32, Paul Harding wrote:
>>>>
>>>>> Definitely a 64-bit machine. Here are the details:
>>>>>
>>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>>>>> Installed memory (RAM): 128GB
>>>>> System type: 64-bit Operating System
>>>>> Windows edition: Server 2008 R2 Enterprise SP1
>>>>> Regards,
>>>>> Paul
>>>>>
>>>>> On 3 May 2013 10:51, Matthew Dowle <[email protected] [3]> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> Thanks for all this!
>>>>>>
>>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows:
>>>>>>
>>>>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>>>>>
>>>>>> Thanks, Matthew
>>>>>>
>>>>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>>>>
>>>>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends.
>>>>>>>
>>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it!
>>>>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire.
>>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows:
>>>>>>>
>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>>>>
>>>>>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>>>>>
>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>> Found 9 columns
>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>> Count of eol after first data row: 80300000
>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>>>>>
>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>> Type codes: 000002000 (+last 5 rows)
>>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>>>>>> 0.000s ( 0%) Sep and header detection
>>>>>>> 0.000s ( 0%) Count rows (wc -l)
>>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>>>>>> 171.188s ( 65%) Reading data
>>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>>>>>> 0.000s ( 0%) Changing na.strings to NA
>>>>>>> 0.000s Total
>>>>>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>>>>>
>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>> Found 9 columns
>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>> Count of eol after first data row: 18913
>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>>>>>
>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>>>>> Regards,
>>>>>>> Paul
>>>>>>>
>>>>>>> On 1 May 2013 10:28, Paul Harding <[email protected] [2]> wrote:
>>>>>>>
>>>>>>>> Here is the verbose output:
>>>>>>>>
>>>>>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>> Found 9 columns
>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>> Count of eol after first data row: 9186293
>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>> Type codes: 000002200 (+middle 5 rows)
>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>>>>>
>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte):
>>>>>>>>
>>>>>>>> $ wc spd_all_fixed.csv
>>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>>>>> [So fread 9M, wc 168M rows].
>>>>>>>> Regards
>>>>>>>> Paul
>>>>>>>>
>>>>>>>> On 30 April 2013 18:52, Matthew Dowle <[email protected] [1]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>>>>>
>>>>>>>>> Thanks, Matthew
>>>>>>>>>
>>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>>>>>
>>>>>>>>>> Problem with fread on a large file. The file is 8GB, just short of 200,000 lines, produced as SQL output and modified by cygwin/perl to remove the second line.
>>>>>>>>>>
>>>>>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>>>>>>>>
>>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>>> Looking for the offending line, with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined,
>>>>>>>>>>
>>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>>>>>
>>>>>>>>>> $ head spd_all_fixed.csv
>>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>>>>> I can't see any difference. I wonder if this is a bug?
>>>>>>>>>> I have no problems on a small test data set run through an identical process and using the same fread command.
>>>>>>>>>> Regards
>>>>>>>>>> Paul

Links:
------
[1] mailto:[email protected]
[2] mailto:[email protected]
[3] mailto:[email protected]
[4] mailto:[email protected]
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
