Some things you can test without many code changes:

1) Convert your data.table to a matrix before writing.
2) Use write.table with these settings to save space and time:
   -> sep = "|" (a pipe is 1 byte and rarely appears in data)
   -> quote = FALSE (saves writing "" more than a billion times)
   -> latin1 encoding instead of UTF-8
3) Write the output in chunks (slices) with append = TRUE (this may also work in parallel).
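
The three suggestions can be combined roughly as below. This is only a sketch under assumptions: `dt` is a small made-up stand-in for one client's processed data, and `out_file` and `chunk_size` are illustrative names and values, not anything from the original code. Keep in mind that as.matrix() coerces mixed column types to character, so the matrix step only pays off if that is acceptable for your data.

```r
# Hypothetical stand-in for one client's processed data.table.
dt <- data.frame(id = 1:10, amount = seq(0.5, 5, by = 0.5))

out_file   <- tempfile(fileext = ".txt")  # hypothetical output path
chunk_size <- 4L                          # rows per chunk, illustrative only

m <- as.matrix(dt)                        # tip 1: convert to a matrix first
starts <- seq(1L, nrow(m), by = chunk_size)

for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1L, nrow(m))
  write.table(
    m[rows, , drop = FALSE],
    file         = out_file,
    sep          = "|",        # tip 2: one-byte separator
    quote        = FALSE,      # tip 2: no quotes around values
    row.names    = FALSE,
    col.names    = (i == 1L),  # header only with the first chunk
    append       = (i > 1L),   # tip 3: later chunks are appended
    fileEncoding = "latin1"    # tip 2: latin1 instead of UTF-8
  )
}

lines <- readLines(out_file)
```

If the chunks are written by parallel workers, appending to a single shared file can interleave output, so giving each worker its own file and concatenating afterwards is probably the safer variant.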
If that is still too slow, try installing a database (e.g. SQLite) on your 24-core system and loading the data into it.

Hope this helps.

2015-03-23 14:49 GMT+01:00 Gerald Jean <[email protected]>:
> Hello,
>
> I am currently on a project where I have to read, process and aggregate
> 10 to 12 million files, for roughly 10 billion lines of data.
>
> The files are arranged in roughly 64000 directories; each directory is one
> client's data.
>
> I have written code importing and "massaging" the data per directory. The
> code is data.table driven. I am running this on a 24-core machine with
> 145 Gb of RAM, on a Linux box under RedHat.
>
> For testing purposes I have parallelized the code using the doMC package;
> it runs fine and seems to be fast. But I haven't tried to output the
> resulting files, three per client: a small one, a moderate-size one and a
> large one, over 500Gb estimated.
>
> My question:
>
> What is the best way to output those files without creating bottlenecks?
>
> I thought of breaking the list of input directories into 24 threads,
> supplying a list of lists to "foreach", where one of the components of each
> sub-list would be the names of the output files, but I am worried that
> "write.table" would take forever to write this data to disk. One solution
> would be to use "save" and keep the output data in Rdata format, but that
> complicates further analysis by other software.
>
> Any suggestions?
>
> By the way, "data.table" sure helped so far in processing that data; thanks
> to the developers for such an efficient package.
>
> Gérald
>
> Gerald Jean, M.Sc. in Statistics
> Senior Statistical Advisor
> Corporate Actuarial, Modelling and Research
> Property and Casualty Insurance
> Mouvement Desjardins
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
