Thanks again Nicolas,
in my case, for the time being, the load process has no options!!! The data is
supplied to us by an outside firm; there is one directory per device and one
file per day of usage. The files are zipped, and there are 10-12 million of them.
I read them using "fread" this way:
fread(input = sprintf("zcat %s", x), …)
It works fine, much faster than using read.table.
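For readers following along, the per-file read could be sketched like this (a minimal sketch; the "data" root, the file pattern and the `files`/`read_one` names are illustrative assumptions, not from the original post):

```r
library(data.table)

# Collect the zipped daily files under the per-device directories
# (the "data" root and .gz pattern are assumptions for illustration).
files <- list.files("data", pattern = "\\.gz$",
                    recursive = TRUE, full.names = TRUE)

# Read one file by piping it through zcat, as in the post above.
read_one <- function(x) fread(input = sprintf("zcat %s", x))

# Stack all files for one device into a single data.table.
# dt <- rbindlist(lapply(files, read_one))
```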
All I have left to do before running in parallel on the whole set of data
directories is to find a way to efficiently output the resulting aggregated
data sets. I thought about SQLite, but was told that it locks the database
while a process is writing to it, and apparently PostgreSQL handles concurrent
writes more efficiently. But SQLite has the advantage of being embedded in
RSQLite, hence not requiring admin intervention; can't have everything, it seems!!!
Thanks again and cheers,
Gérald
Gerald Jean, M.Sc. in Statistics
Senior Statistical Advisor
Corporate Actuarial,
Modelling and Research
Property and Casualty Insurance
Mouvement Desjardins
Lévis (head office)
418 835-4900, ext. 5527639
1 877 835-4900, ext. 5527639
Fax: 418 835-6657
Make a good impression and print only when necessary!
This email is confidential, may be protected by professional secrecy and is
addressed exclusively to the recipient. It is strictly forbidden for anyone
else to disclose, distribute or reproduce this message. If you have received
it in error, please delete it immediately and notify the sender. Thank you.
From: Nicolas Paris [mailto:[email protected]]
Sent: 24 March 2015 08:35
To: Gerald Jean
Cc: [email protected]
Subject: Re: [datatable-help] Best way to export Huge data sets.
What about casting everything to character? CSV makes no difference between
types, since quoting is disabled in the config I proposed.
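Casting everything to character could look like this (a sketch with a toy data.frame; the data.table equivalent would be `dt[, lapply(.SD, as.character)]`, and the column names and values here are assumptions for illustration):

```r
# Toy mixed-type table standing in for the real data.
df <- data.frame(when = as.POSIXct("2015-03-24 08:35:00", tz = "UTC"),
                 ok   = TRUE,
                 grp  = factor("a"),
                 val  = 3.14,
                 stringsAsFactors = FALSE)

# Cast every column to character before writing; with quoting
# disabled, the CSV then carries plain text for all types.
df[] <- lapply(df, as.character)
```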
About PostgreSQL: I use it, and the fastest way to load data is the COPY
statement. I load 7 GB of data in 5 min, but...
COPY uses a CSV file as its source. A "binary" file can be used too, but I
have never tried it.
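The COPY route might be driven from R roughly like this (a sketch only; the database `mydb`, the table `agg` and the file name are assumptions, and the calls that touch disk or the database are left commented out):

```r
# Write the aggregated table as an unquoted, pipe-separated file,
# matching the CSV config suggested earlier in the thread.
# write.table(dt, "agg.csv", sep = "|", quote = FALSE,
#             row.names = FALSE, col.names = FALSE)

# Bulk-load it with PostgreSQL's COPY through the psql client.
cmd <- paste0("psql -d mydb -c ",
              shQuote("\\copy agg FROM 'agg.csv' WITH (FORMAT csv, DELIMITER '|')"))
# system(cmd)
```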
The RSQLite package could help too; some use it instead of writing CSV. I have
never tried that either.
2015-03-24 13:03 GMT+01:00 Gerald Jean
<[email protected]<mailto:[email protected]>>:
Hello Nicolas,
thanks for your suggestions. The data.table can't be transformed into a matrix
as it is of mixed types: POSIXct columns, character, logical, factor and
numeric columns.
Admin is currently installing PostgreSQL on the server; I'll try to go that
route. Too bad data.table doesn't yet have a writing routine as fast as
"fread" is for reading!!!
Thanks,
Gérald
From: Nicolas Paris [mailto:[email protected]]
Sent: 23 March 2015 17:51
To: Gerald Jean
Cc: [email protected]
Subject: Re: [datatable-help] Best way to export Huge data sets.
Some tests you can do without many code changes:
1) transform your data.table to a matrix before writing
2) use write.table with this config to save space & time:
-> sep = pipe "|" (1 byte & rarely used)
-> disable quoting (saves "" more than a billion times)
-> latin1 instead of UTF-8
3) use chunks (i.e. cut the output into slices) with append = TRUE (this may work in parallel)
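Suggestion 3 could be sketched like this (the chunk size, separator and the `write_chunks` helper name are assumptions for illustration):

```r
# Append a data.frame to disk in slices, writing the header only once.
write_chunks <- function(df, path, chunk_rows = 2L) {
  starts <- seq(1L, nrow(df), by = chunk_rows)
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + chunk_rows - 1L, nrow(df))
    write.table(df[rows, , drop = FALSE], path,
                sep = "|", quote = FALSE, row.names = FALSE,
                col.names = (i == 1L),  # header with the first slice only
                append    = (i > 1L))   # overwrite first, then append
  }
}

# Usage: write a small table in 2-row slices.
df <- data.frame(a = 1:5, b = letters[1:5])
f  <- tempfile(fileext = ".csv")
write_chunks(df, f)
```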
If it is still too slow, try installing a database (SQLite) on your 24-core
system and try loading into that.
Hope this helps.
2015-03-23 14:49 GMT+01:00 Gerald Jean
<[email protected]<mailto:[email protected]>>:
Hello,
I am currently on a project where I have to read, process and aggregate 10 to
12 million files, roughly 10 billion lines of data.
The files are arranged in roughly 64,000 directories; each directory is one
client's data.
I have written code importing and "massaging" the data per directory. The code
is data.table driven. I am running this on a 24-core machine with 145 GB of
RAM, on a Linux box under RedHat.
For testing purposes I have parallelized the code using the doMC package; it
runs fine and seems to be fast. But I haven't tried to output the resulting
files, three per client: a small one, a moderate-size one and a large one,
estimated at over 500 GB.
My question: what is the best way to output those files without creating
bottlenecks?
I thought of breaking the list of input directories into 24 threads, supplying
a list of lists to "foreach" where one of the components of each sub-list would
be the name of the output files, but I am worried that "write.table" would take
forever to write this data to disk. One solution would be to use "save" and
keep the output data in .RData format, but that complicates further analysis
by other software.
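The per-thread output idea could be sketched with base R's parallel package instead of doMC (a sketch; the batch layout, the placeholder aggregation and the file names are all assumptions — the point is that each worker writes its own file, so writers never contend):

```r
library(parallel)

# One worker processes a batch of client directories and writes its
# own output file; the aggregation here is a placeholder.
process_batch <- function(batch, out_dir) {
  agg <- data.frame(client = batch, n = seq_along(batch))
  out <- file.path(out_dir, paste0("batch_", batch[1], ".csv"))
  write.table(agg, out, sep = "|", quote = FALSE, row.names = FALSE)
  out
}

# Usage: split toy "directories" into 4 batches, write with 2 workers.
dirs    <- sprintf("client%02d", 1:8)
batches <- split(dirs, rep(1:4, each = 2))
outputs <- mclapply(batches, process_batch,
                    out_dir = tempdir(), mc.cores = 2L)
```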
Any suggestions???
By the way, "data.table" sure helped so far in processing that data; thanks to
the developers for such an efficient package,
Gérald
_______________________________________________
datatable-help mailing list
[email protected]<mailto:[email protected]>
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help