Hi,

I have a huge database saved in CSV files, structured like this:

header --> the main "table" / collection
production --> each row (document) in header has hundreds to thousands of 
production records
ingredients --> each header and some production records have ingredients
stages --> each ingredient is used in different stages
...

in total there are 15 "tables"

In total the zipped files are 32 GB, and unzipped they are around 250 GB. 
There are around 2,500 zip files, so around 37,500 CSV files, and around 
15,000 of them are over 1 GB in size.

So for each table/collection there are around 2,500 CSV files (for example 
header1.csv, header2.csv... header2500.csv). There is no duplicated data; 
the biggest file should contain around 300 million rows.

How can I improve the import process? It is really, really too slow: it 
literally takes days to import the whole database, and the process gets 
slower as it advances.

My script (.sh) looks like this:

# loop over the compressed files obtained from AWS S3
for z in $(find . -name "*.zip")
do
    # unzip all the csv files contained in each zip file found
    unzip "$z" -d unzipped

    # import process... one call per collection, plus other parameters
    arangoimp --server.connection-timeout 15000 --server.request-timeout 15000 \
        --collection Headers --file unzipped/Headers.csv --type csv --separator ,
    arangoimp --server.connection-timeout 15000 --server.request-timeout 15000 \
        --collection Production --file unzipped/Production.csv --type csv --separator ,
    arangoimp --server.connection-timeout 15000 --server.request-timeout 15000 \
        --collection Ingredients --file unzipped/Ingredients.csv --type csv --separator ,
    arangoimp --server.connection-timeout 15000 --server.request-timeout 15000 \
        --collection Stages --file unzipped/Stages.csv --type csv --separator ,
    # ... the same arangoimp call for the remaining collections ...
    arangoimp --server.connection-timeout 15000 --server.request-timeout 15000 \
        --collection Summary --file unzipped/Summary.csv --type csv --separator ,

    # delete the csv files so they don't have to be overwritten when
    # unzipping the next zip file
    rm unzipped/*.csv
done
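
(For comparison, here is a minimal sketch of the same loop with the repeated 
arangoimp call factored into a helper. The flags are exactly the ones used 
above; the import_csv function and the COLLECTIONS list are names made up 
just for this sketch, and the list would still need the remaining collections 
added.)

#!/bin/bash
# sketch only: same arangoimp flags as the script above, repetition factored out
COLLECTIONS="Headers Production Ingredients Stages Summary"   # ... plus the other collections

import_csv() {
    # $1 = collection name; the csv is expected at unzipped/<name>.csv
    arangoimp --server.connection-timeout 15000 --server.request-timeout 15000 \
        --collection "$1" --file "unzipped/$1.csv" --type csv --separator ,
}

for z in $(find . -name "*.zip"); do
    unzip "$z" -d unzipped
    for c in $COLLECTIONS; do
        import_csv "$c"
    done
    rm unzipped/*.csv
done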

What are the options to reduce the import time and to improve the queries 
later? Clustering? I can organize the data by state or city... Should I use 
multiple computers? More data will keep being added, more or less monthly...

Thanks in advance



