Hi there.
Sometimes my app needs to create and save almost a million records in a
Postgres database. I'm looking for the best way to do this, because the
procedures I've used so far are very slow.
I will present some approaches I tried that were very slow, as well as the
workarounds I tried to improve them and the errors that occurred:
1 - a for loop to generate the records and add them to the transaction
manager
import argparse
import getopt
import sys

from pyramid.paster import bootstrap, setup_logging
from sqlalchemy.exc import OperationalError


class Topology(object):
    def __init__(self, dbsession):
        self.dbsession = dbsession

    def autonomous_system(self):
        # SOME PROCESS THAT RETURNS array1 AND array2
        for i in range(len(array1)):
            record = Table(field1=array1[i],
                           # other fields here
                           )
            self.dbsession.add(record)


def parse_args(config_file):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        'config_uri',
        help='Configuration file, e.g., pyramid.ini',
    )
    return parser.parse_args(config_file.split())


def main(argv=sys.argv[1:]):
    try:
        opts, args = getopt.getopt(argv, 'h', ['config-file='])
    except getopt.GetoptError:
        print('* Usage: topology --config-file=pyramid.ini')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('* Usage: topology --config-file=pyramid.ini')
            sys.exit()
        elif opt == '--config-file':
            config_file = arg
    args = parse_args(config_file)
    setup_logging(args.config_uri)
    env = bootstrap(args.config_uri)
    try:
        with env['request'].tm:
            dbsession = env['request'].dbsession
            t = Topology(dbsession)
            t.autonomous_system()
    except OperationalError:
        print('Database error')
In this case, I noticed that the "dbsession.add(record)" call inside the
loop made the process very slow. I tried to build a list with the records
in the loop and add everything to the transaction manager just once outside
the loop using "bulk_save_objects" (
https://docs.sqlalchemy.org/en/13/orm/persistence_techniques.html#bulk-operations),
but I don't understand what "s = Session()" is.
s = Session()
objects = [
    models.AutonomousSystem(autonomous_system=2, id_topology=68),
    models.AutonomousSystem(autonomous_system=3, id_topology=68),
    models.AutonomousSystem(autonomous_system=4, id_topology=68),
    models.AutonomousSystem(autonomous_system=5, id_topology=68),
]
s.bulk_save_objects(objects)
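For context, in plain SQLAlchemy the "Session" in that docs snippet usually comes from a sessionmaker bound to an engine. A minimal, self-contained sketch of that pattern (the model and table names here are hypothetical, and an in-memory SQLite engine stands in for a Postgres one):

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class AutonomousSystem(Base):
    # Hypothetical model mirroring the docs example above.
    __tablename__ = 'autonomous_system'
    id = Column(Integer, primary_key=True)
    autonomous_system = Column(Integer)
    id_topology = Column(Integer)


# SQLite in-memory stands in for a Postgres URL like
# 'postgresql://user:password@localhost/dbname'.
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

# This is where "Session" comes from: a factory bound to the engine.
Session = sessionmaker(bind=engine)
s = Session()

objects = [AutonomousSystem(autonomous_system=n, id_topology=68)
           for n in range(2, 6)]
s.bulk_save_objects(objects)
s.commit()
```

In a Pyramid app the request's dbsession is already such a Session, so `env['request'].dbsession.bulk_save_objects(objects)` should work without creating a second one.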
2 - insert a pandas dataframe into the database
Another way that I tried, and the one that I think has the best performance
for working with almost a million records, is to use pandas DataFrames.
Generating all the records in a DataFrame is very fast, but I haven't
succeeded in inserting the data from the DataFrame into the Postgres
database.
I tried the "to_sql" method (
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html),
but I don't know what the "engine" is in the "df.to_sql('users', con=engine)"
call.
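For what it's worth, the "engine" in that call is a SQLAlchemy Engine built from a database URL with create_engine. A minimal sketch (the table name and columns are made up, and an in-memory SQLite engine stands in for a Postgres one):

```python
import pandas as pd
from sqlalchemy import create_engine, text

# The "engine" is a SQLAlchemy Engine; SQLite in-memory stands in here
# for a Postgres URL such as 'postgresql://user:password@localhost/dbname'.
engine = create_engine('sqlite://')

df = pd.DataFrame({'autonomous_system': [2, 3, 4, 5],
                   'id_topology': [68, 68, 68, 68]})

# method='multi' sends many rows per INSERT statement, and chunksize
# bounds how many rows go in each batch; both help with large frames.
df.to_sql('autonomous_system', con=engine, if_exists='append',
          index=False, method='multi', chunksize=10000)

with engine.connect() as conn:
    count = conn.execute(
        text('SELECT COUNT(*) FROM autonomous_system')).scalar()
```

The `method` and `chunksize` parameters are documented on `DataFrame.to_sql`; whether they help for a given Postgres setup is something to measure rather than assume.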
Can someone please help me with these questions?
Emerson