For CLI scripts like this, I usually skip Pyramid entirely and instantiate the 
SQLAlchemy session manually.  Won’t help your performance woes, but it is less 
machinery to deal with.

I second Michael’s bulk_insert_mappings; it’s what I usually reach for in cases 
like this, and the performance is good.  However, with Postgres you’ll need to 
enable psycopg2’s batch execution mode, especially if the DB is over a network; 
otherwise performance will suffer greatly. 

So my scripts usually look like this:

import sys

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

dburl = sys.argv[1]
engine = create_engine(dburl, executemany_mode='batch')
db = Session(bind=engine)
db.bulk_insert_mappings(MyTable, get_data())
db.commit()
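And if bulk_insert_mappings still isn’t fast enough, Michael’s point below about 
the raw psycopg2 connection gets you to COPY. A rough sketch of what that looks 
like; the table and column names here are made up, so adapt them to your schema:

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of row tuples into an in-memory CSV buffer."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)
    return buf


def copy_rows(engine, rows):
    # Borrow the raw psycopg2 connection from the SQLAlchemy engine and
    # stream the whole buffer to the server in a single COPY statement.
    conn = engine.raw_connection()
    try:
        with conn.cursor() as cur:
            cur.copy_expert(
                "COPY my_table (field1, field2) FROM STDIN WITH CSV",
                rows_to_csv(rows),
            )
        conn.commit()
    finally:
        conn.close()
```

copy_expert sends everything in one round trip, which is about as fast as you 
can get short of running COPY from a file on the server itself.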

I wouldn’t use Pandas unless you’re already handling your data with it.
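If you already are, though, to answer the to_sql question below: the "engine" 
there is just what create_engine() returns. A minimal sketch, using sqlite so it 
runs anywhere; swap in your postgresql:// URL:

```python
import pandas as pd
from sqlalchemy import create_engine

# "engine" is simply the object returned by create_engine().
engine = create_engine('sqlite://')

df = pd.DataFrame({'autonomous_system': [2, 3], 'id_topology': [68, 68]})

# method='multi' batches rows into multi-value INSERT statements, which
# helps over a network; chunksize bounds the size of each statement.
df.to_sql('users', con=engine, if_exists='replace', index=False,
          method='multi', chunksize=10000)
```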

> On Mar 20, 2020, at 12:47 PM, Michael Merickel <[email protected]> wrote:
> 
> dbsession.add on each row is pretty much the worst-case scenario. Start with 
> https://docs.sqlalchemy.org/en/13/_modules/examples/performance/bulk_inserts.html
> which shows you how to use dbsession.bulk_insert_mappings and several other 
> approaches. The only caveat that's Pyramid-related is that since the 
> dbsession is controlled by the tm, if you drop down to some of the core apis 
> you'll need to use zope.sqlalchemy.mark_changed(dbsession) at the end to have 
> your data committed.
> 
> Of course there's always the ability to use the raw psycopg2 connection as 
> well which can get you closer to COPY-like performance - kinda depends on 
> your needs, but I've had success just using bulk_insert_mappings on the ORM.
> 
> - Michael
> 
>> On Mar 20, 2020, at 14:41, Emerson Barea <[email protected]> wrote:
>> 
>> Hi there.
>> 
>> Sometimes my app needs to create and save almost a million records in a 
>> Postgres database. I'm looking for the best way to do this, because the 
>> procedures I've used so far are very slow.
>> 
>> I will present some ways that I tried that were very slow, as well as the 
>> workarounds that I tried to improve them and the errors that occurred:
>> 
>> 1 - a for loop to generate the records and add them to the transaction 
>> manager
>> 
>> import argparse
>> import getopt
>> import sys
>> 
>> from pyramid.paster import bootstrap, setup_logging
>> from sqlalchemy.exc import OperationalError
>> 
>> 
>> class Topology(object):
>>     def __init__(self, dbsession):
>>         self.dbsession = dbsession
>> 
>>     def autonomous_system(self):
>> 
>>         # SOME PROCESS THAT RETURNS array1 AND array2
>> 
>>         for item in array1:
>>             record = Table(field1=item,
>>                            #other fields here
>>                           )
>>             self.dbsession.add(record)
>> 
>> 
>> def parse_args(config_file):
>>     parser = argparse.ArgumentParser()
>>     parser.add_argument(
>>         'config_uri',
>>         help='Configuration file, e.g., pyramid.ini',
>>     )
>>     return parser.parse_args(config_file.split())
>> 
>> 
>> def main(argv=sys.argv[1:]):
>>     try:
>>         opts, args = getopt.getopt(argv, 'h:', ["config-file="])
>>     except getopt.GetoptError:
>>         print('* Usage: topology --config-file=pyramid.ini')
>>         sys.exit(2)
>>     for opt, arg in opts:
>>         if opt == '-h':
>>             print('* Usage: topology --config-file=pyramid.ini')
>>             sys.exit()
>>         elif opt == '--config-file':
>>             config_file = arg
>> 
>>     args = parse_args(config_file)
>>     setup_logging(args.config_uri)
>>     env = bootstrap(args.config_uri)
>>     try:
>>         with env['request'].tm:
>>             dbsession = env['request'].dbsession
>>             t = Topology(dbsession)
>>             t.autonomous_system()
>>     except OperationalError:
>>         print('Database error')
>> 
>> In this case, I noticed that the "dbsession.add(record)" inside the loop 
>> made the process very slow. I tried to create an array with the records in 
>> the loop and add everything to the transaction manager just once outside the 
>> loop using "bulk_save_objects" 
>> (https://docs.sqlalchemy.org/en/13/orm/persistence_techniques.html#bulk-operations),
>> but I don't understand what "s = Session()" is.
>> 
>> s = Session()
>> objects = [
>>     models.AutonomousSystem(autonomous_system=2, id_topology=68),
>>     models.AutonomousSystem(autonomous_system=3, id_topology=68),
>>     models.AutonomousSystem(autonomous_system=4, id_topology=68),
>>     models.AutonomousSystem(autonomous_system=5, id_topology=68)
>> ]
>> s.bulk_save_objects(objects)
>> 
>> 2 - insert a pandas dataframe into the database
>> 
>> Another way that I tried, and that I think has the best performance for 
>> working with almost a million records, is to use pandas dataframes. It is 
>> very fast to generate all the records using pandas dataframes, but I'm not 
>> succeeding in inserting all the data from the dataframe into the Postgres 
>> database.
>> 
>> I tried the "to_sql" procedure 
>> (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html),
>> but I don't know what the "engine" is in the "df.to_sql('users', 
>> con=engine)" command.
>> 
>> Please, can someone help me with these questions?
>> 
>> Emerson
>> 
>> 
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "pylons-discuss" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/pylons-discuss/CAO1UhdXJSP_mvJm3Br-_oiGx3FZBR9fcKagQtfFnRXscKWQErg%40mail.gmail.com
> 
> 
