Parameters go in the URL:

create_engine('postgresql://user:pass@server/dbname')

See https://docs.sqlalchemy.org/en/13/core/engines.html
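If you hold the parameters as separate values (servername, username, password, database name), one stdlib-only way to compose that URL is shown below. quote_plus escapes special characters in the password; the function name and arguments are illustrative, not from SQLAlchemy itself:

```python
from urllib.parse import quote_plus


def make_db_url(user, password, host, dbname):
    """Compose a SQLAlchemy database URL from individual parameters."""
    # quote_plus escapes characters like '@' or ':' in the password,
    # which would otherwise break URL parsing
    return f"postgresql://{user}:{quote_plus(password)}@{host}/{dbname}"


url = make_db_url("user", "p@ss:word", "server", "dbname")
# pass `url` to create_engine(url)
```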

> On Mar 20, 2020, at 6:41 PM, Emerson Barea <[email protected]> wrote:
> 
> Thank you Michael Merickel and Theron Luhn for the answers.
> 
> I'll try to follow the bulk_insert_mappings procedure, but I don't know how to 
> pass database parameters (servername, username, password, database schema) in 
> a connection script like the one Theron posted. I know this is a silly 
> question, but can you help me again?
> 
> Thank you.
> 
> On Fri, Mar 20, 2020 at 17:02, Theron Luhn <[email protected]> wrote:
> For CLI scripts like this, I usually skip Pyramid entirely and instantiate 
> the SQLAlchemy session manually.  Won’t help your performance woes, but it is 
> less machinery to deal with.
> 
> I second Michael’s bulk_insert_mappings; it's what I usually reach for in 
> cases like this, and the performance is good.  However, with Postgres you’ll 
> need to enable batch mode, especially if the DB is over a network; otherwise 
> performance will suffer greatly. 
> 
> So my scripts usually look like this:
> 
> dburl = sys.argv[1]
> engine = create_engine(dburl, executemany_mode='batch')
> db = Session(bind=engine)
> db.bulk_insert_mappings(MyTable, get_data())
> db.commit()
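For close to a million rows it can also help to insert in chunks rather than in one giant call. A stdlib-only chunking helper sketch (the `db`, `MyTable`, and `get_data` names in the usage comment are from the script above):

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# usage sketch, assuming `db` and `MyTable` as in the script above:
# for batch in chunked(get_data(), 10000):
#     db.bulk_insert_mappings(MyTable, batch)
# db.commit()
```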
> 
> I wouldn’t use Pandas unless you’re already handling your data with it.
> 
>> On Mar 20, 2020, at 12:47 PM, Michael Merickel <[email protected]> wrote:
>> 
>> dbsession.add on each row is pretty much the worst-case scenario. Start with 
>> https://docs.sqlalchemy.org/en/13/_modules/examples/performance/bulk_inserts.html
>> which shows you how to use dbsession.bulk_insert_mappings and several other 
>> approaches. The only caveat that's Pyramid-related is that since the 
>> dbsession is controlled by the tm, if you drop down to some of the core APIs 
>> you'll need to use zope.sqlalchemy.mark_changed(dbsession) at the end to 
>> have your data committed.
>> 
>> Of course there's always the ability to use the raw psycopg2 connection as 
>> well which can get you closer to COPY-like performance - kinda depends on 
>> your needs, but I've had success just using bulk_insert_mappings on the ORM.
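If you do go the raw-psycopg2 COPY route, one minimal sketch is to serialize the rows into an in-memory CSV buffer and feed it to the cursor's copy_expert method. Only the buffer-building part runs standalone below; the COPY call (the table and column names are examples, not from this thread) is shown as a comment because it needs a live Postgres connection:

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize an iterable of tuples into an in-memory CSV buffer for COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)  # rewind so COPY reads from the start
    return buf


buf = rows_to_csv_buffer([(2, 68), (3, 68)])

# With a raw psycopg2 connection (sketch; names are illustrative):
# cur = connection.cursor()
# cur.copy_expert(
#     "COPY autonomous_system (autonomous_system, id_topology) FROM STDIN WITH CSV",
#     buf,
# )
# connection.commit()
```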
>> 
>> - Michael
>> 
>>> On Mar 20, 2020, at 14:41, Emerson Barea <[email protected]> wrote:
>>> 
>>> Hi there.
>>> 
>>> Sometimes my app needs to create and save almost a million records in a 
>>> Postgres database. I'm looking for the best way to do this, because the 
>>> procedures I've used so far are very slow.
>>> 
>>> I will present some ways that I tried that were very slow, as well as the 
>>> workarounds that I tried to improve them and the errors that occurred:
>>> 
>>> 1 - a for loop to generate the records and add them to the transaction 
>>> manager
>>> 
>>> import argparse
>>> import getopt
>>> import sys
>>> 
>>> from pyramid.paster import bootstrap, setup_logging
>>> from sqlalchemy.exc import OperationalError
>>> 
>>> 
>>> class Topology(object):
>>>     def __init__(self, dbsession):
>>>         self.dbsession = dbsession
>>> 
>>>     def autonomous_system(self):
>>>         
>>>         
>>>         # SOME PROCESS THAT RETURNS array1 AND array2
>>>         
>>>         
>>>         for item in array1:
>>>             record = Table(field1=item,
>>>                            # other fields here
>>>                           )
>>>             self.dbsession.add(record)
>>> 
>>> 
>>> def parse_args(config_file):
>>>     parser = argparse.ArgumentParser()
>>>     parser.add_argument(
>>>         'config_uri',
>>>         help='Configuration file, e.g., pyramid.ini',
>>>     )
>>>     return parser.parse_args(config_file.split())
>>> 
>>> 
>>> def main(argv=sys.argv[1:]):
>>>     try:
>>>         opts, args = getopt.getopt(argv, 'h', ["config-file="])
>>>     except getopt.GetoptError:
>>>         print('* Usage: topology --config-file=pyramid.ini')
>>>         sys.exit(2)
>>>     for opt, arg in opts:
>>>         if opt == '-h':
>>>             print('* Usage: topology --config-file=pyramid.ini')
>>>             sys.exit()
>>>         elif opt == '--config-file':
>>>             config_file = arg
>>> 
>>>     args = parse_args(config_file)
>>>     setup_logging(args.config_uri)
>>>     env = bootstrap(args.config_uri)
>>>     try:
>>>         with env['request'].tm:
>>>             dbsession = env['request'].dbsession
>>>             t = Topology(dbsession)
>>>             t.autonomous_system()
>>>     except OperationalError:
>>>         print('Database error')
>>> 
>>> In this case, I noticed that the "dbsession.add(record)" inside the loop 
>>> made the process very slow. I tried to create an array with the records in 
>>> the loop and add everything to the transaction manager just once outside 
>>> the loop using "bulk_save_objects" 
>>> (https://docs.sqlalchemy.org/en/13/orm/persistence_techniques.html#bulk-operations),
>>> but I don't understand what "s = Session()" is.
>>> 
>>> s = Session()
>>> objects = [
>>>     models.AutonomousSystem(autonomous_system=2, id_topology=68),
>>>     models.AutonomousSystem(autonomous_system=3, id_topology=68),
>>>     models.AutonomousSystem(autonomous_system=4, id_topology=68),
>>>     models.AutonomousSystem(autonomous_system=5, id_topology=68)
>>> ]
>>> s.bulk_save_objects(objects)
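To answer the "s = Session()" question: in the SQLAlchemy docs' example it is a session constructed from an engine via sessionmaker (in a Pyramid script it would simply be env['request'].dbsession). A self-contained sketch against an in-memory SQLite database, with a model mirroring the snippet above (column names taken from it; everything else is illustrative):

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import sessionmaker

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # SQLAlchemy 1.3

Base = declarative_base()


class AutonomousSystem(Base):
    __tablename__ = 'autonomous_system'
    id = Column(Integer, primary_key=True)
    autonomous_system = Column(Integer)
    id_topology = Column(Integer)


# in-memory SQLite so the sketch runs without a Postgres server;
# with Postgres you would pass your real URL to create_engine
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)
s = Session()  # this is the "s = Session()" from the docs' example

objects = [
    AutonomousSystem(autonomous_system=n, id_topology=68)
    for n in range(2, 6)
]
s.bulk_save_objects(objects)
s.commit()
```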
>>> 
>>> 2 - insert a pandas dataframe into the database
>>> 
>>> Another way that I tried, and that I think has the best performance for 
>>> working with almost a million records, is to use pandas DataFrames. It is 
>>> very fast to generate all the records with pandas DataFrames, but I'm not 
>>> having success inserting all the data from the DataFrame into the Postgres 
>>> database.
>>> 
>>> I tried the "to_sql" procedure 
>>> (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html),
>>> but I don't know what the "engine" in the "df.to_sql('users', 
>>> con=engine)" command is.
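Regarding the "engine" in df.to_sql('users', con=engine): it is just the object returned by sqlalchemy.create_engine(dburl), as in Theron's script above. to_sql also accepts a plain sqlite3 connection, which makes this sketch runnable without a Postgres server (the table and column names are illustrative):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "autonomous_system": [2, 3, 4, 5],
    "id_topology": [68, 68, 68, 68],
})

# With Postgres you would pass con=create_engine(dburl) instead;
# an in-memory sqlite3 connection keeps this demo self-contained.
con = sqlite3.connect(":memory:")
df.to_sql("autonomous_system", con=con, index=False)

n = con.execute("SELECT COUNT(*) FROM autonomous_system").fetchone()[0]
```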
>>> 
>>> Please, can someone help me with these questions?
>>> 
>>> Emerson
>>> 
>>> 
>>> 
>>> -- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "pylons-discuss" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/pylons-discuss/CAO1UhdXJSP_mvJm3Br-_oiGx3FZBR9fcKagQtfFnRXscKWQErg%40mail.gmail.com.
