Re: general interest in faster bulk_update implementation

2022-10-18 Thread Jörg Breitbart
Will there *not* be a Django ORM implementation of psycopg3 COPY FROM 
when that lands? And, I guess I'll need to figure out when that 
lands/would land.
Not automagically, since the ORM does not use COPY internally. The main 
ORM parts are pretty generic, thus mostly based on ANSI compliant SQL 
statements, which is good to keep it db vendor independent.


COPY is a postgres invention to move plain constant table data as fast 
as possible skipping most higher DBMS functionality (in/out or between 
tables, many postgres backup tools use it).


--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/3acd103a-8aea-0223-41a1-d8598a55954c%40netzkolchose.de.


Re: general interest in faster bulk_update implementation

2022-10-18 Thread jobe...@gmail.com
Will there *not* be a Django ORM implementation of psycopg3 COPY FROM when 
that lands? And, I guess I'll need to figure out when that lands/would land.

On Tuesday, October 18, 2022 at 11:07:51 AM UTC-4 j.bre...@netzkolchose.de 
wrote:

> > pretty quickly, so if you need testing input (Django 3.2, Postgres) I
> > can offer feedback from what I find.
>
> Yes testing would be awesome, esp. for edge cases (test coverage for 
> default cases is pretty complete for `fast_update` I think).
>
> > Can you tell me more about this statement:
> > > *Note* copy_update will probably never leave the alpha/PoC-state, as
> > psycopg3 brings great COPY support, which does a more secure value
> > conversion and has a very fast C-version.
>
> Well I created the `copy_update` alternative for postgres just to see 
> the advantage of COPY FROM over UPDATE FROM VALUES. The impl uses 
> psycopg2's COPY interface, which got heavily revamped in psycopg3, 
> including proper value adapters written C. V2 does not have this yet, 
> therefore I had to create the value encoders in python, which are less 
> strict about values and still ~3 times slower than the C adapters in 
> psycopg3.
> The message above is meant as a warning, that I dont plan to put too 
> much effort into polishing this soon outdated implementation.
>
> > Where can I learn more about that COPY statement, and how/where that
> > statement might be integrated with the Django ORM?
>
> Plz check the postgres docs 
> (https://www.postgresql.org/docs/current/sql-copy.html), it covers all 
> important low level details. Furthermore check psycopg3 docs (and also 
> psycopg2 docs, if you want to get your hands on the `copy__update` impl).
> I dont think that driving `bulk_update` by COPY FROM for postgres is a 
> good idea, there are quite some semantic differences, also it is slower 
> for tiny changesets than UPDATE FROM VALUES, thought it starts to shine 
> for changesets >1000 (up to ~4 times faster for a 1M changeset compared 
> to `fast_update` in my tests). Maybe it can be added to the postgres 
> subpackage, if there is demand for it.
>
> Feel free to create issues or to comment on open ones. Important pending 
> issues are:
> - proper duplicate check 
> (https://github.com/netzkolchose/django-fast-update/issues/13)
> - good story whether to integrate support for f-expressions back or to 
> keep them out (currently unsupported, as the steps to get this working 
> are very cumbersome)
>
> Cheers,
> Jerch
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/30bdbb05-cc70-4936-8b26-2435fa27ee08n%40googlegroups.com.


Re: general interest in faster bulk_update implementation

2022-10-18 Thread Jörg Breitbart

> pretty quickly, so if you need testing input (Django 3.2, Postgres) I
> can offer feedback from what I find.

Yes testing would be awesome, esp. for edge cases (test coverage for 
default cases is pretty complete for `fast_update` I think).


> Can you tell me more about this statement:
>  > *Note* copy_update will probably never leave the alpha/PoC-state, as
> psycopg3 brings great COPY support, which does a more secure value
> conversion and has a very fast C-version.

Well I created the `copy_update` alternative for postgres just to see 
the advantage of COPY FROM over UPDATE FROM VALUES. The impl uses 
psycopg2's COPY interface, which got heavily revamped in psycopg3, 
including proper value adapters written C. V2 does not have this yet, 
therefore I had to create the value encoders in python, which are less 
strict about values and still ~3 times slower than the C adapters in 
psycopg3.
The message above is meant as a warning, that I dont plan to put too 
much effort into polishing this soon outdated implementation.


> Where can I learn more about that COPY statement, and how/where that
> statement might be integrated with the Django ORM?

Plz check the postgres docs 
(https://www.postgresql.org/docs/current/sql-copy.html), it covers all 
important low level details. Furthermore check psycopg3 docs (and also 
psycopg2 docs, if you want to get your hands on the `copy__update` impl).
I dont think that driving `bulk_update` by COPY FROM for postgres is a 
good idea, there are quite some semantic differences, also it is slower 
for tiny changesets than UPDATE FROM VALUES, thought it starts to shine 
for changesets >1000 (up to ~4 times faster for a 1M changeset compared 
to `fast_update` in my tests). Maybe it can be added to the postgres 
subpackage, if there is demand for it.


Feel free to create issues or to comment on open ones. Important pending 
issues are:
- proper duplicate check 
(https://github.com/netzkolchose/django-fast-update/issues/13)
- good story whether to integrate support for f-expressions back or to 
keep them out (currently unsupported, as the steps to get this working 
are very cumbersome)


Cheers,
Jerch

--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/4a765e10-a829-720f-1795-82ae5a1dc88d%40netzkolchose.de.


Re: general interest in faster bulk_update implementation

2022-10-18 Thread jobe...@gmail.com
Jerch,

I love that you're improving the `bulk_update` performance with your 
package. I am definitely looking to adopt it. I can start working on it 
pretty quickly, so if you need testing input (Django 3.2, Postgres) I can 
offer feedback from what I find.

Can you tell me more about this statement:
> *Note* copy_update will probably never leave the alpha/PoC-state, as 
psycopg3 brings great COPY support, which does a more secure value 
conversion and has a very fast C-version.

Where can I learn more about that COPY statement, and how/where that 
statement might be integrated with the Django ORM?

On Saturday, April 30, 2022 at 3:17:01 PM UTC-4 j.bre...@netzkolchose.de 
wrote:

> Released the second version of fast_update 
> (https://pypi.org/project/django-fast-update/), based on some findings 
> above, e.g. it now should work with all recent db engine versions 
> supported by django (despite oracle).
>
> Would be happy to get some tests/feedback, before moving things closer 
> to django itself.
>
> Cheers,
> Jerch
>
>
> Am 29.04.22 um 09:34 schrieb Jörg Breitbart:
> > Have found workarounds for older db engines, which makes the more 
> > demanding version requirements from above obsolete. Db support with 
> > these workaround would be:
> > 
> > - SQLite 3.15+ (should work with Python 3.7+ installer, Ubuntu 18 LTS)
> > - MySQL 5.7+ (older versions should work too, not tested)
> > 
> > The workarounds construct the literal values tables from multiple 
> > SELECTs + UNION ALL, which is perfwise slightly worse for sqlite (~40% 
> > slower), but on par for mysql (well, mysql runs much earlier into stack 
> > issues than with TVC, but this can be configured by the user).
> > 
> > Downside - this creates 2 more code paths for 2 db engine versions, that 
> > would need to be tested with the test battery. The nuisance can be 
> > removed by a later release, once db version support is dropped for other 
> > reasons.
> > 
> > I also found possible fast update pattern for:
> > - oracle 19c (prolly older as well, UNION ALL + correlated update)
> > - oracle 21c (UNION ALL + join update)
> > - SQL Server 2014+ (FROM VALUES pattern)
> > 
> > but this needs anyone else to test and integrate, since I have no 
> > development environments for those. So whether they can gain significant 
> > performance remains uncertain until actually adopted.
> > 
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/60090f39-7a47-403a-baad-442b6b2b7488n%40googlegroups.com.


Re: general interest in faster bulk_update implementation

2022-04-30 Thread Jörg Breitbart
Released the second version of fast_update 
(https://pypi.org/project/django-fast-update/), based on some findings 
above, e.g. it now should work with all recent db engine versions 
supported by django (despite oracle).


Would be happy to get some tests/feedback, before moving things closer 
to django itself.


Cheers,
Jerch


Am 29.04.22 um 09:34 schrieb Jörg Breitbart:
Have found workarounds for older db engines, which makes the more 
demanding version requirements from above obsolete. Db support with 
these workaround would be:


- SQLite 3.15+ (should work with Python 3.7+ installer, Ubuntu 18 LTS)
- MySQL 5.7+ (older versions should work too, not tested)

The workarounds construct the literal values tables from multiple 
SELECTs + UNION ALL, which is perfwise slightly worse for sqlite (~40% 
slower), but on par for mysql (well, mysql runs much earlier into stack 
issues than with TVC, but this can be configured by the user).


Downside - this creates 2 more code paths for 2 db engine versions, that 
would need to be tested with the test battery. The nuisance can be 
removed by a later release, once db version support is dropped for other 
reasons.


I also found possible fast update pattern for:
- oracle 19c (prolly older as well, UNION ALL + correlated update)
- oracle 21c (UNION ALL + join update)
- SQL Server 2014+ (FROM VALUES pattern)

but this needs anyone else to test and integrate, since I have no 
development environments for those. So whether they can gain significant 
performance remains uncertain until actually adopted.




--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/f82a9ee7-d474-db62-40ff-39e61e9feccf%40netzkolchose.de.


Re: general interest in faster bulk_update implementation

2022-04-29 Thread Jörg Breitbart
Have found workarounds for older db engines, which makes the more 
demanding version requirements from above obsolete. Db support with 
these workaround would be:


- SQLite 3.15+ (should work with Python 3.7+ installer, Ubuntu 18 LTS)
- MySQL 5.7+ (older versions should work too, not tested)

The workarounds construct the literal values tables from multiple 
SELECTs + UNION ALL, which is perfwise slightly worse for sqlite (~40% 
slower), but on par for mysql (well, mysql runs much earlier into stack 
issues than with TVC, but this can be configured by the user).


Downside - this creates 2 more code paths for 2 db engine versions, that 
would need to be tested with the test battery. The nuisance can be 
removed by a later release, once db version support is dropped for other 
reasons.


I also found possible fast update pattern for:
- oracle 19c (prolly older as well, UNION ALL + correlated update)
- oracle 21c (UNION ALL + join update)
- SQL Server 2014+ (FROM VALUES pattern)

but this needs anyone else to test and integrate, since I have no 
development environments for those. So whether they can gain significant 
performance remains uncertain until actually adopted.


--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/70dec78a-d37d-fd54-ece4-8d2d7a88f42b%40netzkolchose.de.


Re: general interest in faster bulk_update implementation

2022-04-28 Thread Jörg Breitbart

Some findings/ideas on possible expression field support:

The SQL standard requires, that within a single UPDATE command the 
column ordering does not matter. This effectively means, that the RHS of 
the SET clause operates on old record values, while LHS gets the new 
values, which dont manifest before the whole command returned.*


This allows a more aggressive transformation of updating fields with 
expression values - those field updates could be pulled upfront and 
executed separately, before applying changes from literal values in the 
VALUES table. But since the expression values might be randomly 
scattered across all to-be-updated fields, slicing into 
expression/literal value fields at individual objects would lead to tons 
of VALUES tables with different dimensions + extra field ref updates 
(again creating high query load).


Suggestion:
A more straight forward way excludes objects with expression values from 
the VALUES table handling, and updates them upfront with the fallback 
path. This should still be an equivalent transformation in set theory, 
given that expressions cannot directly ref a different row. Is that 
always the case? (Otherwise we have a bigger problem with row order 
dependency during an update...)
Compared to the individual object/field splitting, this way has several 
advantages:

- easier to comprehend (at least for me)
- avoids building CASE chains in the VALUES table update command (keeps 
the fast path fast)
- much easier to test (as branching happens in two main blocks, and not 
deep at single object-update level creating a deep control-flow 
branching hell)
- automatically deals with the MySQL update issue the same way as the 
current bulk_update implementation



While I think that this should produce the same update results as the 
current bulk_update implementation, I may have overlooked crucial 
details, that need further to be addressed. Esp. around multi-table 
inheritance the correct field update ordering is not yet clear to me - 
Is this always done as second update after the local fields? How does 
the ascent into parent fields work here? Is this python-mro ordered?



Cheers,
jerch


[*] MySQL/MariaDB have a non-standard UPDATE implementation in this 
regard, as can be tested in this fiddle: 
https://dbfiddle.uk/?rdbms=mariadb_10.3=5e596bc7d256ad85a5fd4718acd46496 
(works for all other db engines...)

Should this be noted in the .update docs?

--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/efa48bac-ae4e-5aaa-c65b-922a116959ee%40netzkolchose.de.


Re: general interest in faster bulk_update implementation

2022-04-27 Thread Jörg Breitbart

@Mariusz

Thx for the update on the db support roadmap, helps alot to get things 
into perspective.



@expression support
After a few more tests it turns out, that expression support prolly can 
be saved exactly as it is by a combination of the current CASE chaining 
with the VALUES table. At least sqlite and postgres support the 
following format (mysql variants not tested yet):


UPDATE table SET f1 = CASE pk=1 THEN f2 ELSE foo.column2 END FROM 
(VALUES (1, NULL), (2, 'Hello')) AS foo where table.pk = foo.column1;


This would be the result from something like
bulk_update([
  Model(pk=1, f1=F('f2')),
  Model(pk=..., f1=),
  ...], ['f1'])

This again introduces the bad runtime of the CASE chains into the 
statement, but only for objects using expression fields, so the user has 
limited control over the runtime needs (dont use expressions if you want 
very fast value pumping...)


Advantage of this - it should do exactly the same updates as the 
bulk_update implementation (no weird ordering side effects), thus it 
gets alot easier to prove, that the fast and fallback code path exhibit 
the same update behavior.



Cheers,
jerch

--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/754232be-68e1-1472-7371-2195688b32fa%40netzkolchose.de.


Re: general interest in faster bulk_update implementation

2022-04-26 Thread Mariusz Felisiak
Support for MariaDB 10.2 is already dropped in Django 4.1. We will drop 
support for MySQL 5.7 in Django 4.2 or 5.0 (probably 4.2).

See https://code.djangoproject.com/wiki/SupportedDatabaseVersions for more 
details.

Best,
Mariusz

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/d4576ed7-b8d1-4158-a7b7-9069e7099adcn%40googlegroups.com.


Re: general interest in faster bulk_update implementation

2022-04-26 Thread Jörg Breitbart

@Florian

Thx for your response.

Looking through the release notes and the listed databases I find these 
version requirements:

- PostgreSQL 10+
- MariaDB 10.2+
- MySQL 5.7+
- Oracle 19c+
- SQLite 3.9.0+

Compared to the UPDATE FROM VALUES pattern requirements:
- MariaDB 10.3.3+
- MySQL 8.0.19+
- Oracle currently no impl at all
- SQLite 3.33+

thus only postgres would work out of the box. Question then is, whether 
to raise version requirements for django. Imho a good indicator for that 
might be the age of a db release, its EOL state, and whether it is still 
part of LTS distros:

- MariaDB 10.2: EOL 04/2022
- MariaDB 10.3: released in 04/2018
  (Ubuntu 20.04 LTS is on 10.3 line, 18.04 LTS on 10.1)
  --> prolly safe to raise to 10.3 line?

- MySQL 5.7: EOL 10/2023
- MySQL 8.0: released in 04/2018
  (Ubuntu 20.04 LTS contains 8.0 line, 18.04 LTS on 5.7)
  --> 5.7 is still within lifetime for 1.5ys
  --> Cut old ropes early here?

- SQLite: 3.22 on ubuntu 18.04, 3.31 on ubuntu 20.04
  --> imho 3.33+ cannot be requested here, as upgrading sqlite3
  packages is much more of a hassle for peeps

- others non supported (incl. oracle):
Should there be a default fallback within django? Or should db vendors
be bugged for implementing an abstract interface for UPDATE FROM VALUES?

Especially the last 2 points (sqlite and general db vendor compat) are 
tricky. Here I think a general fallback within django ORM might be the 
only way to not let db version issues surface its way to the user. Not 
sure yet, how practical/maintainable that would be in the end, as it 
would have to provide 2 internal code paths for the same bulk_update API 
endpoint:

- fast one, if UPDATE FROM VALUES pattern is supported
- fallback for backends not supporting the fast update pattern

Thinking the problem from a db vendor perspective, it could look like 
this in a db package (just brainstorming atm):

- a flag indicating support for UPDATE FROM VALUES pattern
- the flag result might be active code, if the db driver has
  to test support on server side (thats the currently case for mysql)
- to provide an easy upgrade path for db vendors, the flag might be 
missing on the db backend at first (hasattr is your friend)

- if supported: implementation of an abstract ORM interface

While writing this down it kinda became clear to me, that for easy 
transition of the db backends, a fallback impl in the ORM always would 
be needed. Furthermore with that flag scheme in the db backends a strict 
version match is not needed at all, as the db backend could always say 
"nope, cannot do that" and the fallback would kick in. This fallback 
could be the current bulk_update impl.


The downside of such an approach is clearly the needed code complexity, 
furthermore the ORM would leave its mostly(?) ISO/ANSI grounds and have 
to delegate the real sql creation to vendor specific implementation in 
the backends.



@f-expressions
Yes the current .bulk_update implementation inherits the expression 
support from .update (kinda passes things along). I am currently not 
sure, if that can be mimicked 100% around an optimized implementation by 
pre-/post executing updates for those fields, as col/row ordering might 
have weird side effects. I'd first need to do some tests how the current 
implementation deals with field refs while the ref'ed field itself gets 
updated before/after the ref usage. (Not even sure if thats guaranteed 
to always do the same thing across db engines)
Pre-/posthandling of f-expressions will slow down the code, as those 
most likely have to go into the SET clause of a second update statement. 
There might be several faster UNION tricks possible here, but I have not 
tested those.


Cheers,
jerch

--
You received this message because you are subscribed to the Google Groups "Django 
developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/1defa341-9b2a-e716-0319-ff29582ca77e%40netzkolchose.de.


Re: general interest in faster bulk_update implementation

2022-04-26 Thread Florian Apolloner
Hi Jörg,

Regarding your question about database version support: 
https://docs.djangoproject.com/en/4.0/ref/databases/ lists the supported 
versions. Granted not as support grid and you have to scroll to every 
database, but it is usually in the first paragraph.

If f-expressions are currently supported your implementation could fall 
back to a slower implementation if it encounters them?

Cheers,
Florian 

On Sunday, April 24, 2022 at 10:19:16 PM UTC+2 Jörg Breitbart wrote:

> Hi there,
>
> I have recently tried to tackle the known performance issues of 
> bulk_update. For more context plz see the original thread: 
> https://code.djangoproject.com/ticket/31202
>
> In another issue we kinda ended up at the same idea again for a 
> different reason (https://code.djangoproject.com/ticket/33647).
>
> My early attempt for a faster implementation: 
> https://github.com/netzkolchose/django-fast-update
>
> This impl has several drawbacks/restrictions which I'd like to discuss, 
> before wasting more time to get things more aligned with the ORM, esp 
> these aspects:
> - missing f-expression support
> - only support for recent db engines
>
> @f-expressions
> This is not directly possible from a constant table (thats how postgres 
> calls tables created from VALUES(...)). At this point idk if 
> f-expressions is an actively used thing with bulk_update at all, or if 
> support for those would just penalize the update throughput for no good 
> reason. My impl above currently skips such a workaround. Would like to 
> hear some more opinions about that.
>
> @db engine support
> Are there any strict db support promises made by django releases? I see 
> that some django versions have notes about certain db engines and 
> versions, but I cant find a "support grid" or an explicit doc for that. 
> If django does not officially restrict db versions, then the impl above 
> would need some sort of a fallback for unsupported db versions.
>
> Plz also tell me, if this all sounds too wild for getting considered 
> into django (tbh the db version dependencies bug me alot), then it 
> prolly is better kept in a 3rd party package.
>
> Regards,
> jerch
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/7f6e3d0b-1bf0-4c41-8877-f9a28913d2ccn%40googlegroups.com.