2018-03-21 14:04 GMT+01:00 Gary Cowell <gary.cow...@gmail.com>:
> Thank you Pavel for those ideas.
>
> I should probably have mentioned we don't have access to the file
> system on the PostgreSQL server, as it's provided by Amazon AWS RDS
> service.
>
> These functions look good when you can push the file to be loaded into
> the database file system.
>
> I'll see if it's possible to do this on AWS PostgreSQL RDS service but
> this sort of thing is usually not
>
The lo API doesn't need file access:

https://www.postgresql.org/docs/9.2/static/lo-interfaces.html

You can use the lo_write function.

> On 21 March 2018 at 12:59, Pavel Stehule <pavel.steh...@gmail.com> wrote:
> >
> > 2018-03-21 13:56 GMT+01:00 Pavel Stehule <pavel.steh...@gmail.com>:
> >>
> >> 2018-03-21 13:03 GMT+01:00 Gary Cowell <gary.cow...@gmail.com>:
> >>>
> >>> We are trying to implement PostgreSQL code to load a large object into
> >>> a PostgreSQL bytea in chunks, to avoid loading the file into memory in
> >>> the client.
> >>>
> >>> The first attempt was to do
> >>>
> >>> update build_attachment set chunk = chunk || newdata ;
> >>>
> >>> but this did not scale and got significantly slower after 4000-5000
> >>> updates.
> >>>
> >>> The chunks are 4K in size, and I'm testing with a 128MB input file,
> >>> requiring 32,774 chunk updates.
> >>>
> >>> Next, I tried creating an aggregate (taken from Stack Overflow):
> >>>
> >>> CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
> >>>
> >>> and changed the code to insert the chunks into a temporary table:
> >>>
> >>> create temporary table build_attachment (seq bigserial primary key,
> >>> chunk bytea ) on commit drop;
> >>>
> >>> We then insert our 4K chunks into this, which takes very little time
> >>> (20 seconds for the 32,774 inserts).
> >>>
> >>> Here's an example, though, of trying to select the aggregate:
> >>>
> >>> gary=> \timing
> >>> Timing is on.
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 4000 \g output
> >>> Time: 13372.843 ms
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 8000 \g output
> >>> Time: 54447.541 ms
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 16000 \g output
> >>> Time: 582219.773 ms
> >>>
> >>> So those partial aggregates completed in somewhat acceptable times,
> >>> but ...
> >>>
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 32000 \g output
> >>>
> >>> this one hadn't completed in an hour - the PostgreSQL connection
> >>> process for my connection on the server goes to 100% CPU and stays
> >>> there, not using much RAM, not doing much IO, oddly.
> >>>
> >>> EXPLAINing these aggregate selects doesn't show anything useful.
> >>>
> >>> Am I doomed to not be able to update a bytea this way? Is there some
> >>> way I can tune this?
> >>>
> >>
> >> bytea is an immutable object without preallocation, so updating a big
> >> value repeatedly is very expensive.
> >>
> >> I am thinking that using the LO API and then transforming to bytea will
> >> be much more effective.
> >>
> >> \lo_import path
> >>
> >> then you can use this function:
> >>
> >> CREATE OR REPLACE FUNCTION attachment_to_bytea(attachment oid)
> >> RETURNS bytea AS $$
> >> DECLARE
> >>   fd integer;
> >>   size integer;
> >> BEGIN
> >>   fd := lo_open(attachment, 262144);
> >>   size := lo_lseek(fd, 0, 2);
> >>   PERFORM lo_lseek(fd, 0, 0);
> >>   RETURN loread(fd, size);
> >> EXCEPTION WHEN undefined_object THEN
> >>   PERFORM lo_close(fd);
> >>   RETURN NULL;
> >> END;
> >> $$ LANGUAGE plpgsql STRICT SECURITY DEFINER SET search_path = 'pg_catalog';
> >>
> >> With this function, an import of about 44MB was done in a few seconds.
> >
> > there is a native function lo_get
> >
> > https://www.postgresql.org/docs/current/static/lo-funcs.html
> >
> >> Regards
> >>
> >> Pavel
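
A minimal sketch of the descriptor-based approach described above, done entirely through the server-side large object functions so no file system access on the RDS host is needed. The OID 16411, the descriptor 0, the attachments table, and the chunk literal are illustrative placeholders rather than values from the thread, and lo_get requires PostgreSQL 9.4 or later:

BEGIN;

-- create an empty large object; the server assigns and returns its OID
SELECT lo_creat(-1);                      -- suppose it returns 16411

-- open it for writing (131072 = INV_WRITE, 0x20000); returns a descriptor
SELECT lo_open(16411, 131072);            -- suppose it returns 0

-- the client sends each 4K chunk as a bind parameter; descriptors are only
-- valid inside the transaction that opened them, so keep this in one
-- transaction
SELECT lowrite(0, '\xdeadbeef'::bytea);   -- repeat once per chunk, in order

SELECT lo_close(0);

-- convert the whole large object to bytea in one call (lo_get is 9.4+);
-- the target table and column here are placeholders
UPDATE attachments SET chunk = lo_get(16411) WHERE id = 1;

-- remove the large object once it has been copied
SELECT lo_unlink(16411);

COMMIT;

On 9.4 and later the open/write/close sequence can also be replaced by one lo_put(16411, offset, chunk) call per chunk, which needs no descriptor and no single enclosing transaction. Either way the bytea is built once at the end instead of being recopied on every append, which is why the chunk = chunk || newdata updates kept getting slower.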