On 2017-12-07 20:21:45 -0800, Alex Tokarev wrote:
> I have a set of tables with fairly large number of columns, mostly int with
> a few bigints and short char/varchar columns. I¹ve noticed that Postgres is
> pretty slow at inserting data in such a table. I tried to tune every
> possible setting: using unlogged tables, increased shared_buffers, etc; even
> placed the db cluster on ramfs and turned fsync off. The results are pretty
> much the same with the exception of using unlogged tables that improves
> performance just a little bit.
> I have made a minimally reproducible test case consisting of a table with
> 848 columns, inserting partial dataset of 100,000 rows with 240 columns. On
> my dev VM the COPY FROM operation takes just shy of 3 seconds to complete,
> which is entirely unexpected for such a small dataset.
I don't find this to be this absurdly slow. On my laptop loading with a
development checkout this takes 1223.950 ms. This is 20mio fields
parsed/sec, rows with 69mio fields/sec inserted. Removing the TRUNCATE
and running the COPYs concurrently scales well to a few clients, and
only stops because my laptop's SSD stops being able to keep up.
That said, I do think there's a few places that could stand some
improvement. Locally the profile shows up as:
+ 15.38% postgres libc-2.25.so [.] __GI_____strtoll_l_internal
+ 11.79% postgres postgres [.] heap_fill_tuple
+ 8.00% postgres postgres [.] CopyFrom
+ 7.40% postgres postgres [.] CopyReadLine
+ 6.79% postgres postgres [.] ExecConstraints
+ 6.68% postgres postgres [.] NextCopyFromRawFields
+ 6.36% postgres postgres [.] heap_compute_data_size
+ 6.02% postgres postgres [.] pg_atoi
the strtoll is libc functionality triggered by pg_atoi(), something I've
seen show up in numerous profiles. I think it's probably time to have
our own optimized version of it rather than relying on libcs.
That heap_fill_tuple(), which basically builds a tuple from the parsed
datums, takes time somewhat proportional to the number of columns in the
table seems hard to avoid, especially because this isn't something we
want to optimize for with the price of making more common workloads with
fewer columns slower. But there seems quite some micro-optimization
That ExecConstraints() shows up seems unsurprising, it has to walk
through all the table's columns checking for constraints. We could
easily optimize this so we have a separate datastructure listing
constraints, but that'd be slower in the very common case of more
reasonable numbers of columns.
The copy implementation deserves some optimization too...
> Here¹s a tarball with test schema and data:
> http://nohuhu.org/copy_perf.tar.bz2; it¹s 338k compressed but expands to
> ~50mb. Here¹s the result of profiling session with perf: