On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote:
> > I generally agree with the impression that parsing CSV is tricky and
> > unlikely to benefit from parallelism in general. There may be cases
> > with restrictions making it easier (e.g. restrictions on the format),
> > but that might be a bit too complex to start with.
> >
> > For example, I had an idea to parallelise the parsing by splitting it
> > into two phases:
> 
> FWIW, I think we ought to rewrite our COPY parsers before we go for
> complex schemes. They're way slower than a decent green-field
> CSV/... parser.


Yep, that's quite possible.
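
For reference, the kind of tight loop a green-field CSV tokenizer is
built around looks roughly like the sketch below. To be clear, this is
not PostgreSQL's actual CopyReadAttributesCSV() - the names and behavior
are made up for illustration, and it punts on encodings, CR/LF endings,
and any escaping beyond doubled quotes:

/*
 * Minimal sketch of a quote-aware CSV record splitter.  This is NOT
 * PostgreSQL's CopyReadAttributesCSV() -- names and behavior are made up
 * for illustration: no encoding awareness, "" is the only escape, and
 * only LF line endings are handled.
 */
#include <stdio.h>
#include <string.h>

typedef struct
{
    const char *start;          /* first byte of the field (quotes included) */
    size_t      len;            /* field length in bytes */
} CsvField;

/*
 * Split one record starting at buf[*pos] into fields.  Returns the number
 * of fields found and advances *pos past the record's terminating newline.
 */
static int
csv_split_record(const char *buf, size_t buflen, size_t *pos,
                 CsvField *fields, int maxfields)
{
    int         nfields = 0;
    size_t      i = *pos;

    while (i < buflen && nfields < maxfields)
    {
        size_t      fstart = i;
        int         in_quotes = 0;

        while (i < buflen)
        {
            char        c = buf[i];

            if (in_quotes)
            {
                if (c == '"')
                {
                    if (i + 1 < buflen && buf[i + 1] == '"')
                        i++;            /* "" -> escaped quote, stay quoted */
                    else
                        in_quotes = 0;  /* closing quote */
                }
            }
            else if (c == '"')
                in_quotes = 1;
            else if (c == ',' || c == '\n')
                break;                  /* field or record terminator */
            i++;
        }

        fields[nfields].start = buf + fstart;
        fields[nfields].len = i - fstart;
        nfields++;

        if (i >= buflen || buf[i] == '\n')
        {
            if (i < buflen)
                i++;                    /* consume the newline */
            break;
        }
        i++;                            /* consume the comma */
    }

    *pos = i;
    return nfields;
}

int
main(void)
{
    const char  buf[] = "id,\"name, with comma\",42\n";
    size_t      pos = 0;
    CsvField    fields[16];
    int         n = csv_split_record(buf, strlen(buf), &pos, fields, 16);

    for (int f = 0; f < n; f++)
        printf("field %d: |%.*s|\n", f, (int) fields[f].len, fields[f].start);
    return 0;
}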


> > The one piece of information I'm missing here is at least a very rough
> > quantification of the individual steps of CSV processing - for example,
> > if parsing takes only 10% of the time, it's pretty pointless to start
> > by parallelising this part and we should focus on the rest. If it's
> > 50%, it might be a different story. Has anyone done any measurements?
> 
> Not recently, but I'm pretty sure that I've observed CSV parsing to be
> way more than 10%.


Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
so I still think we need to do some measurements first. I'm willing to
do that, but (a) I doubt I'll have time for that until after 2020-03,
and (b) it'd be good to agree on some set of typical CSV files.
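
As a strawman for such measurements, a toy harness along these lines can
at least ballpark raw tokenizing throughput - generate a synthetic buffer
(the rows/fields/width shape below is an assumption, not a proposal for
the "typical" set) and time a quote-aware scan over it. Comparing that
against end-to-end COPY throughput on the same data gives a first rough
split, though profiling an actual COPY run (e.g. with perf) would be the
real measurement:

/*
 * Toy harness for ballparking raw CSV-scanning throughput.  The data
 * shape (NROWS x NFIELDS fixed-width unquoted fields) is an assumption
 * chosen for illustration, not an agreed-on "typical" CSV file.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NROWS   1000000
#define NFIELDS 10
#define FWIDTH  8

int
main(void)
{
    size_t      buflen = (size_t) NROWS * NFIELDS * (FWIDTH + 1);
    char       *buf = malloc(buflen);
    size_t      n = 0;

    if (buf == NULL)
        return 1;

    /* build NROWS records of NFIELDS fixed-width unquoted fields */
    for (long r = 0; r < NROWS; r++)
        for (int f = 0; f < NFIELDS; f++)
        {
            memset(buf + n, 'a' + (f % 26), FWIDTH);
            n += FWIDTH;
            buf[n++] = (f == NFIELDS - 1) ? '\n' : ',';
        }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* quote-aware scan: count fields, the core work of CSV tokenizing */
    long        nf = 0;
    int         in_quotes = 0;
    for (size_t i = 0; i < n; i++)
    {
        char        c = buf[i];

        if (in_quotes)
            in_quotes = (c != '"');     /* "" escapes toggle twice; fine for a toy */
        else if (c == '"')
            in_quotes = 1;
        else if (c == ',' || c == '\n')
            nf++;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double      secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("%ld fields, %.1f MB in %.3f s (%.1f MB/s)\n",
           nf, n / 1e6, secs, n / 1e6 / secs);
    free(buf);
    return 0;
}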

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

