I'm planning to apply to GSoC'17, and my proposal currently consists of two parts:
(1) Add error handling to COPY as a minimum program
Motivation: Having used PG on a daily basis for years, I have found that there
are cases when you need to load (e.g. for further analytics) a bunch of not
very consistent records with rare type/column mismatches. Since PG throws an
exception on the first error, currently the only solution is to preformat your
data with some other tool and then load it into PG. However, it is frequently
easier to drop certain records than to do such preprocessing for every data source.
I have done a little research and found the item in PG's TODO
https://wiki.postgresql.org/wiki/Todo#COPY, previous attempt to push similar
There were no negative responses to this patch, and it seems that it was
just forgotten and never finalized.
As an example of the general idea, I can point to the read_csv method of the Python package
It uses a C parser which throws an error on the first column mismatch. However,
it has two flags, error_bad_lines and warn_bad_lines, which, when set to False,
make it drop bad lines or even hide the warning messages about them.
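To make the intended behaviour more concrete, here is a minimal sketch of "drop bad records instead of failing on the first one". It is plain Python using only the standard csv module; it is neither PostgreSQL code nor the read_csv implementation, and the function name copy_skip_bad_rows and the converters argument are hypothetical, chosen only for illustration. A row is rejected either on a column-count mismatch or when a value cannot be converted to the expected type.

```python
import csv
import io


def copy_skip_bad_rows(text, converters, delimiter=","):
    """Parse delimited text, keeping good rows and collecting bad ones.

    `converters` holds one callable per expected column (hypothetical
    stand-in for the target table's column types).
    Returns (loaded_rows, rejected), where rejected is a list of
    (line_number, reason) pairs.
    """
    loaded, rejected = [], []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for fields in reader:
        line_no = reader.line_num
        # Column mismatch: record the problem and move on.
        if len(fields) != len(converters):
            rejected.append(
                (line_no, "expected %d columns, got %d"
                 % (len(converters), len(fields)))
            )
            continue
        # Type mismatch: a failed conversion rejects the whole row.
        try:
            loaded.append(
                tuple(conv(value) for conv, value in zip(converters, fields))
            )
        except ValueError as exc:
            rejected.append((line_no, str(exc)))
    return loaded, rejected


sample = "1,alice,3.5\n2,bob\nx,carol,1.0\n4,dave,2.25\n"
rows, errors = copy_skip_bad_rows(sample, [int, str, float])
for line_no, reason in errors:
    print("skipped line %d: %s" % (line_no, reason))
```

Here lines 2 (too few columns) and 3 (non-integer id) are skipped and reported, while the other two rows are loaded. Collecting the rejected lines rather than silently dropping them corresponds to keeping the warnings on; whether COPY should report, log or store such rows is an open design question.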
(2) Parallel COPY execution as a maximum program
I guess the motivation needs little explanation: COPY should simply be faster
on multicore CPUs.
There is also an entry about parallel COPY in PG's wiki
https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some
third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is
always better to have well-performing core functionality out of the box.
My main concerns here are:
1) Is there anyone in the PG community who would be interested in such a
project and could be a mentor?
2) These two points share a general idea: to simplify working with large amounts
of data from different sources, but maybe it would be better to focus on the
3) Is it realistic to mostly finish both parts during the 3+ months of almost
full-time work, or am I too presumptuous?
I would greatly appreciate any comments and criticism.
P.S. I know about the very interesting ready-made projects from the PG community
https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more interesting to
solve your own problems, issues and questions, which are the product of your
experience with software. That's why I dare to propose my own project.
P.P.S. A few words about me: I'm a PhD student in theoretical physics from
Moscow, Russia, and have been heavily involved in software development since 2010. I guess
development and basic understanding of algorithm design and analysis.