Hi,

Spark uses the JDBC data source / connector. to read data. Internally the
database performs a scan of the table and streams the rows out, i.e
sequential table scan or streaming result set

Writing through JDBC is different. Each row must be processed as a
transaction operation.

Spark row
   │
   ▼
INSERT statement
   │
   ▼
database engine (monetDB, Oracle etc)
   ├─ constraint checks
   ├─ index updates
   ├─ transaction logging
   └─ storage update

So the database does a lot more work for each row. This creates a
row-by-row workflow, which looks serial and inefficient as you observed

INSERT row 1
INSERT row 2

Spark’s generic JDBC writer deliberately avoids database-specific features
and therefore falls back to the safest universally supported mechanism i.e
standard SQL INSERT statements.
JDBC is designed to work with all relational databases, such as Oracle,
MonetDB etc. Because of that, Spark must use the lowest common denominator
that every database supports.

That lowest common denominator is

INSERT INTO table VALUES (...)

In short, because the JDBC interface is database-agnostic, Spark uses the
safest universally supported operation (standard SQL INSERT). Since Spark
cannot assume the availability of database-specific bulk loaders, the
generic implementation often inserts rows individually.

HTH,

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev <[email protected]>
wrote:

> Hi!
>
> I'm a developer working on MonetDB, a column-oriented SQL database.  See
> https://www.monetdb.org.
>
> I've created a JdbcDialect for MonetDB, it seems to work fine. The
> source code is at https://github.com/MonetDB/monetdb-spark.
>
> Unfortunately it turns out the JDBC Data Source is good at downloading
> data from the database but really slow when uploading. The reason it's
> so slow is that it uses a separate INSERT statement for each row.
>
> To work around this, I implemented a custom data source that uses
> MonetDB's COPY BINARY INTO feature to more efficiently upload data.
> This is orders of magnitude faster, but it currently only supports
> Append mode. I would like to also support Overwrite mode. This
> turned out to be harder than expected.
>
> It seems the table existence checks and creation functionality is part
> of org.apache.spark.sql.catalog.Catalog. Do I have to hook into that
> somehow? And if so, how does my
>
>     dataframe
>         .write()
>         .source("org.monetdb.spark")
>         .mode(SaveMode.Overwrite)
>         .option("url", url)
>         .option("dbtable", "foo")
>         .save()
>
> find my catalog? The Catalog interface also contains lots of methods
> that I don't really understand, do I have to implement all of these?
>
> Can someone give me an overview of the big picture?
>
>
> Note: another approach would be to not try to implement a v2 DataSource but
> more or less "subclass" the v1 JDBC Data Source like the now abandoned
> SQL Server dialect seems to do:
>
>     https://github.com/microsoft/sql-spark-connector.
>
> Would that still be the way to go?
>
>
> Best regards,
>
> Joeri van Ruth
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>
>

Reply via email to