Hi Mich!

First, thanks for the clear explanation!

However, it doesn't really answer my question. I understand why the JDBC Data Source works the way it does. I also understand and accept that moving data OUT of Spark is never going to get as much attention from the Spark devs as moving data INTO Spark does :)

My question is: when I do need to upload lots of data, the JDBC source just doesn't cut it. Right now my users export to CSV and use external tooling to import that into our database, but I would prefer them to just be able to use dataframe.write().

So if I write a new Data Source specific to my database, how do I best go about it? Do I use the v2 API, and if so, how do I integrate with its table detection/creation framework? Or do I still use the v1 API even though v2 is available?

I'm currently just adding my own table detection and creation code (rough sketch below), but I imagine that is not really 'the Spark way'. I can't see the forest for the trees!
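For context, this is roughly the shape I have been experimenting with for the v2 API. It is only a sketch: the class names are placeholders (not the real monetdb-spark classes), the schema lookup is stubbed out, and I am guessing that SaveMode.Overwrite ends up calling truncate() on the WriteBuilder, which is exactly the part I am unsure about.

    // Rough sketch only; placeholder names, and my reading of the API may be off.
    package org.monetdb.spark;

    import java.util.Map;
    import java.util.Set;

    import org.apache.spark.sql.connector.catalog.SupportsWrite;
    import org.apache.spark.sql.connector.catalog.Table;
    import org.apache.spark.sql.connector.catalog.TableCapability;
    import org.apache.spark.sql.connector.catalog.TableProvider;
    import org.apache.spark.sql.connector.expressions.Transform;
    import org.apache.spark.sql.connector.write.LogicalWriteInfo;
    import org.apache.spark.sql.connector.write.SupportsTruncate;
    import org.apache.spark.sql.connector.write.WriteBuilder;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    public class DefaultSource implements TableProvider {
        @Override
        public StructType inferSchema(CaseInsensitiveStringMap options) {
            // Here I currently connect to MonetDB myself and read the target
            // table's schema (or decide that it needs to be created).
            return lookUpSchemaInMonetDb(options);
        }

        @Override
        public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
            return new MonetTable(schema, properties);
        }

        private StructType lookUpSchemaInMonetDb(CaseInsensitiveStringMap options) {
            throw new UnsupportedOperationException("placeholder for my own detection code");
        }
    }

    class MonetTable implements SupportsWrite {
        private final StructType schema;
        private final Map<String, String> properties;

        MonetTable(StructType schema, Map<String, String> properties) {
            this.schema = schema;
            this.properties = properties;
        }

        @Override public String name() { return properties.get("dbtable"); }
        @Override public StructType schema() { return schema; }

        @Override public Set<TableCapability> capabilities() {
            // My assumption: TRUNCATE is what allows SaveMode.Overwrite through save().
            return Set.of(TableCapability.BATCH_WRITE, TableCapability.TRUNCATE);
        }

        @Override public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
            return new MonetWriteBuilder(properties);
        }
    }

    class MonetWriteBuilder implements WriteBuilder, SupportsTruncate {
        private final Map<String, String> properties;
        private boolean truncate = false;

        MonetWriteBuilder(Map<String, String> properties) { this.properties = properties; }

        @Override public WriteBuilder truncate() {
            // Presumably called when the user asked for SaveMode.Overwrite?
            this.truncate = true;
            return this;
        }

        // build() would then return a Write/BatchWrite that empties the table
        // (if requested) and runs the existing COPY BINARY INTO upload.
    }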

Joeri


On 3/6/26 11:06 PM, Mich Talebzadeh wrote:
Hi,

Spark uses the JDBC data source/connector to read data. Internally, the database performs a scan of the table and streams the rows out, i.e. a sequential table scan or a streaming result set.

Writing through JDBC is different: each row must be processed as a transactional operation.

Spark row
   │
   ▼
INSERT statement
   │
   ▼
database engine (monetDB, Oracle etc)
   ├─ constraint checks
   ├─ index updates
   ├─ transaction logging
   └─ storage update

So the database does a lot more work for each row. This creates a row-by-row workflow, which looks serial and inefficient, as you observed:

INSERT row 1
INSERT row 2

Spark’s generic JDBC writer deliberately avoids database-specific features and therefore falls back to the safest universally supported mechanism, i.e. standard SQL INSERT statements. JDBC is designed to work with all relational databases, such as Oracle, MonetDB, etc. Because of that, Spark must use the lowest common denominator that every database supports.

That lowest common denominator is

INSERT INTO table VALUES (...)

In short, because the JDBC interface is database-agnostic, Spark uses the safest universally supported operation (standard SQL INSERT). Since Spark cannot assume the availability of database-specific bulk loaders, the generic implementation often inserts rows individually.
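To make that concrete, the generic write path is essentially a reusable prepared INSERT, possibly flushed in batches through the driver; even then, every row still goes through the full INSERT path inside the engine. The following is only an illustration of the pattern (the table, column count and batch size are made up here), not Spark's actual implementation:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class GenericJdbcInsert {
    // Writes rows with plain INSERT statements, the lowest common denominator.
    public static void writeRows(String url, List<Object[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO foo VALUES (?, ?)")) {
            conn.setAutoCommit(false);
            int pending = 0;
            for (Object[] row : rows) {
                stmt.setObject(1, row[0]);
                stmt.setObject(2, row[1]);
                stmt.addBatch();
                if (++pending == 1000) {   // comparable to the JDBC source's batchsize option
                    stmt.executeBatch();
                    pending = 0;
                }
            }
            stmt.executeBatch();           // flush the remainder
            conn.commit();
        }
    }
}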

HTH,

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics & Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based Analytics

View my LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/





On Fri, 6 Mar 2026 at 12:43, Joeri van Ruth via dev <[email protected]> wrote:

    Hi!

    I'm a developer working on MonetDB, a column-oriented SQL
    database.  See
    https://www.monetdb.org.

    I've created a JdbcDialect for MonetDB, it seems to work fine. The
    source code is at https://github.com/MonetDB/monetdb-spark.

    Unfortunately it turns out the JDBC Data Source is good at downloading
    data from the database but really slow when uploading. The reason it's
    so slow is that it uses a separate INSERT statement for each row.

    To work around this, I implemented a custom data source that uses
    MonetDB's COPY BINARY INTO feature to more efficiently upload data.
    This is orders of magnitude faster, but it currently only supports
    Append mode. I would like to also support Overwrite mode. This
    turned out to be harder than expected.
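
    Concretely, I picture Overwrite boiling down to something like the
    sketch below: empty the table, then reuse the existing bulk upload.
    The SQL and the helper are placeholders for my actual COPY BINARY
    INTO code.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class OverwriteSketch {
            public static void overwrite(String url, String table) throws Exception {
                try (Connection conn = DriverManager.getConnection(url)) {
                    conn.setAutoCommit(false);
                    try (Statement stmt = conn.createStatement()) {
                        // Empty the target table first (DELETE vs TRUNCATE to be decided).
                        stmt.execute("DELETE FROM " + table);
                    }
                    copyBinaryInto(conn, table);  // placeholder for the current append-mode upload
                    conn.commit();
                }
            }

            private static void copyBinaryInto(Connection conn, String table) {
                // Stands in for the bulk-load code the connector already has.
            }
        }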

    It seems the table existence check and table creation functionality are
    part of org.apache.spark.sql.catalog.Catalog. Do I have to hook into that
    somehow? And if so, how does my

        dataframe
            .write()
            .format("org.monetdb.spark")
            .mode(SaveMode.Overwrite)
            .option("url", url)
            .option("dbtable", "foo")
            .save()

    find my catalog? The Catalog interface also contains lots of methods
    that I don't really understand; do I have to implement all of these?

    Can someone give me an overview of the big picture?


    Note: another approach would be not to implement a v2 DataSource at all,
    but to more or less "subclass" the v1 JDBC Data Source, like the
    now-abandoned SQL Server dialect seems to do:

    https://github.com/microsoft/sql-spark-connector.

    Would that still be the way to go?


    Best regards,

    Joeri van Ruth

    ---------------------------------------------------------------------
    To unsubscribe e-mail: [email protected]
