Hi!

I'm a developer working on MonetDB, a column-oriented SQL database.  See
https://www.monetdb.org.

I've created a JdbcDialect for MonetDB, and it seems to work fine. The
source code is at https://github.com/MonetDB/monetdb-spark.

Unfortunately, it turns out that the JDBC Data Source is fast at
downloading data from the database but really slow at uploading it.
The reason it's so slow is that it issues a separate INSERT statement
for each row.
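For concreteness, here is a rough sketch of what that row-at-a-time
path boils down to (the table name and column count are made up for
illustration; this is not the actual Spark code):

```java
import java.util.StringJoiner;

public class RowAtATimeSketch {
    // Build "INSERT INTO tbl VALUES (?, ?, ?)" for the given column count.
    static String buildInsert(String table, int columns) {
        StringJoiner placeholders = new StringJoiner(", ", "(", ")");
        for (int i = 0; i < columns; i++) {
            placeholders.add("?");
        }
        return "INSERT INTO " + table + " VALUES " + placeholders;
    }

    public static void main(String[] args) {
        // The statement is prepared once, then executed (or batched)
        // once per row -- a round trip per row/batch is where the
        // upload time goes.
        System.out.println(buildInsert("foo", 3));
    }
}
```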

To work around this, I implemented a custom data source that uses
MonetDB's COPY BINARY INTO feature to upload data much more
efficiently. This is orders of magnitude faster, but it currently only
supports Append mode. I would also like to support Overwrite mode, but
that turned out to be harder than expected.
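For readers who haven't seen the feature: the statement my Append path
ends up sending looks roughly like this (a sketch only; the table and
per-column file names are illustrative):

```java
import java.util.List;
import java.util.StringJoiner;

public class CopyBinarySketch {
    // Build a MonetDB bulk-load statement: one binary file per column,
    // streamed from the client side ("ON CLIENT").
    static String buildCopyBinary(String table, List<String> files) {
        StringJoiner from = new StringJoiner(", ");
        for (String f : files) {
            from.add("'" + f + "'");
        }
        return "COPY LITTLE ENDIAN BINARY INTO " + table
                + " FROM " + from + " ON CLIENT";
    }

    public static void main(String[] args) {
        System.out.println(
            buildCopyBinary("foo", List.of("col1.bin", "col2.bin")));
    }
}
```

One such statement loads a whole partition, instead of one round trip
per row.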

It seems that table-existence checks and table creation are part of
org.apache.spark.sql.catalog.Catalog. Do I have to hook into that
somehow? And if so, how does my

    dataframe
        .write()
        .format("org.monetdb.spark")
        .mode(SaveMode.Overwrite)
        .option("url", url)
        .option("dbtable", "foo")
        .save()

find my catalog? The Catalog interface also contains lots of methods
that I don't really understand. Do I have to implement all of them?
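My current guess, which may well be wrong, is that a v2 catalog is
normally registered through configuration, something like:

```
# spark-defaults.conf -- the class name and options here are hypothetical
spark.sql.catalog.monetdb      org.monetdb.spark.MonetDBCatalog
spark.sql.catalog.monetdb.url  jdbc:monetdb://localhost:50000/demo
```

so that writes would go through something like
dataframe.writeTo("monetdb.foo").createOrReplace() instead of save().
But I don't see how the save()-based path above would locate such a
catalog.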

Can someone give me an overview of the big picture?


Note: another approach would be not to implement a v2 DataSource at
all, but to more or less "subclass" the v1 JDBC Data Source, as the
now-abandoned SQL Server connector seems to do:

    https://github.com/microsoft/sql-spark-connector

Would that still be the way to go?


Best regards,

Joeri van Ruth

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
