Hi! I'm a developer working on MonetDB, a column-oriented SQL database. See https://www.monetdb.org.

I've created a JdbcDialect for MonetDB, and it seems to work fine. The source code is at https://github.com/MonetDB/monetdb-spark. Unfortunately, it turns out the JDBC data source is good at downloading data from the database but really slow at uploading it. The reason it's so slow is that it issues a separate INSERT statement for each row. To work around this, I implemented a custom data source that uses MonetDB's COPY BINARY INTO feature to upload data more efficiently. This is orders of magnitude faster, but it currently only supports Append mode.

I would like to also support Overwrite mode, which turned out to be harder than expected. It seems the table existence checks and the table creation functionality are part of org.apache.spark.sql.catalog.Catalog. Do I have to hook into that somehow? And if so, how does a write call like this find my catalog?

    dataframe.write()
        .format("org.monetdb.spark")
        .mode(SaveMode.Overwrite)
        .option("url", url)
        .option("dbtable", "foo")
        .save()

The Catalog interface also contains lots of methods that I don't really understand. Do I have to implement all of them? Can someone give me an overview of the big picture?

Note: another approach would be not to implement a v2 DataSource at all, but to more or less "subclass" the v1 JDBC data source, as the now-abandoned SQL Server connector seems to do: https://github.com/microsoft/sql-spark-connector. Would that still be the way to go?

Best regards,
Joeri van Ruth
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
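P.S. For context, my current (possibly wrong) understanding of the v2 API is that Overwrite would hang off the WriteBuilder via SupportsTruncate, something like the simplified, untested sketch below. The MonetDB* class names are just placeholders of mine; only the org.apache.spark.sql.connector.write interfaces are real. What I can't see is where table creation fits into this picture:

```scala
// Untested sketch of how I imagine Overwrite support would look in DSv2.
// MonetDBWriteBuilder / MonetDBWrite are hypothetical names of mine;
// the imported interfaces are Spark's.
import org.apache.spark.sql.connector.write.{
  LogicalWriteInfo, SupportsTruncate, Write, WriteBuilder}

class MonetDBWriteBuilder(info: LogicalWriteInfo)
    extends WriteBuilder with SupportsTruncate {

  private var doTruncate = false

  // Presumably invoked by Spark for SaveMode.Overwrite:
  // empty the target table before the batch write runs.
  override def truncate(): WriteBuilder = {
    doTruncate = true
    this
  }

  // MonetDBWrite (not shown) would issue a TRUNCATE/DELETE when
  // doTruncate is set, then run the COPY BINARY INTO path as before.
  override def build(): Write = new MonetDBWrite(info, doTruncate)
}
```

If that's roughly right, it still only covers overwriting an *existing* table; creating a missing one seems to need a TableCatalog, which brings me back to my question above.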
