Thanks, René. I actually added a warning to the new JDBC reader/writer interface for 1.4.0.
Even with that, I think we should support throttling JDBC; otherwise it's too
convenient for our users to DoS their production database servers!

/**
 * Construct a [[DataFrame]] representing the database table accessible via JDBC URL
 * url named table. Partitions of the table will be retrieved in parallel based on the
 * parameters passed to this function.
 *
 * Don't create too many partitions in parallel on a large cluster; otherwise Spark
 * might crash your external database systems.
 *
 * @param url JDBC database url of the form `jdbc:subprotocol:subname`
 * @param table Name of the table in the external database.
 * @param columnName the name of a column of integral type that will be used for
 *                   partitioning.
 * @param lowerBound the minimum value of `columnName` used to decide partition stride
 * @param upperBound the maximum value of `columnName` used to decide partition stride
 * @param numPartitions the number of partitions. The range `lowerBound`-`upperBound`
 *                      will be split evenly into this many partitions.
 * @param connectionProperties JDBC database connection arguments, a list of arbitrary
 *                             string tag/value pairs. Normally at least a "user" and
 *                             "password" property should be included.
 *
 * @since 1.4.0
 */

On Mon, Jun 1, 2015 at 1:54 AM, René Treffer <rtref...@gmail.com> wrote:
> Hi,
>
> I'm using sqlContext.jdbc(uri, table, where).map(_ =>
> 1).aggregate(0)(_+_,_+_) on an interactive shell (where "where" is an
> Array[String] of 32 to 48 elements). (The code is tailored to our db,
> specifically through the where conditions; I'd have otherwise posted it.)
> That should be the DataFrame API, but I'm just trying to load everything
> and discard it as soon as possible :-)
>
> (1) Never do a silent drop of the values by default: it kills confidence.
> An option sounds reasonable. Some sort of insight / log would be great.
> (How many columns of what type were truncated? Why?)
> Note that I could declare the field as string via JdbcDialects (thank you
> guys for merging that :-) ).
> I have quite bad experiences with silent drops / truncates of columns and
> thus _like_ the strict way of Spark. It causes trouble, but noticing later
> that your data was corrupted during conversion is even worse.
>
> (2) SPARK-8004 https://issues.apache.org/jira/browse/SPARK-8004
>
> (3) One option would be to make it safe to use; the other option would be
> to document the behavior (something like "WARNING: this method tries to load
> as many partitions as possible, make sure your database can handle the load
> or load them in chunks and use union"). SPARK-8008
> https://issues.apache.org/jira/browse/SPARK-8008
>
> Regards,
> Rene Treffer
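For anyone skimming the thread, here is roughly how the partitioned read
documented above is invoked in 1.4. The URL, table name, credentials, and
bounds below are made-up placeholders:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")       // placeholder credentials
props.setProperty("password", "secret")

// 16 partitions means up to 16 concurrent connections against the
// database while the read runs, each scanning one stride of `id`.
val df = sqlContext.read.jdbc(
  url = "jdbc:postgresql://dbhost/dbname",  // placeholder URL
  table = "events",                         // placeholder table
  columnName = "id",
  lowerBound = 0L,
  upperBound = 1000000L,
  numPartitions = 16,
  connectionProperties = props)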
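On (1): the JdbcDialects hook is exactly the right escape hatch for forcing a
problematic column to load as a string. A minimal sketch, assuming the
offending type is MySQL's BIGINT UNSIGNED (the dialect object and the type
check are illustrative, not shipped code):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

object UnsignedAsStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:mysql")

  // Map the troublesome type to StringType instead of letting the
  // default mapping truncate or fail; return None to fall through
  // to the built-in mapping for everything else.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int,
      md: MetadataBuilder): Option[DataType] = {
    if (typeName.equalsIgnoreCase("BIGINT UNSIGNED")) Some(StringType)
    else None
  }
}

JdbcDialects.registerDialect(UnsignedAsStringDialect)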
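On (3): until real throttling lands, the "load them in chunks and use union"
workaround from SPARK-8008 can be sketched as follows (URL, table, predicates,
and chunk size are all made up). One caveat: lazily unioning the chunks would
not throttle anything, since every partition would still be read in parallel
once an action runs; each chunk needs its own action (or a persist) so the
next group starts only after the previous one finishes.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// Stand-in for René's Array[String] of 32-48 WHERE predicates.
val predicates: Array[String] =
  (0 until 48).map(i => s"bucket = $i").toArray

// At most 8 partitions (= 8 database connections) in flight per job;
// count() is an action, so each group finishes before the next reads.
val total = predicates.grouped(8).map { preds =>
  sqlContext.read.jdbc(
    "jdbc:postgresql://dbhost/dbname", "events", preds, props).count()
}.sum

println(s"rows: $total")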