Thanks, René. I actually added a warning to the new JDBC reader/writer
interface for 1.4.0.

Even with that, I think we should support throttling JDBC; otherwise it's
too easy for our users to DoS their production database servers!
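
For comparison, here is a rough sketch of the chunk-and-union workaround René
suggests below. This isn't built-in throttling, just a way to cap how many
partitions hit the database at once; the helper name, maxConcurrent, and the
forced count() per batch are my own illustration, not an existing API:

  import java.util.Properties
  import org.apache.spark.sql.{DataFrame, SQLContext}

  // Hypothetical helper: load a JDBC table in batches of predicates so that
  // at most `maxConcurrent` partitions query the database simultaneously.
  def jdbcInChunks(
      sqlContext: SQLContext,
      url: String,
      table: String,
      predicates: Array[String],
      maxConcurrent: Int,
      props: Properties): DataFrame = {
    predicates
      .grouped(maxConcurrent)
      .map { chunk =>
        val df = sqlContext.read.jdbc(url, table, chunk, props)
        df.cache()  // materialize this batch before the next one starts,
        df.count()  // so only `maxConcurrent` connections are open at a time
        df
      }
      .reduce(_ unionAll _)
  }

Note the count() per batch: union alone is lazy, so without forcing each
chunk, all partitions would still run in parallel at the first action.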


  /**
   * Construct a [[DataFrame]] representing the database table accessible via JDBC URL
   * url named table. Partitions of the table will be retrieved in parallel based on the
   * parameters passed to this function.
   *
   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might
   * crash your external database systems.
   *
   * @param url JDBC database url of the form `jdbc:subprotocol:subname`
   * @param table Name of the table in the external database.
   * @param columnName the name of a column of integral type that will be used for partitioning.
   * @param lowerBound the minimum value of `columnName` used to decide partition stride
   * @param upperBound the maximum value of `columnName` used to decide partition stride
   * @param numPartitions the number of partitions. The range `lowerBound`-`upperBound` will be
   *                      split evenly into this many partitions.
   * @param connectionProperties JDBC database connection arguments, a list of arbitrary string
   *                             tag/value pairs. Normally at least a "user" and "password"
   *                             property should be included.
   *
   * @since 1.4.0
   */
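
For context, a minimal call of this partitioned variant might look like the
following (the connection details are invented for illustration):

  import java.util.Properties

  val props = new Properties()
  props.setProperty("user", "spark")      // placeholder credentials
  props.setProperty("password", "secret")

  // Splits `id` in [1, 1000000] into 8 partitions, i.e. 8 parallel queries.
  val df = sqlContext.read.jdbc(
    "jdbc:postgresql://db.example.com/shop", "orders",
    "id", 1L, 1000000L, 8, props)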


On Mon, Jun 1, 2015 at 1:54 AM, René Treffer <rtref...@gmail.com> wrote:

> Hi,
>
> I'm using sqlContext.jdbc(uri, table, where).map(_ =>
> 1).aggregate(0)(_+_,_+_) in an interactive shell, where "where" is an
> Array[String] of 32 to 48 elements. (The code is tailored to our db,
> specifically through the where conditions; otherwise I'd have posted it.)
> That should be the DataFrame API, but I'm just trying to load everything
> and discard it as soon as possible :-)
>
> (1) Never silently drop values by default: it kills confidence.
> An option sounds reasonable. Some sort of insight / log would be great.
> (How many columns of what type were truncated, and why?)
> Note that I could declare the field as string via JdbcDialects (thank you
> guys for merging that :-) ).
> I have had quite bad experiences with silent drops / truncates of columns
> and thus _like_ the strict way of Spark. It causes trouble, but noticing
> later that your data was corrupted during conversion is even worse.
>
> (2) SPARK-8004 https://issues.apache.org/jira/browse/SPARK-8004
>
> (3) One option would be to make it safe to use; the other would be to
> document the behavior (something like "WARNING: this method tries to load
> as many partitions as possible, make sure your database can handle the
> load, or load them in chunks and use union"). SPARK-8008
> https://issues.apache.org/jira/browse/SPARK-8008
>
> Regards,
>   Rene Treffer
>
