Hi *,

I ran into a few problems with the JDBC/MySQL integration and thought it would be nice to load our whole DB, doing nothing but .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames. SparkSQL has to load and process all columns for this, so it should reveal type errors like:

SPARK-7897 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
<https://issues.apache.org/jira/browse/SPARK-7897>
SPARK-7697 Column with an unsigned int should be treated as long in JDBCRDD
<https://issues.apache.org/jira/browse/SPARK-7697>
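The check above is just a row count that forces every column to be fetched and deserialized. A minimal sketch of the same map-to-one-and-sum shape on a plain Scala collection (the DataFrame version uses aggregate(0)(_+_,_+_), whose seqOp and combOp are both addition; the rows here are made up):

```scala
object FullScanCheck {
  // Map every row to 1, then sum: the result is the row count, but every
  // row value had to be materialized to get there.
  def countRows[A](rows: Seq[A]): Int =
    rows.map(_ => 1).fold(0)(_ + _)

  def main(args: Array[String]): Unit = {
    val rows = Seq(("a", 1), ("b", 2), ("c", 3)) // stand-in for fetched rows
    println(countRows(rows)) // prints 3
  }
}
```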
The test was done on the 1.4 branch (checked out 2-3 days ago, local build, running standalone with a 350G heap).

1. Date/Timestamp 0000-00-00

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00 00:00:00' can not be represented as java.sql.Timestamp
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can not be represented as java.sql.Date
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

This was the most common error when I tried to load tables with Date/Timestamp types. It can be worked around by sub-queries or by specifying those columns as string and handling them afterwards.

2. Keywords as column names fail

SparkSQL does not quote column names, e.g.

  SELECT key,value FROM tablename

fails and should be

  SELECT `key`,`value` FROM tablename

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XXXXXX]'

I'm not sure how to work around that issue except by manually writing sub-queries (with all the performance problems that may cause).

3. Overloaded DB due to concurrency

While providing WHERE clauses works well to parallelize the fetch, it can overload the DB and cause trouble (e.g. query/connection timeouts). It would be nice to specify fetch parallelism independently of the number of result partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel). This can be emulated by loading just n partitions at a time and doing a union afterwards (grouped(5).map(...)...).

Success

I've successfully loaded 8'573'651'154 rows from 1667 tables (>93% success rate). This is pretty awesome, given that e.g. sqoop has failed horribly on the same data.

Note that I didn't verify that the retrieved data is valid; I've only checked for fetch errors so far. But given that Spark does not silence any errors, I'm quite confident that the fetched data will be valid :-)

Regards,
Rene Treffer
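P.S. A few sketches of the workarounds mentioned above. For the zero-date issue (1), MySQL Connector/J has a zeroDateTimeBehavior connection property; with convertToNull, '0000-00-00' values come back as NULL instead of throwing. A minimal sketch that just appends the property to the JDBC URL (host/db names are made up; the result is what you'd pass as the url of the JDBC data source):

```scala
object ZeroDateUrl {
  // Append zeroDateTimeBehavior=convertToNull to a JDBC URL, using '?' or
  // '&' depending on whether the URL already has query parameters.
  def withZeroDatesAsNull(jdbcUrl: String): String = {
    val sep = if (jdbcUrl.contains("?")) "&" else "?"
    s"$jdbcUrl${sep}zeroDateTimeBehavior=convertToNull"
  }

  def main(args: Array[String]): Unit = {
    println(withZeroDatesAsNull("jdbc:mysql://dbhost:3306/mydb"))
    // prints jdbc:mysql://dbhost:3306/mydb?zeroDateTimeBehavior=convertToNull
  }
}
```

Note this trades the exception for NULLs, so downstream code must be prepared for missing dates.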
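For the keyword-column issue (2), the manual sub-query workaround means handing SparkSQL an aliased sub-query with back-quoted identifiers as the table, so it never generates the unquoted SELECT itself. A sketch of building such a sub-query string (table/column names and the _sub alias are made up):

```scala
object QuotedSubquery {
  // Build "(SELECT `c1`, `c2` FROM `table`) AS table_sub", which can be
  // used in place of a plain table name.
  def quoted(table: String, columns: Seq[String]): String = {
    val cols = columns.map(c => s"`$c`").mkString(", ")
    s"(SELECT $cols FROM `$table`) AS ${table}_sub"
  }

  def main(args: Array[String]): Unit = {
    println(quoted("tablename", Seq("key", "value")))
    // prints (SELECT `key`, `value` FROM `tablename`) AS tablename_sub
  }
}
```

The performance concern is that MySQL may materialize the derived table instead of pushing predicates into it.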
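And for the throttling emulation in (3), the grouped(5) trick looks roughly like this — fetch at most batchSize partitions per round, union each batch, then union the batches. Here `fetch` is a stand-in for the real per-partition JDBC load (in Spark you'd union and force each batch of DataFrames before starting the next, so only batchSize queries hit the DB at once):

```scala
object ThrottledFetch {
  // partitions: one entry per WHERE-clause partition; fetch: loads one
  // partition's rows. At most batchSize fetches are "in flight" per round.
  def fetchAll[A](partitions: Seq[Int], batchSize: Int)(fetch: Int => Seq[A]): Seq[A] =
    partitions
      .grouped(batchSize)                 // batches of at most batchSize partitions
      .map(batch => batch.flatMap(fetch)) // load one batch, union its results
      .foldLeft(Seq.empty[A])(_ ++ _)     // union across batches

  def main(args: Array[String]): Unit = {
    val rows = fetchAll((0 until 100).toSeq, 5)(p => Seq(s"row-of-$p"))
    println(rows.size) // prints 100
  }
}
```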