Hi *,

I used to run into a few problems with the JDBC/MySQL integration, so I
thought it would be nice to try loading our whole DB, doing nothing but
.map(_ => 1).aggregate(0)(_+_,_+_) on the resulting DataFrames.
SparkSQL has to load and process all columns, so this should reveal type
errors like
SPARK-7897 Column with an unsigned bigint should be treated as DecimalType
in JDBCRDD <https://issues.apache.org/jira/browse/SPARK-7897>
SPARK-7697 Column with an unsigned int should be treated as long in JDBCRDD
<https://issues.apache.org/jira/browse/SPARK-7697>

The test was done on the 1.4 branch (checked out 2-3 days ago, local build,
running standalone with a 350G heap).
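
In case anyone wants to reproduce this, a minimal sketch of the per-table
check (connection URL and table name are placeholders, sc is the usual
SparkContext):

  import org.apache.spark.sql.SQLContext

  // Placeholder connection settings.
  val url = "jdbc:mysql://dbhost/dbname?user=xxx&password=xxx"
  val sqlContext = new SQLContext(sc)

  val df = sqlContext.read.format("jdbc").options(Map(
    "url"     -> url,
    "dbtable" -> "tablename"
  )).load()

  // Mapping every row to 1 forces SparkSQL to fetch and convert every
  // column, so any JDBC type conversion error surfaces here.
  val rows = df.map(_ => 1).aggregate(0)(_ + _, _ + _)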

1. Date/Timestamp 0000-00-00

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage
18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00
00:00:00' can not be represented as java.sql.Timestamp
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage
26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can
not be represented as java.sql.Date
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)

This was the most common error when I tried to load tables with
Date/Timestamp columns.
It can be worked around by sub-queries, or by specifying those types as
string and handling them afterwards.
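
The sub-query variant looks roughly like this (table/column names made up):

  // Cast the Date/Timestamp columns to CHAR inside a sub-query so the
  // MySQL driver returns plain strings instead of failing on '0000-00-00'.
  // url and sqlContext as in the sketch above.
  val df = sqlContext.read.jdbc(
    url,
    "(SELECT id, CAST(created_at AS CHAR) AS created_at FROM tablename) AS t",
    new java.util.Properties())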

2. Keywords as column names fail

SparkSQL does not quote column names, so e.g.
SELECT key,value FROM tablename
fails and should be
SELECT `key`,`value` FROM tablename

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage
157.0 (TID 4322, localhost):
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
error in your SQL syntax; check the manual that corresponds to your MySQL
server version for the right syntax to use near 'key,value FROM [XXXXXX]'

I'm not sure how to work around that issue except by manually writing
sub-queries (with all the performance problems this may cause).
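
The manual version looks something like this (names made up). Note the
aliases: SparkSQL re-selects the columns by name from the sub-query, so the
reserved names have to be aliased away, not just quoted:

  // Alias the reserved column names inside a sub-query; the outer SELECT
  // that SparkSQL generates then only references the harmless aliases.
  // url and sqlContext as in the first sketch.
  val df = sqlContext.read.jdbc(
    url,
    "(SELECT `key` AS k, `value` AS v FROM tablename) AS t",
    new java.util.Properties())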

3. Overloaded DB due to concurrency

While providing WHERE clauses works well to parallelize the fetch, it can
overload the DB and cause trouble (e.g. query/connection timeouts).
It would be nice to specify fetch parallelism independently of the result
partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel).
This can be emulated by loading just n partitions at a time and doing a
union afterwards (grouped(5).map(...)).
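
A rough sketch of that emulation (the predicate scheme is just an example):

  // 100 WHERE-clause predicates -> 100 partitions, fetched 5 at a time.
  // url and sqlContext as in the first sketch.
  val predicates = (0 until 100).map(i => s"id % 100 = $i")
  val parts = predicates.grouped(5).map { group =>
    val part = sqlContext.read.jdbc(url, "tablename",
      group.toArray, new java.util.Properties())
    part.persist()
    part.count() // force the fetch now, with at most 5 open connections
    part
  }.toList
  val df = parts.reduce(_ unionAll _)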

Success

I've successfully loaded 8'573'651'154 rows from 1667 tables (>93% success
rate). This is pretty awesome given that e.g. sqoop has failed horribly on
the same data.
Note that I did not verify that the retrieved data is valid; I've only
checked for fetch errors so far. But given that Spark does not silence any
errors, I'm quite confident that the fetched data will be valid :-)

Regards,
  Rene Treffer
