Never mind my comment about 3. You were talking about the read side, while I was thinking about the write side. Your workaround actually is a pretty good idea. Can you create a JIRA for that as well?
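For reference, here is roughly how I read that workaround with the 1.4 DataFrame API. This is only a sketch -- the url, table name, predicate expressions and the sqlContext in scope are placeholders, not your actual setup:

    import java.util.Properties
    import org.apache.spark.sql.DataFrame

    val url        = "jdbc:mysql://dbhost/dbname"   // placeholder connection string
    val props      = new Properties()
    // placeholder predicates splitting the table into 100 partitions
    val predicates = (0 until 100).map(i => s"id % 100 = $i").toArray

    // Read 5 partitions at a time and materialize each batch (cache + count)
    // so that at most 5 connections hit the database concurrently, then
    // union the per-batch DataFrames at the end.
    val parts: Seq[DataFrame] = predicates.grouped(5).map { batch =>
      val df = sqlContext.read.jdbc(url, "tablename", batch, props).cache()
      df.count()   // forces the fetch for this batch before the next one starts
      df
    }.toSeq
    val full: DataFrame = parts.reduce(_ unionAll _)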
On Monday, June 1, 2015, Reynold Xin <r...@databricks.com> wrote:
> René,
>
> Thanks for sharing your experience. Are you using the DataFrame API or SQL?
>
> (1) Any recommendations on what we do w.r.t. out of range values? Should
> we silently turn them into a null? Maybe based on an option?
>
> (2) Looks like a good idea to always quote column names. The small tricky
> thing is each database seems to have its own unique quotes. Do you mind
> filing a JIRA for this?
>
> (3) It is somewhat hard to do because that'd require changing Spark's task
> scheduler. The easiest way may be to coalesce it into a smaller number of
> partitions -- or we can coalesce for the user based on an option. Can you
> file a JIRA for this also?
>
> Thanks!
>
> On Mon, Jun 1, 2015 at 1:10 AM, René Treffer <rtref...@gmail.com> wrote:
>
>> Hi *,
>>
>> I used to run into a few problems with the jdbc/mysql integration and
>> thought it would be nice to load our whole db, doing nothing but
>> .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames.
>> SparkSQL has to load all columns and process them, so this should reveal
>> type errors like
>> SPARK-7897 Column with an unsigned bigint should be treated as
>> DecimalType in JDBCRDD <https://issues.apache.org/jira/browse/SPARK-7897>
>> SPARK-7697 <https://issues.apache.org/jira/browse/SPARK-7697> Column with
>> an unsigned int should be treated as long in JDBCRDD
>>
>> The test was done on the 1.4 branch (checkout 2-3 days ago, local build,
>> running standalone with a 350G heap).
>>
>> 1. Date/Timestamp 0000-00-00
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 15 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in
>> stage 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00
>> 00:00:00' can not be represented as java.sql.Timestamp
>> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>> 26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can
>> not be represented as java.sql.Date
>> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)
>>
>> This was the most common error when I tried to load tables with
>> Date/Timestamp types.
>> It can be worked around by subqueries or by specifying those types to be
>> string and handling them afterwards.
>>
>> 2. Keywords as column names fail
>>
>> SparkSQL does not enclose column names, e.g.
>> SELECT key,value FROM tablename
>> fails and should be
>> SELECT `key`,`value` FROM tablename
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>> 157.0 (TID 4322, localhost):
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
>> error in your SQL syntax; check the manual that corresponds to your MySQL
>> server version for the right syntax to use near 'key,value FROM [XXXXXX]'
>>
>> I'm not sure how to work around that issue except for manually writing
>> sub-queries (with all the performance problems that may cause).
>>
>> 3. Overloaded DB due to concurrency
>>
>> While providing where clauses works well to parallelize the fetch, it can
>> overload the DB, thus causing trouble (e.g. query/connection timeouts due
>> to an overloaded DB).
>> It would be nice to specify fetch parallelism independently of result
>> partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel).
>> This can be emulated by loading just n partitions at a time and doing a
>> union afterwards (grouped(5).map(...)...).
>>
>> Success
>>
>> I've successfully loaded 8'573'651'154 rows from 1667 tables (>93%
>> success rate). This is pretty awesome given that e.g. sqoop has failed
>> horribly on the same data.
>> Note that I didn't verify that the retrieved data is valid. I've only
>> checked for fetch errors so far. But given that Spark does not silence
>> any errors, I'm quite confident that the fetched data will be valid :-)
>>
>> Regards,
>> Rene Treffer
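To make the subquery workaround for (1) and (2) above concrete, here is a rough sketch using the jdbc data source's dbtable option -- the url, table and column names are made-up placeholders, not from your schema:

    // Quote the reserved column names and cast the zero-able timestamp to a
    // string on the MySQL side; Spark then only sees safe types and names.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost/dbname")   // placeholder url
      .option("dbtable",
        "(SELECT `key`, `value`, CAST(`created_at` AS CHAR) AS created_at" +
        " FROM tablename) AS t")
      .load()

For (1), MySQL Connector/J's zeroDateTimeBehavior=convertToNull URL parameter may also help, though I haven't tried it in this setup.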