Never mind my comment about 3. You were talking about the read side, while I was thinking about the write side. Your workaround actually is a pretty good idea. Can you create a JIRA for that as well?
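For reference, here is roughly how I read that workaround with the 1.4 DataFrame API. This is only a sketch -- the url, table name, predicate expressions and the sqlContext in scope are placeholders, not your actual setup:

    import java.util.Properties
    import org.apache.spark.sql.DataFrame

    val url        = "jdbc:mysql://dbhost/dbname"   // placeholder connection string
    val props      = new Properties()
    // placeholder predicates splitting the table into 100 partitions
    val predicates = (0 until 100).map(i => s"id % 100 = $i").toArray

    // Read 5 partitions at a time and materialize each batch (cache + count)
    // so that at most 5 connections hit the database concurrently, then
    // union the per-batch DataFrames at the end.
    val parts: Seq[DataFrame] = predicates.grouped(5).map { batch =>
      val df = sqlContext.read.jdbc(url, "tablename", batch, props).cache()
      df.count()   // forces the fetch for this batch before the next one starts
      df
    }.toSeq
    val full: DataFrame = parts.reduce(_ unionAll _)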
On Monday, June 1, 2015, Reynold Xin <r...@databricks.com> wrote:
> René,
>
> Thanks for sharing your experience. Are you using the DataFrame API or SQL?
>
> (1) Any recommendations on what we do w.r.t. out of range values? Should
> we silently turn them into a null? Maybe based on an option?
>
> (2) Looks like a good idea to always quote column names. The small tricky
> thing is each database seems to have its own unique quotes. Do you mind
> filing a JIRA for this?
>
> (3) It is somewhat hard to do because that'd require changing Spark's task
> scheduler. The easiest way may be to coalesce it into a smaller number of
> partitions -- or we can coalesce for the user based on an option. Can you
> file a JIRA for this also?
>
> Thanks!
>
> On Mon, Jun 1, 2015 at 1:10 AM, René Treffer <rtref...@gmail.com> wrote:
>
>> Hi *,
>>
>> I used to run into a few problems with the jdbc/mysql integration and
>> thought it would be nice to load our whole db, doing nothing but
>> .map(_ => 1).aggregate(0)(_+_,_+_) on the DataFrames.
>> SparkSQL has to load all columns and process them, so this should reveal
>> type errors like
>> SPARK-7897 Column with an unsigned bigint should be treated as
>> DecimalType in JDBCRDD <https://issues.apache.org/jira/browse/SPARK-7897>
>> SPARK-7697 <https://issues.apache.org/jira/browse/SPARK-7697> Column with
>> an unsigned int should be treated as long in JDBCRDD
>>
>> The test was done on the 1.4 branch (checkout 2-3 days ago, local build,
>> running standalone with a 350G heap).
>>
>> 1. Date/Timestamp 0000-00-00
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 15 in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in
>> stage 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00
>> 00:00:00' can not be represented as java.sql.Timestamp
>> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>> 26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can
>> not be represented as java.sql.Date
>> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)
>>
>> This was the most common error when I tried to load tables with
>> Date/Timestamp types.
>> It can be worked around by subqueries or by specifying those types to be
>> string and handling them afterwards.
>>
>> 2. Keywords as column names fail
>>
>> SparkSQL does not enclose column names, e.g.
>> SELECT key,value FROM tablename
>> fails and should be
>> SELECT `key`,`value` FROM tablename
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>> 157.0 (TID 4322, localhost):
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
>> error in your SQL syntax; check the manual that corresponds to your MySQL
>> server version for the right syntax to use near 'key,value FROM [XXXXXX]'
>>
>> I'm not sure how to work around that issue except for manually writing
>> sub-queries (with all the performance problems that may cause).
>>
>> 3. Overloaded DB due to concurrency
>>
>> While providing where clauses works well to parallelize the fetch, it can
>> overload the DB, thus causing trouble (e.g. query/connection timeouts due
>> to an overloaded DB).
>> It would be nice to specify fetch parallelism independently of result
>> partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel).
>> This can be emulated by loading just n partitions at a time and doing a
>> union afterwards (grouped(5).map(...)...).
>>
>> Success
>>
>> I've successfully loaded 8'573'651'154 rows from 1667 tables (>93%
>> success rate). This is pretty awesome given that e.g. sqoop has failed
>> horribly on the same data.
>> Note that I didn't verify that the retrieved data is valid. I've only
>> checked for fetch errors so far. But given that Spark does not silence
>> any errors, I'm quite confident that the fetched data will be valid :-)
>>
>> Regards,
>> Rene Treffer
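To make the subquery workaround for (1) and (2) above concrete, here is a rough sketch using the jdbc data source's dbtable option -- the url, table and column names are made-up placeholders, not from your schema:

    // Quote the reserved column names and cast the zero-able timestamp to a
    // string on the MySQL side; Spark then only sees safe types and names.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost/dbname")   // placeholder url
      .option("dbtable",
        "(SELECT `key`, `value`, CAST(`created_at` AS CHAR) AS created_at" +
        " FROM tablename) AS t")
      .load()

For (1), MySQL Connector/J's zeroDateTimeBehavior=convertToNull URL parameter may also help, though I haven't tried it in this setup.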