Yang Juan hu created SPARK-14955:
------------------------------------
Summary: JDBCRelation should report an IllegalArgumentException if
stride equals 0
Key: SPARK-14955
URL: https://issues.apache.org/jira/browse/SPARK-14955
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.6.1, 1.5.1
Reporter: Yang Juan hu
Priority: Minor
In the file
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
lines 56 and 57 contain the following:
val stride: Long = (partitioning.upperBound / numPartitions
- partitioning.lowerBound / numPartitions)
If we invoke columnPartition as below:
columnPartition( JDBCPartitioningInfo("partitionColumn", 0, 7, 8) );
then, because both divisions are integer divisions, stride is 7 / 8 - 0 / 8 = 0,
and columnPartition generates the following WHERE clauses:
whereClause: partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0 AND partitionColumn < 0
whereClause: partitionColumn >= 0
This causes data skew: the last partition contains all of the data.
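
The clauses above can be reproduced outside Spark with a small Python sketch. The function names (trunc_div, column_partition_clauses) are hypothetical, and the loop is a reconstruction chosen to match the output listed above, not a copy of the Spark source; only the stride formula is taken from the quoted lines.

```python
def trunc_div(a, b):
    """Integer division that truncates toward zero, as Scala's Long division does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def column_partition_clauses(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE clause per partition (illustrative reconstruction)."""
    # Same formula as the quoted Scala lines: two separate integer divisions.
    stride = (trunc_div(upper_bound, num_partitions)
              - trunc_div(lower_bound, num_partitions))
    clauses = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last one no upper limit.
        lower = "%s >= %d" % (column, current) if i != 0 else None
        current += stride
        upper = "%s < %d" % (column, current) if i != num_partitions - 1 else None
        clauses.append(" AND ".join(c for c in (lower, upper) if c is not None))
    return clauses


for clause in column_partition_clauses("partitionColumn", 0, 7, 8):
    print("whereClause: " + clause)
```

With bounds 0 and 7 and 8 partitions the stride is 0, so every boundary stays at 0 and only the last clause can match any rows.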
I propose throwing an exception if stride equals 0, so that Spark users become
aware of the data skew issue as early as possible.
if (stride == 0) throw new IllegalArgumentException(
  "partitioning.upperBound / numPartitions - partitioning.lowerBound / numPartitions is zero")
partitionColumn must be an integral type, so loading a big table from a DBMS
requires a workaround.
A real case: exporting data from an Oracle database through pyspark.
# Version with the data skew issue
df = ssc.read.format("jdbc").options(
    url=url,
    dbtable="( SELECT ORA_HASH(PART_COL,7) AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
    fetchSize="1000",
    partitionColumn="PART_ID",
    numPartitions="8",
    lowerBound="0",
    upperBound="7").load()
# Version without the data skew issue
df = ssc.read.format("jdbc").options(
    url=url,
    dbtable="( SELECT ORA_HASH(PART_COL,7)+1 AS PART_ID, A.* FROM DBMS_TAB A ) TAB_ALIAS",
    fetchSize="1000",
    partitionColumn="PART_ID",
    numPartitions="8",
    lowerBound="1",
    upperBound="8").load()
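
Until such a check exists in Spark, the same guard could be applied on the caller's side before the read is submitted. The following is only a sketch: check_jdbc_partitioning is a hypothetical helper, not a Spark or pyspark API, and it assumes non-negative bounds so that Python's floor division matches Scala's truncating division.

```python
def check_jdbc_partitioning(lower_bound, upper_bound, num_partitions):
    """Hypothetical pre-flight check: return the stride, or fail if it is zero."""
    if lower_bound < 0 or upper_bound < 0 or num_partitions <= 0:
        # Assumption of this sketch: negative bounds are out of scope, because
        # Python's // floors while Scala's Long division truncates toward zero.
        raise ValueError("sketch only covers non-negative bounds and numPartitions > 0")
    # Same formula as the quoted Scala lines.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    if stride == 0:
        raise ValueError(
            "upperBound / numPartitions - lowerBound / numPartitions is zero")
    return stride


# The two configurations from the pyspark examples above.
for lower, upper in [(0, 7), (1, 8)]:
    try:
        stride = check_jdbc_partitioning(lower, upper, 8)
        print("lowerBound=%d upperBound=%d stride=%d" % (lower, upper, stride))
    except ValueError as err:
        print("lowerBound=%d upperBound=%d rejected: %s" % (lower, upper, err))
```

The first configuration is rejected because its stride is 0, while the shifted one passes with a stride of 1, which matches the behaviour described for the two versions.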