Jaeboo Jung created SPARK-19623:
-----------------------------------
Summary: Take rows from DataFrame with empty first partition
Key: SPARK-19623
URL: https://issues.apache.org/jira/browse/SPARK-19623
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.6.2
Reporter: Jaeboo Jung
Priority: Minor
I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have partitions
having a empty first partition, DataFrame and its RDD have different behaviors
during taking rows from it. If we take only 1000 rows from DataFrame, it causes
OOME but RDD is OK.
In detail,
DataFrame without a empty first partition => OK
DataFrame with a empty first partition => OOME
RDD of DataFrame with a empty first partition => OK
Codes below reproduce this error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(1 to 100000000,1000).map(i =>
Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for(i <- 1 to 100) yield {
StructField("COL"+i,IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1)
Iterator[Row]() else iter)
val df1 = sqlContext.createDataFrame(rdd,schema)
df1.take(1000) // OK
val df2 = sqlContext.createDataFrame(rdd2,schema)
df2.rdd.take(1000) // OK
df2.take(1000) // OOME
{code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]