[ 
https://issues.apache.org/jira/browse/SPARK-28707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906076#comment-16906076
 ] 

angerszhu commented on SPARK-28707:
-----------------------------------

When we use a LIMIT, the plan changes to a SinglePartition, which means the data 
comes back as one partition of 1000 rows. And when Spark collects data, each 
task's result is serialized and compressed on the executor and then sent to the 
driver, so the result size being measured is the compressed size (the codec is 
lz4), and with lz4, more repeated content gives a better compression ratio.
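
To make the compression point concrete, here is a small illustrative sketch (not 
Spark's actual collect path) using the lz4-java library that Spark bundles; the 
row contents are taken from the report below. It shows that highly repetitive 
rows compress very well, and that one block usually compresses to no more than 
the same bytes compressed as two separate halves.

{code:scala}
// Illustrative only: shows lz4's behaviour on repetitive rows, not Spark's
// internal result path. Assumes lz4-java (net.jpountz.lz4) on the classpath.
import net.jpountz.lz4.LZ4Factory

object Lz4RatioDemo {
  def main(args: Array[String]): Unit = {
    val compressor = LZ4Factory.fastestInstance().fastCompressor()

    // Rows shaped like the reporter's file: highly repetitive content.
    val rows  = (1000 until 2000).map(y => s"AUS,30.33,CRICKET,$y").mkString("\n")
    val bytes = rows.getBytes("UTF-8")

    // Compress everything as one block, then as two separate halves.
    val oneBlock  = compressor.compress(bytes).length
    val (a, b)    = bytes.splitAt(bytes.length / 2)
    val twoBlocks = compressor.compress(a).length + compressor.compress(b).length

    println(s"raw: ${bytes.length} B, one block: $oneBlock B, two blocks: $twoBlocks B")
  }
}
{code}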

And in your situation, select * from table runs two tasks, i.e. two partitions, 
while select * from table limit 1000 produces only one partition.

For the same data, one partition's compressed result is smaller than the total 
of two, and given how repetitive your data is, that explains the difference.
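
A quick way to see the partition difference from spark-shell (assuming the cons5 
table from the report below already exists):

{code:scala}
// In spark-shell: the plain scan keeps one partition per input split/task,
// while the LIMIT query is planned down to a single partition, so only one
// task's serialized result has to be sent back to the driver.
val full    = spark.sql("select * from cons5")
val limited = spark.sql("select * from cons5 limit 1000")

println(full.rdd.getNumPartitions)     // e.g. 2 -> two task results to fetch
println(limited.rdd.getNumPartitions)  // 1 -> a single task result to fetch
{code}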

 

In one of my cases, with data structured much like yours, the compressed result 
was about 900 MB; after it was returned to the driver, decompressed, and 
deserialized, it came to about 19 GB.
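
The failure itself comes from the driver summing the serialized task-result 
sizes and aborting once the total passes spark.driver.maxResultSize, which 
matches the error text in the report below. A simplified sketch of that 
behaviour (not the actual Spark scheduler code):

{code:scala}
// Simplified sketch of the driver-side check, not the real TaskSetManager:
// serialized task-result sizes are summed as they arrive, and the job is
// aborted once the running total exceeds spark.driver.maxResultSize.
class ResultSizeTracker(maxResultSize: Long) {
  private var totalResultSize = 0L
  private var calculatedTasks = 0

  def canFetchMoreResults(serializedSize: Long): Boolean = {
    totalResultSize += serializedSize
    calculatedTasks += 1
    if (maxResultSize > 0 && totalResultSize > maxResultSize) {
      println(s"Total size of serialized results of $calculatedTasks tasks " +
        s"($totalResultSize B) is bigger than spark.driver.maxResultSize ($maxResultSize B)")
      false // the real scheduler aborts the stage at this point
    } else {
      true
    }
  }
}
{code}

With a 5 KB limit, a single task result under the limit passes this check, while 
two results totalling 7.5 KB (as in the report) do not.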

> Spark SQL select query result size issue
> ----------------------------------------
>
>                 Key: SPARK-28707
>                 URL: https://issues.apache.org/jira/browse/SPARK-28707
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: jobit mathew
>            Priority: Major
>
> In Spark SQL, select * from table; fails the spark.driver.maxResultSize 
> validation, but select * from table limit 1000; passes on the same table data.
> *Test steps*
> spark.driver.maxResultSize=5120 in spark-default.conf
> 1. Create a table larger than 5 KB; in this example a 23 KB text file named 
> consecutive2.txt at local path /opt/jobit/consecutive2.txt
> AUS,30.33,CRICKET,1000
> AUS,30.33,CRICKET,1001
> --
> AUS,30.33,CRICKET,1999
> 2.launch spark-sql --master yarn
> 3.create table cons5(country String,avg float, sports String, year int) row 
> format delimited fields terminated by ',';
> 4.load data local inpath '/opt/jobit/consecutive2.txt' into table cons5;
> 5. select count(*) from cons5; returns 1000.
> 6. select * from cons5 *limit 1000*; runs and displays the 1000 rows. *No 
> error; the query executes successfully.*
> 7. select * from cons5;
> This fails with the error below.
> *ERROR*
> select * from cons5;
> *org.apache.spark.SparkException: Job aborted due to stage failure: Total 
> size of serialized results of 2 tasks (7.5 KB) is bigger than 
> spark.driver.maxResultSize (5.0 KB)*
>         at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
>         at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
>         at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
>         at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> *As per my observation, the limit query should also be validated against the 
> maxResultSize check if select * is.*



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
