[ 
https://issues.apache.org/jira/browse/SPARK-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-28613:
------------------------------
    Description: 
When we run the action DataFrame.collect(), the check against the configuration 
*spark.driver.maxResultSize* determines whether the returned data exceeds the 
limit using the size of the compressed byte array, which is not accurate.

When we fetch data through the Spark Thrift Server without incremental 
collection, it collects all of the DataFrame's data for each partition.
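As a sketch of the workaround, incremental collection can be turned on via the *spark.sql.thriftServer.incrementalCollect* flag so the Thrift Server streams results partition by partition instead of collecting the whole DataFrame at once (flag availability should be verified against the deployed Spark version):

```shell
# Assumed deployment: starting the Thrift Server from the Spark home directory.
# incrementalCollect streams one partition at a time instead of collecting
# the whole result set on the driver in one shot.
./sbin/start-thriftserver.sh \
  --conf spark.sql.thriftServer.incrementalCollect=true \
  --conf spark.driver.maxResultSize=2g
```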

The returned data goes through the following process:
 # compress the data's byte array
 # package it as a ResultSet
 # return it to the driver, where it is judged against *spark.driver.maxResultSize*
 # decode (uncompress) the data as Array[Row]

The amount of compressed data differs significantly from the amount of 
uncompressed data; the size can differ by more than ten times.
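The gap described above can be reproduced outside Spark. A minimal Python sketch (zlib stands in for Spark's default LZ4 codec; the payload is hypothetical) showing how badly the compressed size can understate the decoded size on repetitive rows:

```python
import zlib

# Highly repetitive rows, as produced by low-cardinality columns, compress
# extremely well, so the compressed byte array the driver measures can be
# an order of magnitude smaller than the data materialized after decoding.
rows = b"".join(b"row-%d,same_value,same_value\n" % (i % 10)
                for i in range(100_000))
compressed = zlib.compress(rows)

ratio = len(rows) / len(compressed)
print(f"uncompressed={len(rows)} compressed={len(compressed)} ratio={ratio:.1f}")
assert ratio > 10  # the "more than ten times" gap the ticket describes
```

If the driver only checks `len(compressed)` against *spark.driver.maxResultSize*, it can accept a result whose decoded `Array[Row]` is far larger than the configured limit.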

 

 

  was:When we run action DataFrame.collect() , for the configuration 
*spark.driver.maxResultSize*, when determining if the returned data exceeds the 
limit, it will use the compressed byte array's size, it is not 


> Spark SQL action collect just judge size of compressed RDD's size, not 
> accurate enough
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-28613
>                 URL: https://issues.apache.org/jira/browse/SPARK-28613
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: angerszhu
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
