[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649386#comment-15649386 ]

mingjie tang commented on SPARK-18372:
--------------------------------------

The PR is https://github.com/apache/spark/pull/15819

> Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: spark standalone and spark yarn 
>            Reporter: mingjie tang
>             Fix For: 2.0.1
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell 
> 2. Run the following Scala code via spark-shell:
> scala> val hivesampletabledf = sqlContext.table("hivesampletable")
> scala> import org.apache.spark.sql.DataFrameWriter
> scala> val dfw: DataFrameWriter = hivesampletabledf.write
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
> scala> dfw.insertInto("hivesampletablecopypy")
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15""")
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), we can see the following folders:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders don't get cleaned up and keep accumulating.
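> To make the accumulation easy to verify, here is a minimal sketch (ours, not from the original report) that lists the leftover staging directories from the same spark-shell; the warehouse path is the one shown above:
> import org.apache.hadoop.fs.{FileSystem, Path}
> // List any leftover .hive-staging_* directories under the table folder.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> fs.listStatus(new Path("hive/warehouse/hivesampletablecopypy"))
>   .filter(_.getPath.getName.startsWith(".hive-staging"))
>   .foreach(s => println(s.getPath))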
> =====
> With the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - it didn't make any difference.
> The .hive-staging folders are created under the <TableName> folder - hive/warehouse/hivesampletablecopypy/
> We have also tried adding this property to hive-site.xml and restarting the components:
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> A new .hive-staging folder was still created in the hive/warehouse/<tablename> folder.
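> For completeness, the same property can also be set per session from Spark instead of hive-site.xml (an illustrative sketch, not something from the original report; it would only move where the staging directories are created, not make Spark delete them):
> // Set hive.exec.stagingdir for the current HiveContext session only.
> sqlContext.setConf("hive.exec.stagingdir", "/tmp/hive/.staging")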
> Moreover, if we run the same query in pure Hive via the Hive CLI on the same Spark cluster, we don't see this behavior - so it doesn't appear to be a Hive issue; this is Spark behavior.
> I checked in Ambari: spark.yarn.preserve.staging.files=false is already set in the Spark configuration.
> The issue happens via spark-submit as well - the customer used the following command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
> Solution: 
> This bug is reported by customers.
> The root cause is that org.apache.spark.sql.hive.execution.InsertIntoHiveTable calls the Hive classes (org.apache.hadoop.hive.*) to create the staging directory. By default, on the Hive side, this staging directory is removed once the Hive session expires; however, Spark never notifies Hive to remove the staging files.
> Thus, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable that creates the .hive-staging directory itself, so that the directory is removed once the Spark session ends.
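> As a rough sketch of that approach (modeled on the Spark 2.0.x InsertIntoHiveTable code; the names below are illustrative, not the exact patch), the staging directory is created by Spark itself and registered with the FileSystem for deletion on exit:
> import java.io.IOException
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> // Create a ".hive-staging_..." directory under the table path and schedule
> // it for deletion when the Spark application's JVM shuts down, so the
> // directories no longer accumulate.
> def createStagingDir(tablePath: Path, hadoopConf: Configuration): Path = {
>   val fs = tablePath.getFileSystem(hadoopConf)
>   val stagingDir = new Path(tablePath, ".hive-staging_" + System.nanoTime())
>   if (!fs.mkdirs(stagingDir)) {
>     throw new IOException("Cannot create staging directory: " + stagingDir)
>   }
>   fs.deleteOnExit(stagingDir) // cleaned up on FileSystem close / JVM exit
>   stagingDir
> }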
> This update is tested against Spark 1.5.2 and Spark 1.6.3, and the pull request is: https://github.com/apache/spark/pull/15819
> For the test, I have manually checked that no .hive-staging files remain under the table's directory after the spark-shell closes. Meanwhile, please advise how to write the test case, because the test cannot get at the directory for the related tables.


