[
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-18372:
------------------------------
Fix Version/s: (was: 2.0.2)
> .Hive-staging folders created from Spark hiveContext are not getting cleaned
> up
> -------------------------------------------------------------------------------
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn
> Reporter: mingjie tang
> Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell
> 2. Run the following Scala code in spark-shell:
> scala> val hivesampletabledf = sqlContext.table("hivesampletable")
> scala> import org.apache.spark.sql.DataFrameWriter
> scala> val dfw : DataFrameWriter = hivesampletabledf.write
> scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS hivesampletablecopypy (
> clientid string, querytime string, market string, deviceplatform string,
> devicemake string, devicemodel string, state string, country string,
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )""")
> scala> dfw.insertInto("hivesampletablecopypy")
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid,
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), the following folders appear:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
>
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders are never cleaned up, so they accumulate.
> =====
> Working with the customer, we tried setting "SET
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml. It did not make any
> difference.
> .hive-staging folders are created under the <TableName> folder:
> hive/warehouse/hivesampletablecopypy/
> We also tried adding this property to hive-site.xml and restarting the
> components:
> <property>
> <name>hive.exec.stagingdir</name>
> <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> A new .hive-staging folder was still created in the hive/warehouse/<tablename>
> folder.
> Moreover, if we run the Hive query in pure Hive via the Hive CLI on the same
> Spark cluster, we do not see this behavior, so it does not appear to be a
> Hive issue in this case; it is Spark behavior.
> I checked in Ambari: spark.yarn.preserve.staging.files=false is already set
> in the Spark configuration.
> The issue also happens via spark-submit. The customer used the following
> command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
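
To gauge how many staging folders have piled up, one option is to filter a directory listing for the `.hive-staging_hive_` name prefix shown in the paths above. The following is only a minimal sketch, not part of the report and not a fix for the bug: the helper name `staging_roots` is hypothetical, the `part-00000` entry is made up for illustration, and the code assumes plain slash-separated paths without a URI scheme.

```python
from pathlib import PurePosixPath

# Name prefix of the staging folders listed in the report.
STAGING_PREFIX = ".hive-staging_hive_"


def staging_roots(paths):
    """Return the distinct top-level staging folders found in `paths`.

    Hypothetical helper: `paths` is any iterable of plain slash-separated
    path strings (no URI scheme). Results keep first-seen order.
    """
    roots = []
    seen = set()
    for raw in paths:
        parts = PurePosixPath(raw).parts
        for i, part in enumerate(parts):
            if part.startswith(STAGING_PREFIX):
                # Cut the path at the staging folder itself, dropping
                # nested entries such as -ext-10000.
                root = str(PurePosixPath(*parts[: i + 1]))
                if root not in seen:
                    seen.add(root)
                    roots.append(root)
                break
    return roots


# Two paths copied from the report plus one made-up data file.
listing = [
    "hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666",
    "hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000",
    "hive/warehouse/hivesampletablecopypy/part-00000",  # hypothetical entry
]

for root in staging_roots(listing):
    print(root)
```

The helper only classifies path strings. Obtaining the listing from HDFS or WASB, and deciding whether a folder is safe to delete, are left out on purpose, since a staging folder may belong to a job that is still running.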
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)