senthh commented on pull request #33577:
URL: https://github.com/apache/spark/pull/33577#issuecomment-890284584


   > @senthh could you explain the motivation in this PR in the "Why are the 
changes needed?" section? IIRC Hive also stores staging directory under 
database directory so I'm wondering why we need the change here.
   
   @sunchao Hive creates .staging directories inside the "/db/table/" location, but spark-sql creates .staging directories inside the "/db/" location when we use Hadoop federation (viewFs). It works as expected (creating .staging inside the /db/table/ location) for other filesystems such as HDFS.
   
   HIVE:
   {{
   # beeline
   > use dicedb;
   > insert into table part_test partition (j=1) values (1);
   ...
   INFO  : Loading data to table dicedb.part_test partition (j=1) from 
**viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000**
 
   
   }}
   
   Spark's behaviour, however:
   
   {{
   spark-sql> use dicedb;
   spark-sql> insert into table part_test partition (j=2) values (2);
   21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: 
**viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
   ... 
   }}
   
   
   The reason we require this change is that if we allow spark-sql to create the .staging directory inside the /db/ location, we end up with security issues: we would have to grant permission on the "viewfs:///db/" location to every user who submits Spark jobs.
   
   After this change is applied, spark-sql creates .staging inside /db/table/, similar to Hive, as below:
   
   {{
   spark-sql> use dicedb;
   21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
   spark-sql> insert into table part_test partition (j=8) values (8);
   21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to 
metastore, current connections: 1
   21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: 
**viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
 
   }} 
   
   The reason we don't see this issue in Hive, and why it occurs only in spark-sql:
   
   In Hive, a "/db/table/tmp" directory structure is passed as the path, and hence path.getParent returns "/db/table/". But in Spark we pass just "/db/table", so using "path.getParent" is not appropriate under Hadoop federation (viewFs).
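   The getParent behaviour described above can be sketched as follows. This is only an illustration using java.nio.file.Paths as a stand-in for Hadoop's org.apache.hadoop.fs.Path (both resolve simple slash-separated paths the same way); the paths are taken from the example logs above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustration (not Spark's actual code): why getParent gives different
// staging locations depending on what path is passed in.
public class StagingDirParent {
    public static void main(String[] args) {
        // Hive passes a "/db/table/tmp"-style path, so getParent lands on
        // the table directory:
        Path hivePath = Paths.get("/user/daisuke/dicedb/part_test/tmp");
        System.out.println(hivePath.getParent()); // /user/daisuke/dicedb/part_test

        // Spark passes the table location itself, so getParent climbs up to
        // the database directory, which is where the misplaced .hive-staging
        // directory ends up under viewFs:
        Path sparkPath = Paths.get("/user/daisuke/dicedb/part_test");
        System.out.println(sparkPath.getParent()); // /user/daisuke/dicedb
    }
}
```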
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


