[ https://issues.apache.org/jira/browse/HIVE-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17907821#comment-17907821 ]

Ayush Saxena commented on HIVE-27970:
-------------------------------------

Regarding the failure with hive.metastore.dml.events=false:

I think the problem lies here:
[https://github.com/apache/hive/blob/20d26ad269af3c281f845df76d3b8d260cabc904/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L3913]

It creates a single FileSystem from the Table location and then reuses it for 
all the paths, assuming every file path provided lives on the same 
FileSystem, which isn't the case here.

Something like this should fix it, I believe:
{noformat}
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java b/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
index f447aacdf7..59c6286fcd 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
@@ -3844,13 +3844,12 @@ public static void addWriteNotificationLog(HiveConf conf, Table tbl, List<String
                                              Long txnId, Long writeId, List<FileStatus> newFiles,
                                              List<WriteNotificationLogRequest> requestList)
           throws IOException, HiveException, TException {
-    FileSystem fileSystem = tbl.getDataLocation().getFileSystem(conf);
     InsertEventRequestData insertData = new InsertEventRequestData();
     insertData.setReplace(true);
 
     WriteNotificationLogRequest rqst = new WriteNotificationLogRequest(txnId, writeId,
             tbl.getDbName(), tbl.getTableName(), insertData);
-    addInsertFileInformation(newFiles, fileSystem, insertData);
+    addInsertFileInformation(newFiles, conf, insertData);
     rqst.setPartitionVals(partitionVals);
 
     if (requestList == null) {
@@ -3910,13 +3909,12 @@ private void fireInsertEvent(Table tbl, Map<String, String> partitionSpec, boole
         return;
       }
       try {
-        FileSystem fileSystem = tbl.getDataLocation().getFileSystem(conf);
         FireEventRequestData data = new FireEventRequestData();
         InsertEventRequestData insertData = new InsertEventRequestData();
         insertData.setReplace(replace);
         data.setInsertData(insertData);
         if (newFiles != null && !newFiles.isEmpty()) {
-          addInsertFileInformation(newFiles, fileSystem, insertData);
+          addInsertFileInformation(newFiles, conf, insertData);
         } else {
           insertData.setFilesAdded(new ArrayList<String>());
         }
@@ -3938,7 +3936,7 @@ private void fireInsertEvent(Table tbl, Map<String, String> partitionSpec, boole
   }
 
 
-  private static void addInsertFileInformation(List<FileStatus> newFiles, FileSystem fileSystem,
+  private static void addInsertFileInformation(List<FileStatus> newFiles, Configuration conf,
       InsertEventRequestData insertData) throws IOException {
     LinkedList<Path> directories = null;
     for (FileStatus status : newFiles) {
@@ -3949,7 +3947,7 @@ private static void addInsertFileInformation(List<FileStatus> newFiles, FileSyst
         directories.add(status.getPath());
         continue;
       }
-      addInsertNonDirectoryInformation(status.getPath(), fileSystem, insertData);
+      addInsertNonDirectoryInformation(status.getPath(), conf, insertData);
     }
     if (directories == null) {
       return;
@@ -3958,7 +3956,7 @@ private static void addInsertFileInformation(List<FileStatus> newFiles, FileSyst
     // are some examples where we would have 1, or few, levels respectively.
     while (!directories.isEmpty()) {
       Path dir = directories.poll();
-      FileStatus[] contents = fileSystem.listStatus(dir);
+      FileStatus[] contents = dir.getFileSystem(conf).listStatus(dir);
       if (contents == null) {
         continue;
       }
@@ -3967,15 +3965,16 @@ private static void addInsertFileInformation(List<FileStatus> newFiles, FileSyst
           directories.add(status.getPath());
           continue;
         }
-        addInsertNonDirectoryInformation(status.getPath(), fileSystem, insertData);
+        addInsertNonDirectoryInformation(status.getPath(), conf, insertData);
       }
     }
   }
 
 
-  private static void addInsertNonDirectoryInformation(Path p, FileSystem fileSystem,
+  private static void addInsertNonDirectoryInformation(Path p, Configuration conf,
       InsertEventRequestData insertData) throws IOException {
     insertData.addToFilesAdded(p.toString());
+    FileSystem fileSystem = p.getFileSystem(conf);
     FileChecksum cksum = fileSystem.getFileChecksum(p);
     String acidDirPath = AcidUtils.getFirstLevelAcidDirPath(p.getParent(), fileSystem);
     // File checksum is not implemented for local filesystem (RawLocalFileSystem)
{noformat}
Though I haven't tried it yet for the use case mentioned.
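The core of the patch is that each path must resolve its own FileSystem 
instead of inheriting the one built from the table location. As a 
scheme-only illustration (plain Java, no Hadoop dependency; the URIs and 
class name are made up for this sketch), grouping paths by their own 
scheme/authority shows why one FileSystem cannot serve a mixed file list:

{code:java}
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerPathFs {
    // Hadoop's FileSystem cache keys on scheme + authority; paths with
    // different keys belong to different FileSystem instances.
    static String fsKey(URI u) {
        return u.getScheme() + "://" + (u.getAuthority() == null ? "" : u.getAuthority());
    }

    public static void main(String[] args) {
        // Hypothetical new files from a mixed-storage partitioned table.
        List<URI> newFiles = List.of(
                URI.create("hdfs://cluster/user/hadoop/htable/p=p2/000000_0"),
                URI.create("s3a://bucket/usr/hive/warehouse/htable/p=p1/000000_0"));

        // Group each file under the FileSystem its own URI resolves to,
        // rather than assuming the table location's FileSystem fits all.
        Map<String, List<URI>> perFs = new LinkedHashMap<>();
        for (URI f : newFiles) {
            perFs.computeIfAbsent(fsKey(f), k -> new ArrayList<>()).add(f);
        }
        System.out.println(perFs.keySet()); // [hdfs://cluster, s3a://bucket]
    }
}
{code}

With the table-location FileSystem (hdfs://cluster) applied to the s3a path, 
the listStatus/checksum calls in the event code would be talking to the wrong 
store, which matches the reported failure.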
{quote}Maybe not. I am not sure if viewfs can be used for the s3a protocol.
{quote}
ViewFs works over every FileSystem, though we haven't tested the latest Hive 
against it much. The path resolution that decides whether to copy or to 
rename might be broken in some places: Hive usually just compares the URI 
schemes to determine whether two FileSystems are the same, and hence whether 
to copy or rename, but for ViewFs we need to resolve the mount link first and 
then compare the schemes. If I remember right, there are places where that 
logic isn't in place for ViewFs.
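To make that concrete, here is the naive scheme comparison sketched in plain 
Java (no Hadoop dependency; the class name and URIs are invented for this 
example, and real Hive checks also involve authorities). A viewfs:// path 
never matches its hdfs:// target by scheme alone, even when the mount point 
resolves to the very same cluster, so a rename would wrongly degrade to a copy:

{code:java}
import java.net.URI;

public class SchemeCompare {
    // Naive check in the style described above: two locations are on the
    // "same" FileSystem iff their URI schemes match.
    static boolean sameFsByScheme(URI a, URI b) {
        return a.getScheme().equalsIgnoreCase(b.getScheme());
    }

    public static void main(String[] args) {
        URI target = URI.create("hdfs://cluster/warehouse/htable/p=p1");
        // Suppose this ViewFs mount actually points at the hdfs URI above.
        URI viaViewFs = URI.create("viewfs://ns/htable/p=p1");

        // Without resolving the ViewFs mount link first, the schemes differ.
        System.out.println(sameFsByScheme(target, viaViewFs)); // false
        System.out.println(sameFsByScheme(target, URI.create("hdfs://cluster/tmp"))); // true
    }
}
{code}

Resolving the mounted path first (Hadoop's FileSystem exposes a resolvePath() 
for this, which ViewFileSystem overrides to chase the mount table) and only 
then comparing would give the correct answer.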

> Single Hive table partitioning to multiple storage system- (e.g, S3 and HDFS)
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-27970
>                 URL: https://issues.apache.org/jira/browse/HIVE-27970
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 3.1.2
>            Reporter: zhixingheyi-tian
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: hive4_test_partition_on_s3.txt
>
>
> Single Hive/Datasource table partitioning to multiple storage systems 
> (e.g., S3 and HDFS)
> For Hive table:
>  
> {code:java}
> CREATE TABLE htable (a string, b string) PARTITIONED BY (p string) 
> location "hdfs://{cluster}/user/hadoop/htable/";
> alter table htable add partition(p='p1') location 
> 's3a://{bucketname}/usr/hive/warehouse/htable/p=p1';
> {code}
>  
> When inserting into htable, or doing insert overwrite on htable, new data 
> for "p=p1" is written to the table location's storage instead of the 
> partition location. This does not meet the requirements.
> Is there any best practice? Or is there a plan to support this feature?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
