[ https://issues.apache.org/jira/browse/HIVE-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300233#comment-17300233 ]
Sungwoo commented on HIVE-24849:
--------------------------------

I think both HiveServer2 and Metastore attempt connections to S3 when 'create external table' is executed with a concrete location. Here is part of the MinIO trace log captured while 'create external table' runs:

{code}
0: jdbc:hive2://orange1:444/> CREATE EXTERNAL TABLE t2 (value string) PARTITIONED BY (`date` string, shard string) LOCATION 's3a://tmp/load/';

01:29:41.889 [404 Not Found] s3.HeadObject orange0:9000/tmp/load 192.168.10.2 667µs ↑ 153 B ↓ 196 B
01:29:41.893 [404 Not Found] s3.HeadObject orange0:9000/tmp/load/ 192.168.10.2 749µs ↑ 153 B ↓ 196 B
01:29:41.897 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=1&prefix=load%2F&fetch-owner=false 192.168.10.2 1.788ms ↑ 153 B ↓ 611 B
01:29:41.903 [404 Not Found] s3.HeadObject orange0:9000/tmp/load 192.168.10.2 561µs ↑ 153 B ↓ 196 B
01:29:41.906 [404 Not Found] s3.HeadObject orange0:9000/tmp/load/ 192.168.10.2 714µs ↑ 153 B ↓ 196 B
01:29:41.910 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=1&prefix=load%2F&fetch-owner=false 192.168.10.2 1.079ms ↑ 153 B ↓ 611 B
01:29:41.914 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=5000&prefix=load%2F&fetch-owner=false 192.168.10.2 10.547ms ↑ 153 B ↓ 6.8 KiB
01:29:41.962 [200 OK] s3.HeadBucket orange0:9000/tmp/ 192.168.10.2 507µs ↑ 153 B ↓ 189 B
01:29:41.983 [404 Not Found] s3.HeadObject orange0:9000/tmp/load 192.168.10.2 419µs ↑ 153 B ↓ 196 B
01:29:41.985 [404 Not Found] s3.HeadObject orange0:9000/tmp/load/ 192.168.10.2 810µs ↑ 153 B ↓ 196 B
01:29:41.989 [200 OK] s3.ListObjectsV2 orange0:9000/tmp/?list-type=2&delimiter=%2F&max-keys=1&prefix=load%2F&fetch-owner=false 192.168.10.2 1.415ms ↑ 153 B ↓ 611 B
{code}

The problem here is that HiveServer2 and Metastore send ListObjectsV2 requests, which are equivalent to executing 'aws s3api list-objects-v2 --bucket ...' and thus make a recursive traversal of the destination bucket. As a result, the execution time is proportional to the number of entries inside the bucket.

For HiveServer2, we can update the toTable() method in CreateTableDesc.java. It has this code:

{code:java}
if (!this.isCTAS && (tbl.getPath() == null || (tbl.isEmpty() && !isExternal()))) {
{code}

This code is bad because it calls tbl.isEmpty(), which sends a ListObjectsV2 request, only to return false in the end because isExternal() evaluates to true. We can revise the code as follows:

{code:java}
if (!this.isCTAS && (tbl.getPath() == null || (!isExternal() && tbl.isEmpty()))) {
{code}

Now HiveServer2 does not send ListObjectsV2 requests. The Metastore, however, still sends a ListObjectsV2 request from Warehouse.isDir(). So revising the toTable() method in CreateTableDesc.java is only a partial solution.
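For reference, the HEAD/HEAD/LIST triple in the trace is the probe sequence the Hadoop S3A client emits when it resolves a path that exists only as a key prefix: HeadObject on the key, HeadObject on the key plus a trailing slash, and, when both return 404, a ListObjectsV2 with the key as prefix. A minimal sketch of such a probe, assuming hadoop-aws on the classpath and a MinIO endpoint like the one in the trace (the credential values are placeholders):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: resolves an s3a location the way HiveServer2/Metastore do
// when validating a table path. Endpoint and credentials are assumptions
// for a local MinIO setup, not values from the issue.
public class ProbeLocation {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.endpoint", "http://orange0:9000"); // MinIO endpoint from the trace
    conf.set("fs.s3a.path.style.access", "true");       // MinIO serves path-style URLs
    conf.set("fs.s3a.access.key", "minioadmin");        // placeholder credentials
    conf.set("fs.s3a.secret.key", "minioadmin");

    Path location = new Path("s3a://tmp/load/");
    FileSystem fs = FileSystem.get(location.toUri(), conf);
    // exists() resolves the path status: HeadObject, HeadObject, ListObjectsV2,
    // matching the request triple visible in the MinIO trace above.
    System.out.println("exists: " + fs.exists(location));
  }
}
{code}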
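To see why the reordering helps, here is a minimal standalone sketch. The class is hypothetical; isEmpty/isExternal/isCTAS only mirror the names in the Hive code. With Java's short-circuiting &&, placing the cheap !isExternal() test first means the expensive listing never runs for an external table:

{code:java}
public class ShortCircuitDemo {
  static boolean isCTAS = false;
  static boolean isExternal = true; // external table, as in the CREATE EXTERNAL TABLE above

  // Stand-in for Table.isEmpty(): on an s3a location this sends a
  // ListObjectsV2 request that traverses the whole prefix, so on a bucket
  // with many entries it is the expensive call to avoid.
  static boolean isEmpty() {
    System.out.println("ListObjectsV2 sent");
    return false;
  }

  public static void main(String[] args) {
    boolean pathIsNull = false; // stand-in for tbl.getPath() == null

    // Original operand order: isEmpty() always runs and its ListObjectsV2
    // cost is paid, only for the result to be discarded because
    // !isExternal evaluates to false.
    boolean before = !isCTAS && (pathIsNull || (isEmpty() && !isExternal));

    // Revised operand order: && short-circuits on !isExternal, so for an
    // external table isEmpty() -- and the S3 round trip -- never happens.
    boolean after = !isCTAS && (pathIsNull || (!isExternal && isEmpty()));

    System.out.println("before=" + before + ", after=" + after);
  }
}
{code}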
> Create external table socket timeout when location has large number of files
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-24849
>                 URL: https://issues.apache.org/jira/browse/HIVE-24849
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 2.3.4
>         Environment: AWS EMR 5.23 with default Hive metastore and external location S3
>            Reporter: Mithun Antony
>            Priority: Major
>
> # The create table API call times out during external table creation on a location where the number of files in the S3 location is large (i.e. ~10K objects).
> The default timeout `hive.metastore.client.socket.timeout` is `600s`; the current workaround is to increase the timeout to a higher value (see the sketch after the reproduction steps below).
> {code:java}
> 2021-03-04T01:37:42,761 ERROR [66b8024b-e52f-42b8-8629-a45383bcac0c main([])]: exec.DDLTask (DDLTask.java:failed(639)) - org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
>     at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:873)
>     at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:878)
>     at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>     at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>     at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>     at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
>     at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
>     at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
>     at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>     at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>     at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>     at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>     at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>     at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1199)
>     at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1185)
>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2399)
>     at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:93)
>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:752)
>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:740)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
>     at com.sun.proxy.$Proxy37.createTable(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2330)
>     at com.sun.proxy.$Proxy37.createTable(Unknown Source)
>     at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:863)
>     ... 25 more
> Caused by: java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>     at java.net.SocketInputStream.read(SocketInputStream.java:171)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
>     ... 49 more{code}
> Our understanding is that create table should not check the existing files/partitions; the validation is done when the table is read.
> The issue can be reproduced by executing the statement below in the Hive CLI, where the location should contain ~10k objects:
> {code:java}
> SET hive.stats.autogather=false;
> DROP table if exists tTable;
> CREATE EXTERNAL TABLE tTable(
>   id string,
>   mId string,
>   wd bigint,
>   lang string,
>   tcountry string,
>   vId bigint
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
> LOCATION "s3://<your bucket>/"
> TBLPROPERTIES ('serialization.null.format' = '');
> {code}
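As for the workaround the description mentions, a minimal sketch of raising hive.metastore.client.socket.timeout programmatically follows; the 3600s value is an arbitrary example, and in a real deployment the property would normally be set in hive-site.xml before the client starts:

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;

// Sketch only: shows where the timeout knob lives, not a recommended value.
public class RaiseClientTimeout {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Default is 600s; a larger value gives the metastore time to finish
    // its recursive S3 listing before the Thrift client gives up.
    conf.set("hive.metastore.client.socket.timeout", "3600s");
    System.out.println(conf.get("hive.metastore.client.socket.timeout"));
  }
}
{code}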