[ 
https://issues.apache.org/jira/browse/HIVE-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372145#comment-17372145
 ] 

Steve Loughran commented on HIVE-24849:
---------------------------------------


Something like this:
  
* the existence check is integrated into the LIST call, saving one request 
against all stores, two for S3
* listing is incremental, so it can determine that a dir is not empty without 
processing all the output,
  at least on those stores which do the listings incrementally (hdfs, webhdfs, 
s3a, abfs).
  On the others it is no slower than listStatus, which is what it calls 
internally.
  
{code}


  public boolean isEmpty() throws HiveException {
    Preconditions.checkNotNull(getPath());
    try {
      FileSystem fs = FileSystem.get(getPath().toUri(),
          SessionState.getSessionConf());
      RemoteIterator<FileStatus> it = fs.listStatusIterator(getPath());
      while (it.hasNext()) {
        FileStatus status = it.next();
        if (FileUtils.HIDDEN_FILES_PATH_FILTER.accept(status.getPath())) {
          // a visible (non-hidden) entry exists: the dir is not empty
          return false;
        }
      }
      return true;
    } catch (FileNotFoundException e) {
      // the path doesn't exist, so there is nothing underneath it
      return true;
    } catch (IOException e) {
      throw new HiveException(e);
    }
  }
  {code}


For isDir(), I'd just call FileSystem.isDirectory(path) and ignore the 
deprecation warning, which is there to make people look at their uses and 
wonder if there are more efficient ways. (Too often app code calls isFile(), 
isDir() or exists() before some API call which would just raise a 
FileNotFoundException anyway.)

What I would recommend doing is looking at the uses of isDir() and asking 
whether they could be eliminated entirely, as in the sketch below.
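
A minimal sketch of that elimination ({{loadDirEntries}} is a hypothetical 
helper for illustration, not existing Hive code): go straight to the 
operation instead of probing first.

{code}
  // Rather than:
  //   if (fs.isDirectory(path)) { entries = fs.listStatus(path); }
  // which costs an extra round trip per call, invoke the operation directly;
  // listStatus raises FileNotFoundException itself if the path is missing.
  private static FileStatus[] loadDirEntries(FileSystem fs, Path path)
      throws IOException {
    try {
      return fs.listStatus(path);
    } catch (FileNotFoundException e) {
      // path absent: treat as an empty listing, no exists()/isDir() pre-check
      return new FileStatus[0];
    }
  }
{code}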

> Create external table socket timeout when location has large number of files
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-24849
>                 URL: https://issues.apache.org/jira/browse/HIVE-24849
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 2.3.4
>         Environment: AWS EMR 5.23 with default Hive metastore and external 
> location S3
>  
>            Reporter: Mithun Antony
>            Priority: Major
>
> # The create table API call times out during external table creation on a 
> location where the number of files in the S3 location is large (i.e. ~10K 
> objects).
> The default timeout `hive.metastore.client.socket.timeout` is `600s`; the 
> current workaround is to increase the timeout to a higher value.
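> For example (a sketch; the `3600s` value is illustrative, not a 
> recommendation):
> {code:java}
> -- workaround: raise the metastore client socket timeout for the session
> SET hive.metastore.client.socket.timeout=3600s;
> {code}
> The failure looks like this: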
> {code:java}
> 2021-03-04T01:37:42,761 ERROR [66b8024b-e52f-42b8-8629-a45383bcac0c 
> main([])]: exec.DDLTask (DDLTask.java:failed(639)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:873)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:878)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>  at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>  at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
>  at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
>  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1199)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1185)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2399)
>  at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:93)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:752)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:740)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
>  at com.sun.proxy.$Proxy37.createTable(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2330)
>  at com.sun.proxy.$Proxy37.createTable(Unknown Source)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:863)
>  ... 25 more
> Caused by: java.net.SocketTimeoutException: Read timed out
>  at java.net.SocketInputStream.socketRead0(Native Method)
>  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>  at java.net.SocketInputStream.read(SocketInputStream.java:171)
>  at java.net.SocketInputStream.read(SocketInputStream.java:141)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>  at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>  at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>  at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
>  ... 49 more{code}
>  
> Our understanding is that create table should not check the existing 
> files/partitions; that validation is done when the table is read.
> The issue can be reproduced by executing the below statement in the Hive 
> CLI, where the location should contain ~10K objects:
>  
> {code:java}
> SET hive.stats.autogather=false;
> DROP table if exists tTable;
> CREATE EXTERNAL TABLE tTable(
>    id string,
>    mId string,
>    wd bigint,
>    lang string,
>    tcountry string,
>    vId bigint
> ) 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
> LOCATION "s3://<your bucket>/"
> TBLPROPERTIES ('serialization.null.format' = ''); {code}
>  


