[GitHub] [iceberg] NageshB82 opened a new issue #2302: Reading from iceberg table through spark thrift server using jdbc taking larger time

GitBox Mon, 08 Mar 2021 03:46:18 -0800


NageshB82 opened a new issue #2302:
URL: https://github.com/apache/iceberg/issues/2302



   Environment 
    - Spark 3.0.1
    - Apache Hive v2.3.* (Hive server using Derby DB internally for the metastore)
    - Hadoop v3.2.*
    - MinIO (Docker)
   
   Recreation Steps : 
   - Started the Hive server metastore.
   - Created tables in Iceberg using Scala code:
      - CREATE TABLE if not exists Account(accountId string, name string, description string) USING iceberg TBLPROPERTIES('engine.hive.enabled'='true', 'write.parquet.compression-codec'='snappy/gzip')
      
   - Loaded almost 400k json parquet records into Iceberg; they loaded into the Iceberg table pretty quickly.
   - Started the Spark Thrift server using <SPARK_HOME>/bin/start-thriftserver.sh
   - Wrote a JDBC program to read the Iceberg table through the Thrift server.
   - Any query takes around 8-10 seconds, depending on the limit we query with (e.g. 200 to 20000 records per call).
   We checked the Thrift server logs to see where most of the time is spent. It looks like most of it, about 5-7 seconds, goes to fetching the Parquet files of the Iceberg table and decompressing them, whether they are stored in S3 (MinIO) or on the local FS.
   
   `21/03/03 00:27:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
   21/03/03 00:27:37 INFO HadoopRDD: Input split: null:0+0
   21/03/03 00:27:38 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:39 INFO CodeGenerator: Code generated in 182.353266 ms
   21/03/03 00:27:39 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:41 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:41 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:42 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:42 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:42 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:43 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:43 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:43 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:44 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:44 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:45 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:45 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:46 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:46 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:46 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:46 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:47 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:47 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:48 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:48 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:49 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/03 00:27:49 INFO CodecPool: Got brand-new decompressor [.gz]
   21/03/03 00:27:49 INFO MemoryStore: Block taskresult_0 stored as bytes in memory (estimated size 10.9 MiB, free 422.8 MiB)
   21/03/03 00:27:49 INFO BlockManagerInfo: Added taskresult_0 in memory on fuddled1.fyre.ibm.com:36922 (size: 10.9 MiB, free: 423.4 MiB)
   21/03/03 00:27:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 11425022 bytes result sent via BlockManager)`
   
   We tried changing the compression to 'snappy' instead of the default 'gzip'. Both take the same amount of time to query the Iceberg table.
   
   `21/03/05 00:15:11 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
   21/03/05 00:15:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, fuddled1.fyre.ibm.com, executor driver, partition 0, ANY, 101471 bytes)
   21/03/05 00:15:11 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
   21/03/05 00:15:11 INFO HadoopRDD: Input split: null:0+0
   21/03/05 00:15:11 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:12 INFO CodeGenerator: Code generated in 189.91119 ms
   21/03/05 00:15:12 INFO CodecPool: Got brand-new decompressor [.snappy]
   21/03/05 00:15:15 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:16 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:17 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:17 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:18 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:19 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:19 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:20 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:21 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:22 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:22 INFO S3AInputStream: Switching to Random IO seek policy
   21/03/05 00:15:23 INFO MemoryStore: Block taskresult_0 stored as bytes in memory (estimated size 10.9 MiB, free 422.8 MiB)
   21/03/05 00:15:23 INFO BlockManagerInfo: Added taskresult_0 in memory on fuddled1.fyre.ibm.com:42090 (size: 10.9 MiB, free: 423.4 MiB)
   21/03/05 00:15:23 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 11425022 bytes result sent via BlockManager)`
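
   To put numbers on log excerpts like the ones above, a small script can report the elapsed time of a task and how many messages each logger emitted (for example, how many times `S3AInputStream` switched seek policy). This is only a rough sketch: the function name `summarize` and the embedded `SAMPLE_LOG` are illustrative, the sample is a shortened copy of the gzip run, and the parsing assumes the `yy/MM/dd HH:mm:ss LEVEL Logger: message` layout seen in these logs.

```python
from datetime import datetime

# Shortened sample taken from the gzip run in this report (illustrative only).
SAMPLE_LOG = """\
21/03/03 00:27:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/03 00:27:38 INFO S3AInputStream: Switching to Random IO seek policy
21/03/03 00:27:39 INFO CodecPool: Got brand-new decompressor [.gz]
21/03/03 00:27:41 INFO S3AInputStream: Switching to Random IO seek policy
21/03/03 00:27:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0).
"""


def summarize(log_text):
    """Return (elapsed_seconds, message_count_per_logger) for a log excerpt."""
    stamps = []
    counts = {}
    for raw in log_text.splitlines():
        # Drop indentation and any Markdown backticks around the excerpt.
        line = raw.strip().strip("`")
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1], "%y/%m/%d %H:%M:%S")
        except ValueError:
            # Not a timestamped line (e.g. a wrapped continuation); skip it.
            continue
        stamps.append(ts)
        logger = parts[3].rstrip(":")
        counts[logger] = counts.get(logger, 0) + 1
    elapsed = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return elapsed, counts


if __name__ == "__main__":
    elapsed, counts = summarize(SAMPLE_LOG)
    print("elapsed seconds:", elapsed)
    for logger, n in sorted(counts.items()):
        print(logger, n)
```

   Running the same function over the full gzip and snappy excerpts makes it easy to compare the two runs side by side, for instance the number of `S3AInputStream` messages against the total task time.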
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


