matthiasdg opened a new issue, #6940:
URL: https://github.com/apache/hudi/issues/6940

   **Describe the problem you faced**
   
   We have existing tables ingested with the default settings of hudi 0.10.
   We synced them to hive metastore using the standalone sync tool. Typical 
partitioning scheme is by device, year, month, day.
   Sync command is with parameters like: 
   ```
   --base-path abfss://[email protected]/.../qwe.aq_msm \
   --database default --sync-mode hms --partitioned-by device,year,month,day 
--spark-datasource \
   --table qwe_aq_msm --jdbc-url thrift://localhost:9083 --user hive --pass 
hive \
   --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor"
   ```
   We load the table using a `SparkSession` with `enableHiveSupport()` and the 
correct metastore uri, then use `sparkSession.table(tableName)`.
   
   When we tried to upgrade to 0.12 and use the existing hive tables we noticed 
queries take way longer than before (x10).
   
   What we experience is: some `Metadata table was not found at path ...` 
messages and then it seems like nothing happens for e.g. 10 minutes. In the 
Spark UI, we can see a running query in the `SQL` tab, but there are no active 
jobs in the `Jobs` tab. When I ran this on my local machine, I noticed a pretty 
high upload meanwhile. When I enabled DEBUG logging, messages are continuously 
being generated like added in the Stacktrace section (getting FileStatuses).
   
   This probably has something to do with the changed behavior of the metadata 
table? If we write using 0.12, we noticed that the metadata table is now 
enabled by default (`metadata` folder inside `.hoodie` folder). It seems to 
work fine if we write, sync and read with hudi 0.12 and metadata enabled (did 
not do large scale tests though).
   Is it somehow possible to indicate we are not using metadata tables in 
combination with hive metastore, is there some configuration for this?
   I already tried adding a `hudi-defaults.conf` with parameters like 
`hoodie.metadata.enable` to false etc, but it seems not to have effect there.
   Resyncing the existing data with hudi's 0.12 synctool to another hive table 
does also not help. (Noticed there are parameters like 
`META_SYNC_USE_FILE_LISTING_FROM_METADATA` but you pbly first need metadata for 
that to have an effect).
   
   For tables without partitions the problem is not noticeable.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Ingest data with lots of partitions, without metadata table enabled (was 
using hudi 0.10 in our case)
   2. Sync to hive metastore (0.10 or 0.12)
   3. Load the created table using session.table and do some query (0.12)
   4. Query seems to hang after some `Metadata table was not found at path` 
messages, seems like lots of FileStatuses are fetched, eventually Spark Job 
gets launched (may take 10 minutes).
   
   **Expected behavior**
   Ability to load existing hive tables (or resync) without performance impact 
or do we need to rewrite data?
   
   
   **Environment Description**
   
   * Hudi version : upgraded from 0.10 to 0.12
   
   * Spark version : 3.1.2 (initially had also upgraded to 3.2, but wanted to 
rule this out so reverted)
   
   * Hive version : 2.3.9 comes with hudi libs, metastore is standalone 3.0
   
   * Hadoop version : 3.2.0 (with Spark 3.2 I had 3.3.2)
   
   * Storage (HDFS/S3/GCS..) : Azure Data Lake Gen 2
   
   * Running on Docker? (yes/no) : both with Spark in k8s mode or local mode.
   
   
   **Additional context**
   
   
   **Stacktrace**
   Log excerpt
   ```
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 193 Incrementing 
counter action_http_head_request by 1 with final value 13653
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 216 First 
execution of REST operation - GetPathStatus
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 216 First 
execution of REST operation - GetPathStatu
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 193 Incrementing 
counter connections_made by 1 with final value 14221
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 254 
Authenticating request with OAuth2 access token
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 193 Incrementing 
counter connections_made by 1 with final value 14222
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 51 Request Headers
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 254 Authenticating 
request with OAuth2 access token
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   
Accept-Charset=utf-8
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 51 Request Headers
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   
x-ms-version=2019-12-12
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   
Accept=application/json, application/octet-stream
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   
Accept-Charset=utf-8
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   
User-Agent=APN/1.0 Azure Blob FS/3.3.2 (AdoptOpenJDK JavaJRE 11.0.10; Linux 
5.15.0-48-generic/amd64; SunJSSE-11.0; UNKNOWN/UNKNOWN) unknown
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   
x-ms-version=2019-12-12
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   
x-ms-client-request-id=:92d1dc21-0ec6-4b94-9a27-8eb38684dd69:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   
Accept=application/json, application/octet-stream
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-21] 64   Content-Type=
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   
User-Agent=APN/1.0 Azure Blob FS/3.3.2 (AdoptOpenJDK JavaJRE 11.0.10; Linux 
5.15.0-48-generic/amd64; SunJSSE-11.0; UNKNOWN/UNKNOWN) unknown
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   
x-ms-client-request-id=:3ec30675-75be-4d3e-96b4-8cbdc410aad0:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
   [DEBUG] [14:21:04.147] [ForkJoinPool.commonPool-worker-9] 64   Content-Type=
   [DEBUG] [14:21:04.160] [ForkJoinPool.commonPool-worker-5] 51 Response Headers
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   HTTP 
Response=HTTP/1.1 200 OK
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-group=a3fbf5af-b6f5-4372-97f3-1d5ff16faa88
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-version=2019-12-12
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
Server=Windows-Azure-HDFS/1.0 Microsoft-HTTPAPI/2.0
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
Last-Modified=Fri, 05 Aug 2022 19:57:08 GMT
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   Date=Tue, 11 
Oct 2022 12:21:03 GMT
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-resource-type=file
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-server-encrypted=true
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
ETag="0x8DA771CAD82156B"
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-permissions=rwxrwx---+
   [DEBUG] [14:21:04.161] [ForkJoinPool.commonPool-worker-5] 64   
Content-Length=96
   [DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-request-id=a750d9cf-301f-002f-806b-dd6a93000000
   [DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-client-request-id=:86d903e4-25df-4d06-a2fe-433fd6665477:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
   [DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-owner=$superuser
   [DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 193 Incrementing 
counter get_responses by 1 with final value 14217
   [DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 193 Incrementing 
counter bytes_received by 0 with final value 1210
   [DEBUG] [14:21:04.162] [ForkJoinPool.commonPool-worker-5] 323 HttpRequest: 
GetPathStatus: 
200,,,cid=:86d903e4-25df-4d06-a2fe-433fd6665477:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0,rid=a750d9cf-301f-002f-806b-dd6a93000000,sent=0,recv=0,HEAD,https://stsdpglasshouse.dfs.core.windows.net/dev/datalakehouse/tml.tf_msm/39877/2021/06/07/.hoodie_partition_metadata?upn=false&action=getStatus&timeout=90
   [DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 193 Incrementing 
counter op_exists by 1 with final value 13647
   [DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 604 
AzureBlobFileSystem.getFileStatus path: 
abfss://[email protected]/datalakehouse/tml.tf_msm/39877/2021/06/08/.hoodie_partition_metadata
   [DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 193 Incrementing 
counter op_get_file_status by 1 with final value 13650
   [DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 933 getFileStatus 
filesystem: dev path: 
abfss://[email protected]/datalakehouse/tml.tf_msm/39877/2021/06/08/.hoodie_partition_metadata
 isNamespaceEnabled: true
   [DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 193 Incrementing 
counter action_http_head_request by 1 with final value 13654
   [DEBUG] [14:21:04.163] [ForkJoinPool.commonPool-worker-5] 216 First 
execution of REST operation - GetPathStatu
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 193 Incrementing 
counter connections_made by 1 with final value 14223
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 254 Authenticating 
request with OAuth2 access token
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 51 Request Headers
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   
Accept-Charset=utf-8
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 51 Response 
Headers
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-version=2019-12-12
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-7] 51 Response Headers
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   HTTP 
Response=HTTP/1.1 200 OK
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   
Accept=application/json, application/octet-stream
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-group=a3fbf5af-b6f5-4372-97f3-1d5ff16faa8
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   
User-Agent=APN/1.0 Azure Blob FS/3.3.2 (AdoptOpenJDK JavaJRE 11.0.10; Linux 
5.15.0-48-generic/amd64; SunJSSE-11.0; UNKNOWN/UNKNOWN) unknown
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-7] 64   HTTP 
Response=HTTP/1.1 200 OK
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-version=2019-12-12
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   
x-ms-client-request-id=:65bc8c0c-541b-473b-ac05-76a762de7918:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-17] 64   
Server=Windows-Azure-HDFS/1.0 Microsoft-HTTPAPI/2.0
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-5] 64   Content-Type=
   [DEBUG] [14:21:04.164] [ForkJoinPool.commonPool-worker-7] 64   
x-ms-group=a3fbf5af-b6f5-4372-97f3-1d5ff16faa88
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
Last-Modified=Sat, 03 Sep 2022 00:06:12 GMT
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   Date=Tue, 11 
Oct 2022 12:21:03 GMT
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   
x-ms-version=2019-12-12
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-resource-type=file
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   
Server=Windows-Azure-HDFS/1.0 Microsoft-HTTPAPI/2.0
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-server-encrypted=true
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
ETag="0x8DA8D401C66C479"
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   
Last-Modified=Fri, 05 Aug 2022 20:09:16 GMT
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-permissions=rw-r-----+
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
Content-Length=96
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   Date=Tue, 11 
Oct 2022 12:21:03 GMT
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-request-id=3754b849-801f-0048-126b-dd7a6f000000
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-client-request-id=:b458b6e8-8bfd-4122-874d-8e1c9b58cd0a:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   
x-ms-resource-type=file
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 64   
x-ms-owner=fbab02f2-d7a0-4f78-ab47-349acf6f275
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-17] 193 Incrementing 
counter get_responses by 1 with final value 14218
   [DEBUG] [14:21:04.165] [ForkJoinPool.commonPool-worker-7] 64   
x-ms-server-encrypted=true
   [DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 193 Incrementing 
counter bytes_received by 0 with final value 1210
   [DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-7] 64   
ETag="0x8DA771E5FD67CCB"
   [DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 323 HttpRequest: 
GetPathStatus: 
200,,,cid=:b458b6e8-8bfd-4122-874d-8e1c9b58cd0a:37ae89c9-dba1-4490-9431-ef3246a82c13:::GF:0,rid=3754b849-801f-0048-126b-dd7a6f000000,sent=0,recv=0,HEAD,https://stsdpglasshouse.dfs.core.windows.net/dev/datalakehouse/tml.tf_msm/9000001550/2022/09/02/.hoodie_partition_metadata?upn=false&action=getStatus&timeout=90
   [DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-7] 64   
x-ms-permissions=rwxrwx---+
   [DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-7] 64   
Content-Length=96
   [DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 193 Incrementing 
counter op_exists by 1 with final value 13648
   [DEBUG] [14:21:04.166] [ForkJoinPool.commonPool-worker-17] 604 
AzureBlobFileSystem.getFileStatus path: 
abfss://[email protected]/datalakehouse/tml.tf_msm/9000001550/2022/09/03/.hoodie_partition_metadata
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to