bhavya-ganatra opened a new issue, #18139:
URL: https://github.com/apache/hudi/issues/18139
### Bug Description
**What happened:**
I am using Hudi with AWS Glue sync, with the following Glue-sync-related configurations:
```java
options.put("hoodie.datasource.hive_sync.enable", "true");
options.put("hoodie.datasource.hive_sync.mode", "hms");
options.put("hoodie.datasource.hive_sync.database", glueDatabase);
options.put("hoodie.datasource.hive_sync.table", tableName);
options.put("hoodie.datasource.hive_sync.partition_fields", getPartitionKeyFieldName(config));
options.put("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor");
options.put("hoodie.datasource.hive_sync.use_jdbc", "false");
options.put("hoodie.datasource.meta.sync.enable", "true");
options.put("hoodie.datasource.meta_sync.condition.sync", "true");
options.put("hoodie.meta.sync.client.tool.class", "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool");
options.put("hoodie.datasource.hive_sync.auto_create_database", "true");
options.put("hoodie.datasource.hive_sync.support_timestamp", "true");
options.put("hoodie.datasource.hive_sync.create_managed_table", "false");
options.put("hoodie.datasource.hive_sync.skip_ro_suffix", "false");
```
I am passing glueDatabase = `staging_lh_dynamodb_tenant_123` and table `content`, which corresponds to the S3 path `s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content`.
There is also another database in Glue, named `default`, with a table `content` that corresponds to the S3 path `s3a://bucket-perf-scale/lakehouse/dd/data/51/content`.
When I ingested a few upserts via my Spark job, the Hudi write to bucket-staging succeeded. But afterwards, when it tried to sync with AWS Glue, it used the `default` database's `content` table and its S3 path on bucket-perf-scale.
Since our IAM roles are scoped so that a service writing to one bucket has no access to any other bucket, this threw a permission exception, which is expected.
But I wanted to know why it was trying to use the `default` database's `content` table at all, even though I had passed the database name via the `hoodie.datasource.hive_sync.database` property.
Based on the stack trace, I found the code block below in the Spark library. It passes only `tableName`, so the lookup may fall back to the `default` database. But I am not sure if this is the exact culprit; I might be missing some configuration.
Potential cause:
```java
// Spark library function (decompiled) - passes tableName only
public void refreshTable(final String tableName) {
    LogicalPlan relation =
        this.org$apache$spark$sql$internal$CatalogImpl$$sparkSession.table(tableName).queryExecution().analyzed();
    relation.refresh();
    this.invalidateCache$1(relation);
    this.org$apache$spark$sql$internal$CatalogImpl$$sparkSession.sharedState().cacheManager()
        .recacheByPlan(this.org$apache$spark$sql$internal$CatalogImpl$$sparkSession, relation);
}
```
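From what I can tell, `refreshTable` resolves the given name through the session catalog, so an unqualified name falls back to the session's current database (usually `default`). In the stack trace below, this refresh appears to be triggered from `HoodieSparkSqlWriterInternal.metaSync`. A minimal sketch of the difference, assuming a plain `SparkSession` wired to the Glue catalog (an illustration only, not the actual Hudi code path):

```java
import org.apache.spark.sql.SparkSession;

public class RefreshTableSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("refresh-table-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // Unqualified name: resolved against the current database (typically "default"),
        // which in my setup points at bucket-perf-scale and fails with AccessDenied.
        spark.catalog().refreshTable("content");

        // Database-qualified name: resolves to the table the sync actually updated.
        spark.catalog().refreshTable("staging_lh_dynamodb_tenant_123.content");
    }
}
```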
Note that my writer writes to multiple different S3 locations/tables, so it is not possible to set the database name as a Spark-session-level configuration. I need a dynamic config that applies only to the write in question, roughly as in the sketch below.
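For context, this is roughly how the options above are applied per write in my writer (a simplified sketch; `df`, `options`, and `basePath` stand in for my writer's actual inputs):

```java
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class HudiWriteSketch {
    // Simplified sketch: the Glue-sync options (including
    // hoodie.datasource.hive_sync.database) are passed per write rather than at the
    // Spark session level, because each write may target a different database/table.
    static void writeToHudi(Dataset<Row> df, Map<String, String> options, String basePath) {
        df.write()
          .format("hudi")
          .options(options)  // e.g. hoodie.datasource.hive_sync.database=staging_lh_dynamodb_tenant_123
          .mode(SaveMode.Append)
          .save(basePath);   // e.g. s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content
    }
}
```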
**What you expected:**
If a database name is passed via `hoodie.datasource.hive_sync.database`, the sync should not fall back to the `default` database; it should use that property.
**Steps to reproduce:**
1. Enable inline Glue sync in the writer configuration.
2. Ingest some data so the Glue database gets created automatically.
3. Create a table with the same name, e.g. `content`, under the `default` database.
4. Ingest a few more records so Glue sync gets triggered.
5. Validate that the write succeeds but Glue sync may fail because of the same-named table under the `default` database.

[Note: the table can also be created earlier, i.e. step 2 can be skipped.]
Note that the Glue database `staging_lh_dynamodb_tenant_123` was created automatically by the Glue sync itself, while the `content` table under the `default` database was created manually.
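If the unqualified table-name resolution is indeed the issue, one workaround I can think of (untested, and awkward for a writer that targets many databases from one session) would be to point the session's current database at the sync target before each write:

```java
import org.apache.spark.sql.SparkSession;

public class CurrentDatabaseWorkaround {
    // Untested sketch: switch the session's current database before the write so that
    // an unqualified refreshTable("content") resolves to the intended Glue database.
    // Fragile if a single session writes to several databases concurrently.
    static void useSyncDatabase(SparkSession spark, String glueDatabase) {
        spark.catalog().setCurrentDatabase(glueDatabase);
    }
}
```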
### Environment
**Hudi version:** 1.1.0
**Query engine:** Spark 3.5.6
**Relevant configs:**
Below are all the Hudi options provided during the write:
```
Hudi options specified: {hoodie.write.lock.dynamodb.region=us-west-2,
hoodie.metrics.pushgateway.report.labels=cluster:staging,namespace:lh-dynamodb,
hoodie.datasource.hive_sync.partition_fields=, hoodie.index.type=BUCKET,
hoodie.clean.automatic=false,
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider,
hoodie.compact.inline=false, hoodie.datasource.write.recordkey.field=cid,
hoodie.datasource.hive_sync.create_managed_table=false,
hoodie.metadata.enable=true, hoodie.datasource.write.table.type=MERGE_ON_READ,
hoodie.datasource.write.keygenerator.type=NON_PARTITION,
hoodie.parquet.small.file.limit=134217728,
hoodie.write.lock.dynamodb.partition_key=content-123,
hoodie.cleaner.commits.retained=2, hoodie.datasource.meta.sync.enable=true,
hoodie.write.lock.dynamodb.table=hudi-lock-staging-lh-dynamodb-table,
hoodie.write.lock.wait_time_ms=60000, hoodie.memory.spillable.map.path=/tmp/,
hoodie.write.record.merge.mode=CUSTOM, hoodie.table.cdc.enabled=true,
hoodie.datasource.hive_sync.use_jdbc=false,
hoodie.delete.shuffle.parallelism=4, hoodie.table.name=content,
hoodie.table.log.file.format=parquet, hoodie.upsert.shuffle.parallelism=4,
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool,
hoodie.datasource.hive_sync.skip_ro_suffix=false,
hoodie.datasource.write.precombine.field=processing_timestamp,
hoodie.insert.shuffle.parallelism=4, hoodie.cleaner.policy.failed.writes=LAZY,
hoodie.datasource.compaction.async.enable=false,
hoodie.datasource.write.operation=upsert,
hoodie.logfile.data.block.format=parquet,
hoodie.datasource.hive_sync.table=content, hoodie.index.bucket.engine=SIMPLE,
hoodie.datasource.meta_sync.condition.sync=true,
hoodie.metadata.index.column.stats.enable=false,
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor,
hoodie.write.record.merge.custom.implementation.classes=com.xyz.lhcommons.hudi.merger.CustomHoodieSparkRecordMerger,
hoodie.datasource.hive_sync.auto_create_database=true,
hoodie.datasource.hive_sync.mode=hms, hoodie.parquet.max.file.size=268435456,
hoodie.datasource.hive_sync.support_timestamp=true,
hoodie.datasource.write.hive_style_partitioning=true,
hoodie.datasource.hive_sync.enable=true,
hoodie.bulkinsert.shuffle.parallelism=4,
hoodie.cleaner.policy=KEEP_LATEST_COMMITS,
hoodie.metrics.pushgateway.job.name=lhwriter,
hoodie.write.lock.dynamodb.endpoint_url=dynamodb.us-west-2.amazonaws.com,
hoodie.table.cdc.supplemental.logging=data_before_after,
hoodie.write.concurrency.mode=optimistic_concurrency_control,
hoodie.memory.merge.fraction=0.2, hoodie.datasource.write.partitionpath.field=,
hoodie.bucket.index.num.buckets=17,
hoodie.write.record.merge.strategy.id=749c702e-ffa2-43da-ab2e-9c43d5080754,
hoodie.datasource.hive_sync.database=staging_lh_dynamodb_tenant_123}
```
### Logs and Stack Trace
Below is the full exception trace and a few relevant logs:
```
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Last commit time synced was found to be
20260204194748073, last commit completion time is found to be 20260204194759976
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Sync all partitions given the last commit
time synced is empty or before the start of the active timeline. Listing all
partitions in s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content, file
system: S3AFileSystem{uri=s3a://bucket-staging,
workingDir=s3a://bucket-staging/user/hadoop, partSize=104857600,
enableMultiObjectsDelete=true, maxKeys=5000, performanceFlags={},
OpenFileSupport{changePolicy=ETagChangeDetectionPolicy mode=Server,
defaultReadAhead=65536, defaultBufferSize=65536,
defaultAsyncDrainThreshold=16000, defaultInputPolicy=default},
blockSize=33554432, multiPartThreshold=134217728, s3EncryptionAlgorithm='NONE',
blockFactory=org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@71dfb684,
auditManager=Service ActiveAuditManagerS3A in state ActiveAuditManagerS3A:
STARTED, auditor=LoggingAuditor{ID='424ad7
b7-e6cb-408b-9eaf-ecfd7b7d8caa', headerEnabled=true, rejectOutOfSpan=false,
isMultipartUploadEnabled=true}}, authoritativePath=[], useListV1=false,
magicCommitter=true,
boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=224,
available=224, waiting=0}, activeCount=0},
unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@70038363[Running,
pool size = 59, active threads = 0, queued tasks = 0, completed tasks = 129],
credentials=AWSCredentialProviderList name=; refcount= 1; size=1:
[DefaultCredentialsProvider(providerChain=LazyAwsCredentialsProvider(delegate=Lazy(value=AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(),
EnvironmentVariableCredentialsProvider(),
WebIdentityTokenCredentialsProvider(),
ProfileCredentialsProvider(profileName=default),
ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]))))] last
provider:
DefaultCredentialsProvider(providerChain=LazyAwsCredentialsProvider(deleg
ate=Lazy(value=AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(),
EnvironmentVariableCredentialsProvider(),
WebIdentityTokenCredentialsProvider(),
ProfileCredentialsProvider(profileName=default),
ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()])))),
delegation tokens=disabled, DirectoryMarkerRetention{policy='keep'},
instrumentation {S3AInstrumentation{}}, ClientSideEncryption=false}
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Sync complete for content_ro
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Trying to sync hoodie table content_rt with
base path s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content of type
MERGE_ON_READ
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - No Schema difference for content_rt.
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Last commit time synced was found to be
20260204194748073, last commit completion time is found to be 20260204194759976
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Sync all partitions given the last commit
time synced is empty or before the start of the active timeline. Listing all
partitions in s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content, file
system: S3AFileSystem{uri=s3a://bucket-staging,
workingDir=s3a://bucket-staging/user/hadoop, partSize=104857600,
enableMultiObjectsDelete=true, maxKeys=5000, performanceFlags={},
OpenFileSupport{changePolicy=ETagChangeDetectionPolicy mode=Server,
defaultReadAhead=65536, defaultBufferSize=65536,
defaultAsyncDrainThreshold=16000, defaultInputPolicy=default},
blockSize=33554432, multiPartThreshold=134217728, s3EncryptionAlgorithm='NONE',
blockFactory=org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@71dfb684,
auditManager=Service ActiveAuditManagerS3A in state ActiveAuditManagerS3A:
STARTED, auditor=LoggingAuditor{ID='424ad7
b7-e6cb-408b-9eaf-ecfd7b7d8caa', headerEnabled=true, rejectOutOfSpan=false,
isMultipartUploadEnabled=true}}, authoritativePath=[], useListV1=false,
magicCommitter=true,
boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=224,
available=224, waiting=0}, activeCount=0},
unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@70038363[Running,
pool size = 59, active threads = 0, queued tasks = 0, completed tasks = 129],
credentials=AWSCredentialProviderList name=; refcount= 1; size=1:
[DefaultCredentialsProvider(providerChain=LazyAwsCredentialsProvider(delegate=Lazy(value=AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(),
EnvironmentVariableCredentialsProvider(),
WebIdentityTokenCredentialsProvider(),
ProfileCredentialsProvider(profileName=default),
ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]))))] last
provider:
DefaultCredentialsProvider(providerChain=LazyAwsCredentialsProvider(deleg
ate=Lazy(value=AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(),
EnvironmentVariableCredentialsProvider(),
WebIdentityTokenCredentialsProvider(),
ProfileCredentialsProvider(profileName=default),
ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()])))),
delegation tokens=disabled, DirectoryMarkerRetention{policy='keep'},
instrumentation {S3AInstrumentation{}}, ClientSideEncryption=false}
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Sync complete for content_rt
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Trying to sync hoodie table content with
base path s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content of type
MERGE_ON_READ
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - No Schema difference for content.
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Last commit time synced was found to be
20260204194748073, last commit completion time is found to be 20260204194759976
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Sync all partitions given the last commit
time synced is empty or before the start of the active timeline. Listing all
partitions in s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content, file
system: S3AFileSystem{uri=s3a://bucket-staging,
workingDir=s3a://bucket-staging/user/hadoop, partSize=104857600,
enableMultiObjectsDelete=true, maxKeys=5000, performanceFlags={},
OpenFileSupport{changePolicy=ETagChangeDetectionPolicy mode=Server,
defaultReadAhead=65536, defaultBufferSize=65536,
defaultAsyncDrainThreshold=16000, defaultInputPolicy=default},
blockSize=33554432, multiPartThreshold=134217728, s3EncryptionAlgorithm='NONE',
blockFactory=org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@71dfb684,
auditManager=Service ActiveAuditManagerS3A in state ActiveAuditManagerS3A:
STARTED, auditor=LoggingAuditor{ID='424ad7
b7-e6cb-408b-9eaf-ecfd7b7d8caa', headerEnabled=true, rejectOutOfSpan=false,
isMultipartUploadEnabled=true}}, authoritativePath=[], useListV1=false,
magicCommitter=true,
boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=224,
available=224, waiting=0}, activeCount=0},
unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@70038363[Running,
pool size = 59, active threads = 0, queued tasks = 0, completed tasks = 129],
credentials=AWSCredentialProviderList name=; refcount= 1; size=1:
[DefaultCredentialsProvider(providerChain=LazyAwsCredentialsProvider(delegate=Lazy(value=AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(),
EnvironmentVariableCredentialsProvider(),
WebIdentityTokenCredentialsProvider(),
ProfileCredentialsProvider(profileName=default),
ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]))))] last
provider:
DefaultCredentialsProvider(providerChain=LazyAwsCredentialsProvider(deleg
ate=Lazy(value=AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(),
EnvironmentVariableCredentialsProvider(),
WebIdentityTokenCredentialsProvider(),
ProfileCredentialsProvider(profileName=default),
ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()])))),
delegation tokens=disabled, DirectoryMarkerRetention{policy='keep'},
instrumentation {S3AInstrumentation{}}, ClientSideEncryption=false}
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.hive.HiveSyncTool - Sync complete for content
2026-02-05 07:19:39,959 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:79) [] - Getting
current Glue AuditContext
2026-02-05 07:19:39,959 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:111) [] - No
AuditContext is present
2026-02-05 07:19:40,117 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:79) [] - Getting
current Glue AuditContext
2026-02-05 07:19:40,118 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:111) [] - No
AuditContext is present
2026-02-05 07:19:40,178 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:79) [] - Getting
current Glue AuditContext
2026-02-05 07:19:40,184 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:111) [] - No
AuditContext is present
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.DataSourceUtils - Getting table path..
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.common.util.TablePathUtils - Getting table path from path :
s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.DefaultSource - Obtained hudi table path:
s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.common.table.HoodieTableConfig - Loading table properties from
s3a://bucket-staging/lakehouse/lh-dynamodb/data/123/content/.hoodie/hoodie.properties
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.DefaultSource - Is bootstrapped table => false, tableType is:
MERGE_ON_READ, queryType is: read_optimized
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.common.table.log.HoodieLogFileReader - Closing Log file reader
.00000000-1248-4c23-88cb-b647eac38862-0_20260205071918459.log.1_0-62-479
2026-02-05 07:19:40,685 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:79) [] - Getting
current Glue AuditContext
2026-02-05 07:19:40,686 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:111) [] - No
AuditContext is present
2026-02-05 07:19:40,780 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:79) [] - Getting
current Glue AuditContext
2026-02-05 07:19:40,782 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:111) [] - No
AuditContext is present
2026-02-05 07:19:40,835 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:79) [] - Getting
current Glue AuditContext
2026-02-05 07:19:40,837 INFO [stream execution thread for [id =
82a0bf9b-64bc-45c4-9072-35f2e5acd3ff, runId =
50dc2060-a949-4d8e-bbce-a6794b24f247]] (AuditContextUtil.java:111) [] - No
AuditContext is present
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.HoodieSparkSqlWriterInternal - Config.inlineCompactionEnabled ?
false
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.HoodieSparkSqlWriterInternal - Config.asyncClusteringEnabled ?
false
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.client.BaseHoodieClient - Stopping Timeline service !!
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.common.util.collection.ExternalSpillableMap -
KeyBasedFileGroupRecordBuffer : Total entries in InMemory map 2, with average
record size as 3672, currentInMemoryMapSize 7344. No entries were spilled to
disk.
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.client.embedded.EmbeddedTimelineService - Closing Timeline
server
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.timeline.service.TimelineService - Closing Timeline Service
with port 33855
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO io.javalin.Javalin -
Stopping Javalin ...
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO io.javalin.Javalin -
Javalin has stopped
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.timeline.service.TimelineService - Closed Timeline Service with
port 33855
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.client.embedded.EmbeddedTimelineService - Closed Timeline server
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.client.transaction.TransactionManager - Transaction manager
closed
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.client.transaction.TransactionManager - Transaction manager
closed
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.hudi.metrics.Metrics - Stopping the metrics reporter...
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] INFO
org.apache.spark.sql.execution.SQLExecution - Skipped
SparkListenerSQLExecutionObfuscatedInfo event due to NON_EMPTY_ERROR.
[stream execution thread for [id = 82a0bf9b-64bc-45c4-9072-35f2e5acd3ff,
runId = 50dc2060-a949-4d8e-bbce-a6794b24f247]] ERROR
com.xyz.lhwriter.writer.HudiDataWriter - Error writing to Hudi for tenant 123
entity content: java.nio.file.AccessDeniedException:
s3a://bucket-perf-scale/lakehouse/dd/data/51/content/.hoodie: getFileStatus on
s3a://bucket-perf-scale/lakehouse/dd/data/51/content/.hoodie:
software.amazon.awssdk.services.s3.model.S3Exception: User:
arn:aws:sts::744500168534:assumed-role/staging_emr_irsa/aws-sdk-java-1770275951000
is not authorized to perform: s3:ListBucket on resource:
"arn:aws:s3:::bucket-perf-scale" because no identity-based policy allows the
s3:ListBucket action (Service: S3, Status Code: 403, Request ID:
98W8TAS2YDPVBCER, Extended Request ID:
IoNmWcgGoL36lqev2rjxSeNYnjtxB/zYbmkX04Mzraco9JXGqMzfflLr+uwBPSizRS94YbOzT6I=)
(SDK Attempt Count: 1):AccessDenied
java.util.concurrent.ExecutionException:
java.nio.file.AccessDeniedException:
s3a://bucket-perf-scale/lakehouse/dd/data/51/content/.hoodie: getFileStatus on
s3a://bucket-perf-scale/lakehouse/dd/data/51/content/.hoodie:
software.amazon.awssdk.services.s3.model.S3Exception: User:
arn:aws:sts::744500168534:assumed-role/staging_emr_irsa/aws-sdk-java-1770275951000
is not authorized to perform: s3:ListBucket on resource:
"arn:aws:s3:::bucket-perf-scale" because no identity-based policy allows the
s3:ListBucket action (Service: S3, Status Code: 403, Request ID:
98W8TAS2YDPVBCER, Extended Request ID:
IoNmWcgGoL36lqev2rjxSeNYnjtxB/zYbmkX04Mzraco9JXGqMzfflLr+uwBPSizRS94YbOzT6I=)
(SDK Attempt Count: 1):AccessDenied
at
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at
org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at
org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at
org.sparkproject.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at
org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:215)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:277)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:353)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:330)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:205)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:77)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:205)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:359)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:203)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:199)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:39)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$4(AnalysisHelper.scala:210)
at
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1422)
at
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1421)
at
org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias.mapChildren(basicLogicalOperators.scala:2369)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:210)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:359)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:203)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:199)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:39)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:134)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:131)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:39)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:85)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:84)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:39)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable.apply(DataSourceStrategy.scala:330)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable.apply(DataSourceStrategy.scala:248)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:232)
at
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:229)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$6(RuleExecutor.scala:312)
at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$RuleExecutionContext$.withContext(RuleExecutor.scala:361)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5(RuleExecutor.scala:312)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5$adapted(RuleExecutor.scala:302)
at scala.collection.immutable.List.foreach(List.scala:431)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:302)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:188)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:184)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.executeSameContext(Analyzer.scala:302)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:298)
at
org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:231)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:298)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:260)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:175)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:175)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:285)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:366)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:284)
at
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:94)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:227)
at
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:293)
at
org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:748)
at
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:293)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:996)
at
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:292)
at
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:94)
at
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:91)
at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:83)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:93)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:996)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:91)
at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:688)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:603)
at
org.apache.spark.sql.internal.CatalogImpl.refreshTable(CatalogImpl.scala:846)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$metaSync$3(HoodieSparkSqlWriter.scala:938)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$metaSync$3$adapted(HoodieSparkSqlWriter.scala:934)
at scala.collection.immutable.List.foreach(List.scala:431)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.metaSync(HoodieSparkSqlWriter.scala:934)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1020)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:548)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$write$1(HoodieSparkSqlWriter.scala:193)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:211)
at
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:133)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:171)
at
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:127)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:391)
at
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:159)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$11(SQLExecution.scala:227)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:391)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$10(SQLExecution.scala:227)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:412)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:226)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:996)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:84)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:124)
at
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:115)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:521)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:77)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:521)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:39)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:303)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:299)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:39)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:39)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:497)
at
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:115)
at
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:102)
at
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:100)
at
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:165)
at
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:884)
at
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:405)
at
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:365)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:244)
at
com.xyz.lhwriter.writer.HudiDataWriter.executeHudiWrite(HudiDataWriter.java:161)
at
com.xyz.lhwriter.writer.HudiDataWriter.writeToHudi(HudiDataWriter.java:75)
at
com.xyz.lhwriter.writer.HudiWriteExecutor.attemptWrite(HudiWriteExecutor.java:73)
at
com.xyz.lhwriter.writer.HudiWriteExecutor.writeWithRetry(HudiWriteExecutor.java:47)
at
com.xyz.lhwriter.writer.SequentialGroupProcessor.processSingleGroup(SequentialGroupProcessor.java:61)
at
com.xyz.lhwriter.writer.SequentialGroupProcessor.processGroups(SequentialGroupProcessor.java:36)
at
com.xyz.lhwriter.writer.HudiWriter.processEntityDataset(HudiWriter.java:394)
at
com.xyz.lhwriter.writer.HudiWriter.lambda$processEntityMapBatch$0(HudiWriter.java:287)
at
com.xyz.lhwriter.writer.EntityParallelProcessor.processEntitiesSequentially(EntityParallelProcessor.java:117)
at
com.xyz.lhwriter.writer.EntityParallelProcessor.processEntities(EntityParallelProcessor.java:92)
at
com.xyz.lhwriter.writer.HudiWriter.processEntityMapBatch(HudiWriter.java:277)
at
com.xyz.lhwriter.writer.HudiWriter.lambda$writeAllEntitiesToHudi$58d5af8e$1(HudiWriter.java:170)
at
org.apache.spark.sql.streaming.DataStreamWriter.$anonfun$foreachBatch$1(DataStreamWriter.scala:505)
at
org.apache.spark.sql.streaming.DataStreamWriter.$anonfun$foreachBatch$1$adapted(DataStreamWriter.scala:505)
at
org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:38)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:732)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:391)
at
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:159)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$11(SQLExecution.scala:227)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:391)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$10(SQLExecution.scala:227)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:412)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:226)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:996)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:84)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:729)
at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)
at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)
at
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:729)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:286)
at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)
at
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)
at
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)
at
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)
at
org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)
at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:996)
at
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)
at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at
org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)
Caused by: java.nio.file.AccessDeniedException:
s3a://bucket-perf-scale/lakehouse/dd/data/51/content/.hoodie: getFileStatus on
s3a://bucket-perf-scale/lakehouse/dd/data/51/content/.hoodie:
software.amazon.awssdk.services.s3.model.S3Exception: User:
arn:aws:sts::744500168534:assumed-role/staging_emr_irsa/aws-sdk-java-1770275951000
is not authorized to perform: s3:ListBucket on resource:
"arn:aws:s3:::bucket-perf-scale" because no identity-based policy allows the
s3:ListBucket action (Service: S3, Status Code: 403, Request ID:
98W8TAS2YDPVBCER, Extended Request ID:
IoNmWcgGoL36lqev2rjxSeNYnjtxB/zYbmkX04Mzraco9JXGqMzfflLr+uwBPSizRS94YbOzT6I=)
(SDK Attempt Count: 1):AccessDenied
at
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:275)
at
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:162)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4541)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:4403)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$exists$34(S3AFileSystem.java:5403)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:3274)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:3293)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:5401)
at
org.apache.hudi.storage.hadoop.HoodieHadoopStorage.exists(HoodieHadoopStorage.java:165)
at
org.apache.spark.sql.hudi.HoodieSqlCommonUtils$.tableExistsInPath(HoodieSqlCommonUtils.scala:222)
at
org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.<init>(HoodieCatalogTable.scala:88)
at
org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable$.apply(HoodieCatalogTable.scala:403)
at
org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable$.apply(HoodieCatalogTable.scala:399)
at
org.apache.hudi.HoodieFileIndex$.getConfigProperties(HoodieFileIndex.scala:540)
at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:99)
at
org.apache.hudi.HoodieCopyOnWriteSnapshotHadoopFsRelationFactory.<init>(HoodieHadoopFsRelationFactory.scala:371)
at
org.apache.hudi.DefaultSource$.createRelation(DefaultSource.scala:341)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:144)
at
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$resolveRelation$6(DataSource.scala:369)
at
org.apache.spark.util.FileAccessContext$.withContext(FileAccessContext.scala:71)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:268)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:255)
at
org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
at
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 168 more
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: User:
arn:aws:sts::744500168534:assumed-role/staging_emr_irsa/aws-sdk-java-1770275951000
is not authorized to perform: s3:ListBucket on resource:
"arn:aws:s3:::bucket-perf-scale" because no identity-based policy allows the
s3:ListBucket action (Service: S3, Status Code: 403, Request ID:
98W8TAS2YDPVBCER, Extended Request ID:
IoNmWcgGoL36lqev2rjxSeNYnjtxB/zYbmkX04Mzraco9JXGqMzfflLr+uwBPSizRS94YbOzT6I=)
(SDK Attempt Count: 1)
at
software.amazon.awssdk.services.s3.model.S3Exception$BuilderImpl.build(S3Exception.java:113)
at
software.amazon.awssdk.services.s3.model.S3Exception$BuilderImpl.build(S3Exception.java:61)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.retryPolicyDisallowedRetryException(RetryableStageHelper.java:168)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:73)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at
software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:53)
at
software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:35)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:82)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:62)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:43)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
at
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
at
software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:210)
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
at
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
at
software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at
software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
at
software.amazon.awssdk.services.s3.DefaultS3Client.listObjectsV2(DefaultS3Client.java:8724)
at
software.amazon.awssdk.services.s3.DelegatingS3Client.lambda$listObjectsV2$69(DelegatingS3Client.java:7076)
at
software.amazon.awssdk.services.s3.internal.crossregion.S3CrossRegionSyncClient.invokeOperation(S3CrossRegionSyncClient.java:74)
at
software.amazon.awssdk.services.s3.DelegatingS3Client.listObjectsV2(DelegatingS3Client.java:7076)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$13(S3AFileSystem.java:3497)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
at
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:431)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:3488)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4516)
... 195 more
```