Hfal91 opened a new issue, #13571:
URL: https://github.com/apache/hudi/issues/13571
**Describe the problem you faced**

Using Hudi 0.15 on EMR, a streaming job fails after exactly 1 hour. The issue seems related to HiveSync; it did not happen on 0.14.1.
**with metadata disabled:**
```
25/07/17 13:05:25 ERROR MicroBatchExecution: Query [id = 81196902-5cc5-45b5-86ca-8adcbc5bc236, runId = 9d7e67e3-a38b-4423-a19d-40d47663f944] terminated with error
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 617, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 120, in call
    raise e
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in call
    self.func(DataFrame(jdf, wrapped_session_jdf), batch_id)
  File "/tmp/spark-a13bcd8a-97f4-493e-9632-f1992f945236/advance-curation-Medium.py", line 178, in foreach_batch_function
    avroDf.write \
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1461, in save
    self._jwrite.save()
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o6783.save.
: org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
```
**with metadata enabled:**
```
25/07/17 11:47:48 WARN HoodieSparkSqlWriterInternal: Closing write client
25/07/17 11:48:34 WARN HttpParser: URI is too large >8192
25/07/17 11:48:34 ERROR PriorityBasedFileSystemView: Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: status code: 414, reason phrase: URI Too Long
```
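The 414 indicates the remote (timeline-server-backed) file system view issued an HTTP request whose URI exceeded Jetty's 8192-byte default, presumably because of a long partition list. A hedged mitigation sketch — these are real Hudi config keys, but whether they resolve this particular failure is an assumption, not something confirmed in this report:

```python
# Hedged workaround sketch (assumption, not from the report): avoid the
# remote timeline-server file system view whose GET request overflows
# Jetty's 8 KiB URI limit.
FSVIEW_WORKAROUND = {
    # Build the file system view in memory instead of querying the
    # embedded timeline server over HTTP.
    'hoodie.filesystem.view.type': 'MEMORY',
    # Alternatively, disable the embedded timeline server entirely.
    'hoodie.embed.timeline.server': 'false',
}
```

Either option trades the timeline server's caching for per-executor view construction, so it is a diagnostic step rather than a recommended permanent setting.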
```python
HUDI_OPTIONS = {
    # Writer config
    'hoodie.datasource.write.table.type': INSERT,
    'hoodie.table.name': STREAMING_TABLENAME,
    'hoodie.database.name': DATABASE_NAME,
    'hoodie.datasource.write.recordkey.field': PRIMARY_KEY,
    'hoodie.datasource.write.partitionpath.field': PART_KEYS,
    'hoodie.datasource.write.table.name': STREAMING_TABLENAME,
    'hoodie.datasource.write.operation': COPY_ON_WRITE,
    'hoodie.datasource.write.precombine.field': 'TS',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.write.set.null.for.missing.columns': 'true',
    'hoodie.parquet.compression.codec': 'snappy',
    'hoodie.merge.allow.duplicate.on.inserts': 'true',
    'hoodie.enable.data.skipping': 'true',
    'hoodie.clean.automatic': 'true',
    'hoodie.clean.async': 'false',
    'hoodie.cleaner.commits.retained': '5',
    'hoodie.schema.on.read.enable': 'true',
    'hoodie.parquet.small.file.limit': 64 * 1020 * 1024,
    'hoodie.parquet.max.file.size': 128 * 1020 * 1024,
    'hoodie.index.type': 'RECORD_INDEX',
    # v0.15
    'hoodie.parquet.bloom.filter.enabled': 'false',
    'hoodie.datasource.meta.sync.glue.partition_index_fields.enable': 'true',
    'hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism': 10,
    'hoodie.datasource.hive_sync.ignore_exceptions': 'false',
    'hoodie.metadata.enable': 'false',
    'hoodie.metadata.log.compaction.enable': 'false',
    # Hive Sync config
    'hoodie.datasource.meta.sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': DATABASE_NAME,
    'hoodie.datasource.hive_sync.table': STREAMING_TABLENAME,
    'hoodie.datasource.hive_sync.partition_fields': PART_KEYS,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.skip_ro_suffix': 'true',
    'hoodie.datasource.hive_sync.auto_create_database': 'false',
    'hoodie.datasource.hive_sync.create_managed_table': 'false',
    'hoodie.datasource.hive_sync.ignore_exceptions': 'false',
    'hoodie.datasource.hive_sync.omit_metadata_fields': 'false',
    'hoodie.datasource.hive_sync.support_timestamp': 'true'
}
```
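For context, the traceback above points at an `avroDf.write` call inside `foreach_batch_function` (advance-curation-Medium.py, line 178). A hypothetical reconstruction of that write path follows — the placeholder `HUDI_OPTIONS` stands in for the full dict above, and names like `TABLE_PATH` and the wiring are assumptions, not the reporter's actual code:

```python
# Hypothetical sketch of the micro-batch write path implied by the traceback;
# HUDI_OPTIONS here is a stand-in for the full options dict in the report,
# and TABLE_PATH is an assumed placeholder.
HUDI_OPTIONS = {'hoodie.table.name': 'streaming_table'}
TABLE_PATH = 's3://my-bucket/curated/streaming_table'

def foreach_batch_function(avroDf, batch_id):
    # Each micro-batch is written through the Hudi datasource; a HiveSync
    # failure inside .save() surfaces as the Py4JJavaError shown above.
    (avroDf.write
        .format('hudi')
        .options(**HUDI_OPTIONS)
        .mode('append')
        .save(TABLE_PATH))

# Wired into the stream roughly as:
# (sourceDf.writeStream
#     .foreachBatch(foreach_batch_function)
#     .option('checkpointLocation', CHECKPOINT_PATH)
#     .start())
```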