easonwood opened a new issue, #5290:
URL: https://github.com/apache/hudi/issues/5290
I have a Spark job with the following pipeline:
mysql binlog data -> parquet files on S3 -> Hudi (external tables on S3).
The code looks like this:
val dataDF = spark.read.option("mergeSchema", "true").parquet(parquetPaths: _*)

val hudiOptions = Map(
  "hoodie.metadata.enable" -> "true",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE"
)

dataDF
  .write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(tablePath)
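For context on the read side: mergeSchema unions the schemas of all the parquet files being read. A toy sketch of that union in plain Python (hypothetical schemas; Spark's real merge also reconciles types, this only illustrates the union):

```python
# Toy sketch of schema merging across parquet files: take the union of
# all per-file columns, keeping the first type seen for each column.
# (Illustration only -- not Spark's actual implementation.)
def merge_schemas(file_schemas):
    merged = {}
    for schema in file_schemas:
        for name, dtype in schema:
            merged.setdefault(name, dtype)
    return list(merged.items())

merged = merge_schemas([
    [("id", "long"), ("col1", "string")],  # older file, still has col1
    [("id", "long"), ("col2", "string")],  # newer file, col1 dropped upstream
])
# merged == [("id", "long"), ("col1", "string"), ("col2", "string")]
```

This is why an upstream column deletion is visible to the job at all: the merged read schema keeps the old column as long as any old file still carries it.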
The job ran fine until it hit a MySQL column deletion, at which point it failed with:
Caused by: org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'col1' not found
To handle it, I used the method below:
https://hudi.apache.org/docs/troubleshooting/#caused-by-orgapacheparquetioinvalidrecordexception-parquetavro-schema-mismatch-avro-field-col1-not-found
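The fix described there amounts to reading the new files with the table's full schema, so that columns dropped upstream come back as null. A minimal sketch of that projection in plain Python (hypothetical column names; not Spark or Hudi code):

```python
# Project records onto a fixed target schema, filling columns that no
# longer exist upstream with None -- the same idea as reading parquet
# with an explicit schema that still contains the deleted columns.
def project(records, target_schema):
    return [{col: rec.get(col) for col in target_schema} for rec in records]

# Upstream dropped 'col1'; the table schema still has it.
table_schema = ["id", "col1", "col2"]
new_records = [{"id": 1, "col2": "a"}, {"id": 2, "col2": "b"}]

rows = project(new_records, table_schema)
# Every row now carries col1 explicitly set to None
```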
And changed the code to:
val dataDF = spark.read.schema(tableSchema).parquet(pathsNeedConsume: _*)
// tableSchema contains the deleted old columns.
// dataDF.show here confirms the schema and data are compatible;
// the deleted columns are set to null.

val hudiOptions = Map(
  "hoodie.metadata.enable" -> "true",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE"
)

dataDF
  .write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(tablePath)
But the save now fails with the error:
22/04/11 10:32:17 ERROR BaseTableMetadata: Failed to retrieve files in partition (..... tablePath .....) from metadata
org.apache.hudi.exception.HoodieMetadataException: Metadata record for partition db_cluster=qa01 is inconsistent: HoodieMetadataPayload {key=db_cluster=qa01, type=2,
creations=[5969c7b9-a1f5-4bcb-8382-8809ec0cd067-0_0-85-1188_20220411102336.parquet, 5969c7b9-a1f5-4bcb-8382-8809ec0cd067-0_0-85-1195_20220411101502.parquet, 5969c7b9-a1f5-4bcb-8382-8809ec0cd067-0_0-87-537_20220411092820.parquet ....... many more parquet files in tablePath],
deletions=[5969c7b9-a1f5-4bcb-8382-8809ec0cd067-0_0-29-245_20220331071910.parquet, 5969c7b9-a1f5-4bcb-8382-8809ec0cd067-0_0-29-246_20220331071910.parquet .......... many more parquet files in tablePath], }
at org.apache.hudi.metadata.BaseTableMetadata.fetchAllFilesInPartition(BaseTableMetadata.java:212)
at org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:130)
at org.apache.hudi.metadata.HoodieMetadataFileSystemView.listPartition(HoodieMetadataFileSystemView.java:65)
at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$9(AbstractTableFileSystemView.java:280)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:269)
at org.apache.hudi.common.table.view.AbstractTableFileSystemView.getLatestBaseFilesBeforeOrOn(AbstractTableFileSystemView.java:455)
at org.apache.hudi.timeline.service.handlers.BaseFileHandler.getLatestDataFilesBeforeOrOn(BaseFileHandler.java:57)
at org.apache.hudi.timeline.service.RequestHandler.lambda$registerDataFilesAPI$6(RequestHandler.java:239)
at org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:430)
at io.javalin.core.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22)
at io.javalin.http.JavalinServlet$addHandler$protectedHandler$1.handle(JavalinServlet.kt:116)
at io.javalin.http.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:45)
at io.javalin.http.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:24)
at io.javalin.http.JavalinServlet$service$1.invoke(JavalinServlet.kt:123)
at io.javalin.http.JavalinServlet$service$2.invoke(JavalinServlet.kt:40)
at io.javalin.http.JavalinServlet.service(JavalinServlet.kt:75)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:590)
at org.apache.hudi.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:852)
at org.apache.hudi.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:544)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.apache.hudi.org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1581)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at io.javalin.core.JavalinServer$start$httpHandler$1.doHandle(JavalinServer.kt:53)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.apache.hudi.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:482)
at org.apache.hudi.org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1549)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.apache.hudi.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1204)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.apache.hudi.org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.apache.hudi.org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
at org.apache.hudi.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.apache.hudi.org.eclipse.jetty.server.Server.handle(Server.java:494)
at org.apache.hudi.org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:374)
at org.apache.hudi.org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:268)
at org.apache.hudi.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.apache.hudi.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.apache.hudi.org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.apache.hudi.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
at org.apache.hudi.org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:918)
at java.lang.Thread.run(Thread.java:750)
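For what it's worth, my reading of the error message is that the metadata table tracks each partition's files as a set of creations and a set of deletions, and the merged record is flagged inconsistent when the two sets don't reconcile (e.g. a deletion with no matching creation). A rough sketch of that kind of invariant check, purely to illustrate the message (this is an assumption about the check, not Hudi's actual code):

```python
# Sketch of the invariant the error message suggests: a partition's file
# listing is creations minus deletions, and a deletion that never had a
# matching creation makes the record inconsistent.
# (Assumption for illustration -- not Hudi's implementation.)
def merge_file_listing(creations, deletions):
    created, deleted = set(creations), set(deletions)
    unmatched = deleted - created
    if unmatched:
        raise ValueError(f"inconsistent metadata record: {sorted(unmatched)}")
    return sorted(created - deleted)

files = merge_file_listing(
    ["f1.parquet", "f2.parquet", "f3.parquet"],
    ["f2.parquet"],
)
# files == ["f1.parquet", "f3.parquet"]
```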
* Hudi version: 0.8
* Spark version: 3.1.2
* Storage: S3
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]