We’re using the Iceberg API (0.11.1) over raw Parquet data in S3/EMRFS, essentially just using the table API to issue overwrites and appends. Everything works well for the most part, but we’ve recently started to have problems with the Iceberg metadata directory going out of sync. See the following stack trace:
org.apache.iceberg.exceptions.RuntimeIOException: Failed to read file: s3://mybucket/db/table/metadata/v2504.metadata.json
    at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:241)
    at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:233)
    at org.apache.iceberg.hadoop.HadoopTableOperations.updateVersionAndMetadata(HadoopTableOperations.java:93)
    at org.apache.iceberg.hadoop.HadoopTableOperations.refresh(HadoopTableOperations.java:116)
    at org.apache.iceberg.hadoop.HadoopTableOperations.current(HadoopTableOperations.java:80)
    at org.apache.iceberg.hadoop.HadoopTables.load(HadoopTables.java:86)
    at com.braintree.data.common.snapshot.iceberg.IcebergUtils$Builder.load(IcebergUtils.java:639)
    at com.braintree.data.snapshot.actions.UpdateTableMetadata.run(UpdateTableMetadata.java:53)
    at com.braintree.data.snapshot.actions.UpdateMetastore.lambda$run$0(UpdateMetastore.java:104)
    at com.braintree.data.base.util.StreamUtilities.lambda$null$7(StreamUtilities.java:306)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Unexpected end of stream pos=0, contentLength=214601
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:297)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.iceberg.hadoop.HadoopStreams$HadoopSeekableInputStream.read(HadoopStreams.java:113)
    at org.apache.iceberg.shaded.com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:524)
    at org.apache.iceberg.shaded.com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:129)
    at org.apache.iceberg.shaded.com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:247)
    at org.apache.iceberg.shaded.com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1481)
    at org.apache.iceberg.shaded.com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:972)
    at org.apache.iceberg.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3242)
    at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:239)
    ... 15 more
Caused by: com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: eTag in metadata for File 'mybucket/db/table/metadata/v2504.metadata.json' does not match eTag from S3!
    at com.amazon.ws.emr.hadoop.fs.s3.GetObjectInputStreamWithInfoFactory.create(GetObjectInputStreamWithInfoFactory.java:69)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.open(S3FSInputStream.java:200)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:391)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:378)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:260)
    ... 27 more

Earlier in my logs, I see the following similar warning:

21/06/08 23:20:32 pool-117-thread-1 WARN HadoopTableOperations: Error reading version hint file s3://mybucket/db/table/metadata/version-hint.text
java.io.IOException: Unexpected end of stream pos=0, contentLength=4
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:297)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at org.apache.iceberg.hadoop.HadoopTableOperations.findVersion(HadoopTableOperations.java:318)
    at org.apache.iceberg.hadoop.HadoopTableOperations.refresh(HadoopTableOperations.java:99)
    at org.apache.iceberg.hadoop.HadoopTableOperations.current(HadoopTableOperations.java:80)
    at org.apache.iceberg.hadoop.HadoopTables.load(HadoopTables.java:86)
    ... INTERNAL STUFF ...
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: eTag in metadata for File 'mybucket/db/table/metadata/version-hint.text' does not match eTag from S3!
    at com.amazon.ws.emr.hadoop.fs.s3.GetObjectInputStreamWithInfoFactory.create(GetObjectInputStreamWithInfoFactory.java:69)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.open(S3FSInputStream.java:200)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:391)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:378)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:260)
    ... 25 more

This only happens every once in a while, so my best guess is that there's some eventual-consistency problem, or perhaps an issue with retry logic. My question is: is there a correct way of using Iceberg on EMRFS? FWIW, I haven't included the AWS v2 SDK on my classpath.
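In case the failure really is transient, the workaround we've been considering is just retrying the table load ourselves a few times before giving up. A minimal, stdlib-only sketch of what I mean (all names here are ours/hypothetical, nothing Iceberg-specific; in our code the `Callable` would wrap the `HadoopTables.load(...)` call that throws above):

```java
import java.util.concurrent.Callable;

// Hypothetical workaround sketch: retry an operation (e.g. a wrapped table
// load) up to maxAttempts times with linear backoff, rethrowing the last
// failure if every attempt fails. Intended for transient read errors like
// the EMRFS eTag ConsistencyException in the stack traces above.
public final class RetryingLoader {

    public static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // e.g. RuntimeIOException wrapping the EMRFS read failure
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // back off a bit longer each time
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

This obviously papers over the symptom rather than fixing the underlying metadata/eTag mismatch, which is why I'm asking whether there's a supported way to do this instead.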