stackfun opened a new issue #2367:
URL: https://github.com/apache/hudi/issues/2367
**Describe the problem you faced**
Snapshot queries on MOR tables fail on GCP (GCS storage) while reading Avro log files.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a fairly large MOR table on GCS (a large table seems to force the use of the buffered input stream).
2. Run a snapshot query on the MOR table using the Spark datasource.
**Expected behavior**
Snapshot query returns successfully
**Environment Description**
* Hudi version : 0.6.0
* Dataproc 1.4
* Spark version : 2.4.5
* Hive version : 2.3.7
* Hadoop version : 2.9.2
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : no
**Stacktrace**
```
Caused by: java.io.EOFException: Invalid seek offset: position value (18631040) must be between 0 and 18631040 for 'gs://useast1-gcs-dev-sed-certrep-01/sdl/star_certrep.db/test-yo/metadata_precert/cr_cert_type=leaf/cr_expiration_month=2022-01/.22f13c15-861f-4c2b-b976-5473307af9e5-0_20201221203356.log.1_121-61-30572'
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.validatePosition(GoogleCloudStorageReadChannel.java:710)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.position(GoogleCloudStorageReadChannel.java:597)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.seek(GoogleHadoopFSInputStream.java:198)
    at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:96)
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
    at org.apache.hudi.common.table.log.block.HoodieLogBlock.safeSeek(HoodieLogBlock.java:267)
    at org.apache.hudi.common.table.log.block.HoodieLogBlock.readOrSkipContent(HoodieLogBlock.java:225)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.createCorruptBlock(HoodieLogFileReader.java:226)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.readBlock(HoodieLogFileReader.java:146)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:346)
    ... 25 more
```
Here is `FSUtils.isGCSInputStream`, which is called by `HoodieLogBlock.safeSeek`:
```java
public static boolean isGCSInputStream(FSDataInputStream inputStream) {
  return inputStream.getClass().getCanonicalName()
      .equals("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream")
      || inputStream.getWrappedStream().getClass().getCanonicalName()
      .equals("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream");
}
```
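In my call stack, the input stream's canonical name is
`org.apache.hadoop.fs.FSDataInputStream` and its wrapped stream is
`org.apache.hadoop.fs.BufferedFSInputStream`. The `GoogleHadoopFSInputStream`
sits one level deeper, inside the `BufferedFSInputStream` (see the `seek` frames
in the stack trace). The function only checks the outer stream and its
immediate wrapped stream, so it fails to detect that the underlying stream is
a `GoogleHadoopFSInputStream`.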
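
To illustrate the nesting problem, here is a minimal, self-contained sketch. It does not use the real Hadoop or GCS connector classes: `WrappingStream`, `FakeWrapper` and `FakeGcsStream` are hypothetical stand-ins I made up for the example, and it uses `isInstance` instead of comparing canonical names. I have not checked whether the real `BufferedFSInputStream` exposes its inner stream through a public accessor, so treat this as a model of the detection logic and not as a patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class GcsStreamDetectionSketch {

  // Hypothetical interface: a stream that exposes the stream it wraps.
  interface WrappingStream {
    InputStream getWrappedStream();
  }

  // Stand-in for GoogleHadoopFSInputStream (the innermost stream).
  static class FakeGcsStream extends ByteArrayInputStream {
    FakeGcsStream() {
      super(new byte[0]);
    }
  }

  // Stand-in for both FSDataInputStream and BufferedFSInputStream.
  static class FakeWrapper extends InputStream implements WrappingStream {
    private final InputStream inner;

    FakeWrapper(InputStream inner) {
      this.inner = inner;
    }

    @Override
    public InputStream getWrappedStream() {
      return inner;
    }

    @Override
    public int read() throws IOException {
      return inner.read();
    }
  }

  // Models the quoted check: inspects the stream itself and one wrapped level only.
  static boolean oneLevelCheck(InputStream stream, Class<?> target) {
    if (target.isInstance(stream)) {
      return true;
    }
    return stream instanceof WrappingStream
        && target.isInstance(((WrappingStream) stream).getWrappedStream());
  }

  // Walks every wrapping level until the target type is found or nothing is left to unwrap.
  static boolean anyLevelCheck(InputStream stream, Class<?> target) {
    InputStream current = stream;
    while (current != null) {
      if (target.isInstance(current)) {
        return true;
      }
      current = current instanceof WrappingStream
          ? ((WrappingStream) current).getWrappedStream()
          : null;
    }
    return false;
  }

  public static void main(String[] args) {
    // Same shape as in the stack trace: outer wrapper -> buffered wrapper -> GCS stream.
    InputStream nested = new FakeWrapper(new FakeWrapper(new FakeGcsStream()));
    System.out.println(oneLevelCheck(nested, FakeGcsStream.class)); // false
    System.out.println(anyLevelCheck(nested, FakeGcsStream.class)); // true
  }
}
```

With two levels of wrapping, the one-level check misses the innermost stream while the loop finds it, which matches what I see with the buffered input stream in between.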