[
https://issues.apache.org/jira/browse/HUDI-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-1532:
---------------------------------
Labels: pull-request-available (was: )
> Super slow magic sequence search within the log files on GCS
> -------------------------------------------------------------
>
> Key: HUDI-1532
> URL: https://issues.apache.org/jira/browse/HUDI-1532
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: Volodymyr Burenin
> Priority: Major
> Labels: pull-request-available
>
> HoodieDeltaStreamer freezes for a very long time (days) when scanning through
> a log file looking for the magic sequence. The Java stack trace points to
> this location:
> ```
> "Executor task launch worker for task 183" #233 daemon prio=5 os_prio=0
> tid=0x00005629fb650000 nid=0xff runnable [0x00007f378a433000]
> java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.read(InputRecord.java:503)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
> - locked <0x00000007ba998608> (a java.lang.Object)
> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
> at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
> - locked <0x00000007ba9995c0> (a sun.security.ssl.AppInputStream)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> - locked <0x0000000779e6f360> (a java.io.BufferedInputStream)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
> - locked <0x0000000779e69ba0> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
> - locked <0x0000000779e69ba0> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:144)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:79)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:995)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:549)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:482)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:510)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6981)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStream(GoogleCloudStorageReadChannel.java:967)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openContentChannel(GoogleCloudStorageReadChannel.java:772)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:763)
> at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:365)
> at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:131)
> - locked <0x0000000616319fb8> (a com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at java.io.DataInputStream.readFully(DataInputStream.java:195)
> at org.apache.hudi.common.table.log.HoodieLogFileReader.hasNextMagic(HoodieLogFileReader.java:339)
> at org.apache.hudi.common.table.log.HoodieLogFileReader.scanForNextAvailableBlockOffset(HoodieLogFileReader.java:280)
> at org.apache.hudi.common.table.log.HoodieLogFileReader.createCorruptBlock(HoodieLogFileReader.java:221)
> at org.apache.hudi.common.table.log.HoodieLogFileReader.readBlock(HoodieLogFileReader.java:147)
> at org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:347)
> ```
> After deeper research I discovered that this happens due to unbuffered access
> to GCS, combined with an inefficient search algorithm: at every candidate
> offset it makes a request to GCS to read the next 6 bytes, then advances the
> offset by 1 and tries again. With 50-60 ms per-request latency and 5-6 MB
> log files, the scan may take practically forever.
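The access pattern described above can be sketched as follows. This is an illustrative model, not Hudi's actual HoodieLogFileReader code: `RemoteFile`, `readCount`, and both scan methods are invented stand-ins, with each `read(...)` call representing one GCS round trip (~50-60 ms per the report). The buffered variant shows why fetching large chunks collapses the round-trip count.

```java
import java.util.Arrays;

public class MagicScanSketch {
    // Stand-in magic sequence; Hudi's real log-block magic differs.
    static final byte[] MAGIC = {'#', 'H', 'U', 'D', 'I', '#'};

    // Simulated remote object: every read(...) counts as one round trip.
    static class RemoteFile {
        final byte[] data;
        long readCount = 0;
        RemoteFile(byte[] data) { this.data = data; }
        int read(long pos, byte[] buf) {
            readCount++;
            if (pos >= data.length) return -1;
            int n = (int) Math.min(buf.length, data.length - pos);
            System.arraycopy(data, (int) pos, buf, 0, n);
            return n;
        }
    }

    // Naive scan, as reported: one remote read of 6 bytes per candidate offset.
    static long naiveScan(RemoteFile f) {
        byte[] six = new byte[MAGIC.length];
        for (long pos = 0; ; pos++) {
            if (f.read(pos, six) < MAGIC.length) return -1;
            if (Arrays.equals(six, MAGIC)) return pos;
        }
    }

    // Buffered scan: fetch large chunks, search locally, and overlap chunks by
    // 5 bytes so a magic sequence straddling a chunk boundary is not missed.
    static long bufferedScan(RemoteFile f, int chunkSize) {
        byte[] buf = new byte[chunkSize];
        long base = 0;
        while (true) {
            int n = f.read(base, buf);
            if (n < MAGIC.length) return -1;
            for (int i = 0; i + MAGIC.length <= n; i++) {
                if (Arrays.equals(Arrays.copyOfRange(buf, i, i + MAGIC.length), MAGIC)) {
                    return base + i;
                }
            }
            base += n - (MAGIC.length - 1);  // re-read the boundary bytes
        }
    }

    // 64 KiB of "corrupt" bytes with the magic sequence planted near the end.
    static byte[] sampleFile() {
        byte[] file = new byte[64 * 1024];
        System.arraycopy(MAGIC, 0, file, 50_000, MAGIC.length);
        return file;
    }

    public static void main(String[] args) {
        RemoteFile a = new RemoteFile(sampleFile());
        RemoteFile b = new RemoteFile(sampleFile());
        System.out.println("naive:    offset=" + naiveScan(a) + " reads=" + a.readCount);
        System.out.println("buffered: offset=" + bufferedScan(b, 8 * 1024) + " reads=" + b.readCount);
    }
}
```

On this toy 64 KiB file the naive scan issues tens of thousands of round trips while the buffered scan issues a handful; at 50-60 ms each, the difference on a multi-megabyte log file is hours versus seconds.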
--
This message was sent by Atlassian Jira
(v8.3.4#803005)