[ 
https://issues.apache.org/jira/browse/HUDI-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1532:
---------------------------------
    Labels: pull-request-available  (was: )

> Super slow magic sequence search within the log files on GCS 
> -------------------------------------------------------------
>
>                 Key: HUDI-1532
>                 URL: https://issues.apache.org/jira/browse/HUDI-1532
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Volodymyr Burenin
>            Priority: Major
>              Labels: pull-request-available
>
> HudiDeltaStreamer freezes for a very long time(days) when scanning through 
> the log file looking for the magic sequence. Java stacktrace points to this 
> location:
> ```
> "Executor task launch worker for task 183" #233 daemon prio=5 os_prio=0 
> tid=0x00005629fb650000 nid=0xff runnable [0x00007f378a433000]
>    java.lang.Thread.State: RUNNABLE
>       at java.net.SocketInputStream.socketRead0(Native Method)
>       at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>       at java.net.SocketInputStream.read(SocketInputStream.java:171)
>       at java.net.SocketInputStream.read(SocketInputStream.java:141)
>       at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>       at sun.security.ssl.InputRecord.read(InputRecord.java:503)
>       at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
>       - locked <0x00000007ba998608> (a java.lang.Object)
>       at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
>       at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
>       - locked <0x00000007ba9995c0> (a sun.security.ssl.AppInputStream)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>       - locked <0x0000000779e6f360> (a java.io.BufferedInputStream)
>       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
>       at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
>       - locked <0x0000000779e69ba0> (a 
> sun.net.www.protocol.https.DelegateHttpsURLConnection)
>       at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
>       - locked <0x0000000779e69ba0> (a 
> sun.net.www.protocol.https.DelegateHttpsURLConnection)
>       at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>       at 
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:144)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:79)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:995)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:549)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:482)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:510)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6981)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStream(GoogleCloudStorageReadChannel.java:967)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openContentChannel(GoogleCloudStorageReadChannel.java:772)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:763)
>       at 
> com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:365)
>       at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:131)
>       - locked <0x0000000616319fb8> (a 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream)
>       at java.io.DataInputStream.read(DataInputStream.java:149)
>       at java.io.DataInputStream.readFully(DataInputStream.java:195)
>       at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.hasNextMagic(HoodieLogFileReader.java:339)
>       at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.scanForNextAvailableBlockOffset(HoodieLogFileReader.java:280)
>       at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.createCorruptBlock(HoodieLogFileReader.java:221)
>       at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.readBlock(HoodieLogFileReader.java:147)
>       at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:347)
> ```
> After deeper research I discovered it happens due to the unbuffered access to 
> the GCS as well as due to the inefficient algorithm that searches for the 
> magic sequence technically making request to GCS to read next 6 bytes, then 
> offsets +1 and tries again. With 50-60ms latency and 5-6MB file sizes it may 
> take forever.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to