leixm opened a new issue, #239:
URL: https://github.com/apache/incubator-uniffle/issues/239

   Recently, we found that the following exception occasionally occurs in our 
production environment: `Blocks read inconsistent: expectedxxxxxx`. After 
investigating, we believe there is a problem in how the client reads from 
HDFS.
   Below are the relevant cluster configuration and our troubleshooting 
notes:
   Environment
   1. rss.client.read.buffer.size=14m
   2. storageType=MEMORY_LOCALFILE_HDFS
   3. io.file.buffer.size=65536 (core-site.xml on ShuffleServer)
   
   
   We examined the files of an abnormal partition to see what the 
ShuffleServer and the client did at each point in time.
   The results are as follows.
   
   1. 2022-09-20 00:45:07 The partition data file is created on HDFS.
   2. 2022-09-20 00:45:12 The first batch of 1690 blocks is flushed and the 
output stream is closed. At this point the size of the data file is 63101342 
bytes.
   3. 2022-09-20 01:26:56 The ShuffleServer appends to the partition data file.
   4. 2022-09-20 01:27:12 getInMemory returns a total of 310 blocks.
   5. 2022-09-20 01:27:13 The following exceptions appear in sequence: Read 
index data under flow, an EOFException while reading the data file (**with 
offset[58798659], length[4743511]**), and Blocks read inconsistent. The task 
is retried four times within 10 seconds and still fails, which eventually 
causes the app to fail.
   6. 2022-09-20 01:27:37 The remaining 310 blocks are flushed and the output 
stream is closed.
   
   Read segments
   
   1. offset=0, length=14709036
   2. offset=14709036, length=14698098
   3. offset=29407134, length=14700350
   4. offset=44107484, length=14691175
   5. offset=58798659, length=4743511
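
   As a quick sanity check on the list above, the following sketch verifies that the segments are contiguous and computes where the last one ends. The segment values are copied from this issue; reading `14m` as 14 MiB is an assumption, and the helper names are hypothetical, not part of Uniffle.

```python
# Read segments copied from the issue: (offset, length) in bytes.
SEGMENTS = [
    (0, 14709036),
    (14709036, 14698098),
    (29407134, 14700350),
    (44107484, 14691175),
    (58798659, 4743511),
]

# Assumption: rss.client.read.buffer.size=14m means 14 MiB.
READ_BUFFER_SIZE = 14 * 1024 * 1024


def is_contiguous(segments):
    """Return True if every segment starts exactly where the previous one ends."""
    expected_offset = 0
    for offset, length in segments:
        if offset != expected_offset:
            return False
        expected_offset = offset + length
    return True


def index_end(segments):
    """Return the end offset implied by the last segment."""
    offset, length = segments[-1]
    return offset + length


if __name__ == "__main__":
    print("contiguous:", is_contiguous(SEGMENTS))
    print("end offset implied by index:", index_end(SEGMENTS))
    for offset, length in SEGMENTS[:-1]:
        # How far each full segment goes past the assumed buffer size.
        print(offset, length - READ_BUFFER_SIZE)
```

   Under that assumption, each of the first four segments is slightly larger than the read buffer size, and only the last segment is a partial one.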
   
   The key point is that the data file has only 63101342 bytes, but the 
client tries to read 4743511 bytes starting at offset 58798659, that is, up 
to offset 63542170. This leads to the EOFException. As a result, the blocks 
between offsets 58798659 and 63101342 are discarded, which leads to missing 
blocks.
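
   The arithmetic behind this can be written out as a small sketch. The numbers come from the timeline and the read segments in this issue; the function name is hypothetical and only illustrates the bounds check.

```python
# Values taken from the issue.
DATA_FILE_SIZE = 63101342  # size of the data file after the first flush
LAST_OFFSET = 58798659     # offset of the last read segment
LAST_LENGTH = 4743511      # length of the last read segment


def readable_bytes(file_size, offset, length):
    """Return how many bytes of [offset, offset + length) exist in the file."""
    end = min(offset + length, file_size)
    return max(0, end - offset)


requested_end = LAST_OFFSET + LAST_LENGTH
available = readable_bytes(DATA_FILE_SIZE, LAST_OFFSET, LAST_LENGTH)
missing = LAST_LENGTH - available

if __name__ == "__main__":
    print("requested end offset:", requested_end)
    print("bytes available in the segment:", available)
    print("bytes requested past end of file:", missing)
```

   The last segment asks for more bytes than the file holds, so a read that requires the full length fails even though most of the segment is present.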
   
   Another question is why the index file lists more blocks than the data 
file actually contains. This depends on the buffering of the HDFS client: 
while data is still being flushed, we cannot guarantee that the number of 
blocks in the index file matches the number of blocks in the data file.
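
   One possible guard on the reader side, shown purely as an illustration, is to trust only the index entries that are fully covered by the current data file length. This is a sketch under assumptions: the `IndexEntry` type, the function name, and the sample values are hypothetical and do not reflect Uniffle's actual classes or any fix adopted for this issue.

```python
from typing import List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    """Hypothetical index record: where one block lives in the data file."""
    block_id: int
    offset: int
    length: int


def split_by_data_length(
    entries: List[IndexEntry], data_file_len: int
) -> Tuple[List[IndexEntry], List[IndexEntry]]:
    """Split entries into those fully inside the data file and those not yet flushed."""
    readable, pending = [], []
    for entry in entries:
        if entry.offset + entry.length <= data_file_len:
            readable.append(entry)
        else:
            pending.append(entry)
    return readable, pending


if __name__ == "__main__":
    # Made-up example: the index is ahead of a 250-byte data file.
    entries = [
        IndexEntry(block_id=1, offset=0, length=100),
        IndexEntry(block_id=2, offset=100, length=100),
        IndexEntry(block_id=3, offset=200, length=100),
    ]
    readable, pending = split_by_data_length(entries, data_file_len=250)
    print("readable:", [e.block_id for e in readable])
    print("pending:", [e.block_id for e in pending])
```

   With such a check, blocks whose data has not reached the file yet would be left for a later read or another storage tier instead of causing the whole segment to fail.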
   

