[
https://issues.apache.org/jira/browse/THRIFT-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James E. King III updated THRIFT-4591:
--------------------------------------
Description:
(jking): C++ TFramedTransport reads the frame size then attempts to read the
message. If it only gets part of the message it returns the partial read, and
the upper layer will not be able to decode the message, further read may be
called again, when it will go and try to read a frame size again, but it could
be in the middle of message payload the underlying transport hadn't yet
received. It's amazing to see this in code that's been around so long!
Original Bug report:
1) realize thrift server with TNonblockingServer via c++;
2) realize thrift client via lua lib and choose frame transport.
3) call remote interface failed with "TTransportException:0: Default (unknown)"
print, and the server show "TConnection::workSocket(): THRIFT_EAGAIN
(unavailable resources)" error.
4)investigate this fault with tcpdump tool, attachment 9090.pcap show the frame
msg doesnot contains frame size field, the rifht situation of attachment
9090_1.pcap show the frame msg contains 4 bytes (00 00 00 25) before protocol
id field.
5) dig into the fault and tried to find root cause, then i found there is an
fault in TFramedTransport:flush function in TFramedTransport.lua file. the
original realization is:
-----
function TFramedTransport:flush()
if self.doWrite == false then
return self.trans:flush()
end
-- If the write fails we still want wBuf to be clear
local tmp = self.wBuf
self.wBuf = ''
local frame_len_buf = libluabpack.bpack("i", string.len(tmp))
self.trans:write(frame_len_buf)
self.trans:write(tmp)
self.trans:flush()
end
-----
which send frame size file and reset msg content independently.
----------------------
(jking) Analysis of original report: it fixes the sender to send once, but it
shouldn't matter if the size is sent separately from the payload. It's the
receiver where the root cause is, in this case the C++ library. This issue may
not be limited to the C++ implementation, but we need a test to insert a pause
between sending a frame size and sending the payload and see what happens on
all the implementations.
We're not going to merge the lua client fix as it doubles the memory
requirements to send, despite reducing the write() count from 2 to 1.
was:
1) realize thrift server with TNonblockingServer via c++;
2) realize thrift client via lua lib and choose frame transport.
3) call remote interface failed with "TTransportException:0: Default (unknown)"
print, and the server show "TConnection::workSocket(): THRIFT_EAGAIN
(unavailable resources)" error.
4)investigate this fault with tcpdump tool, attachment 9090.pcap show the frame
msg doesnot contains frame size field, the rifht situation of attachment
9090_1.pcap show the frame msg contains 4 bytes (00 00 00 25) before protocol
id field.
5) dig into the fault and tried to find root cause, then i found there is an
fault in TFramedTransport:flush function in TFramedTransport.lua file. the
original realization is:
-----
function TFramedTransport:flush()
if self.doWrite == false then
return self.trans:flush()
end
-- If the write fails we still want wBuf to be clear
local tmp = self.wBuf
self.wBuf = ''
local frame_len_buf = libluabpack.bpack("i", string.len(tmp))
self.trans:write(frame_len_buf)
self.trans:write(tmp)
self.trans:flush()
end
-----
which send frame size file and reset msg content independently.
> C++ TFramedTransport fails miserably on partial message read
> ------------------------------------------------------------
>
> Key: THRIFT-4591
> URL: https://issues.apache.org/jira/browse/THRIFT-4591
> Project: Thrift
> Issue Type: Bug
> Components: C++ - Library
> Affects Versions: 0.11.0
> Reporter: allen_lee
> Assignee: James E. King III
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: 9090.pcap, 9090_1.pcap
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> (jking): C++ TFramedTransport reads the frame size then attempts to read the
> message. If it only gets part of the message it returns the partial read,
> and the upper layer will not be able to decode the message, further read may
> be called again, when it will go and try to read a frame size again, but it
> could be in the middle of message payload the underlying transport hadn't yet
> received. It's amazing to see this in code that's been around so long!
> Original Bug report:
> 1) realize thrift server with TNonblockingServer via c++;
> 2) realize thrift client via lua lib and choose frame transport.
> 3) call remote interface failed with "TTransportException:0: Default
> (unknown)" print, and the server show "TConnection::workSocket():
> THRIFT_EAGAIN (unavailable resources)" error.
> 4)investigate this fault with tcpdump tool, attachment 9090.pcap show the
> frame msg doesnot contains frame size field, the rifht situation of
> attachment 9090_1.pcap show the frame msg contains 4 bytes (00 00 00 25)
> before protocol id field.
> 5) dig into the fault and tried to find root cause, then i found there is an
> fault in TFramedTransport:flush function in TFramedTransport.lua file. the
> original realization is:
> -----
> function TFramedTransport:flush()
> if self.doWrite == false then
> return self.trans:flush()
> end
> -- If the write fails we still want wBuf to be clear
> local tmp = self.wBuf
> self.wBuf = ''
> local frame_len_buf = libluabpack.bpack("i", string.len(tmp))
> self.trans:write(frame_len_buf)
> self.trans:write(tmp)
> self.trans:flush()
> end
> -----
> which send frame size file and reset msg content independently.
> ----------------------
> (jking) Analysis of original report: it fixes the sender to send once, but it
> shouldn't matter if the size is sent separately from the payload. It's the
> receiver where the root cause is, in this case the C++ library. This issue
> may not be limited to the C++ implementation, but we need a test to insert a
> pause between sending a frame size and sending the payload and see what
> happens on all the implementations.
> We're not going to merge the lua client fix as it doubles the memory
> requirements to send, despite reducing the write() count from 2 to 1.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)