[ 
https://issues.apache.org/jira/browse/THRIFT-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arthur Lazzaretti updated THRIFT-5928:
--------------------------------------
    Description: 
*TInputProtocol::skip_till_depth()* fails when skipping a field that contains 
non-UTF-8 binary data.

In {*}src/protocol/mod.rs{*}, the *TType::String* branch calls 
{*}read_string(){*}:
{noformat}
TType::String => self.read_string().map(|_| ()){noformat}
*read_string()* internally does *read_bytes()* followed by 
{*}String::from_utf8(){*}, which rejects any byte sequence that is not valid 
UTF-8. Since *string* and *binary* share the same wire type ({*}TType::String = 
11{*}), *skip()* cannot distinguish between them. When it encounters a binary 
field containing raw bytes. it produces:
{noformat}
ProtocolError { kind: InvalidData, message: "invalid utf-8 sequence of 1 bytes 
from index 1" }{noformat}
 

 *When the bug triggers:* Any time a newer schema adds a *binary* field to a 
Thrift struct - if an older client tries to deserialize the new field it 
doesn't recognize, it hits this from the *skip()* path in 
*read_from_in_protocol* 

 

*Fix:* Inside {*}skip_till_depth(){*}, replace: 
{noformat}
TType::String => self.read_string().map(|_| ()){noformat}
with
{noformat}
TType::String => self.read_bytes().map(|_| ()){noformat}
 

Safe because:
 - The result is discarded immediately ({*}.map(|_| ()){*})

 - *read_string()* is *read_bytes()* + *String::from_utf8()* — both consume 
identical wire bytes

 - No behavioral change for valid UTF-8 string data

 - No effect on parsing of known fields, which use their own typed readers

  was:
*TInputProtocol::skip_till_depth()* fails when skipping a field that contains 
non-UTF-8 binary data.

 

In *src/protocol/mod.rs*, the *TType::String* branch calls *read_string()*:

 
{noformat}
TType::String => self.read_string().map(|_| ()),{noformat}
 

 

 

*read_string()* internally does *read_bytes()* followed by 
*String::from_utf8()*, which rejects any byte sequence that is not valid UTF-8. 
Since *string* and *binary* share the same wire type (*TType::String = 11*), 
*skip()* cannot distinguish between them. When it encounters a binary field 
containing raw bytes. it produces:

 
{noformat}
ProtocolError { kind: InvalidData, message: "invalid utf-8 sequence of 1 bytes 
from index 1" }{noformat}
 

 

 ***When this triggers:*** Any time a newer schema adds a *binary* field to a 
Thrift struct - if an older client tries to deserialize the new field it 
doesn't recognize, it hits this from the *skip()* path in 
*read_from_in_protocol* 

 

***Fix:*** Inside {*}skip_till_depth(){*}, replace: 

 
{noformat}
TType::String => self.read_string().map(|_| ()){noformat}
 

with

 
{noformat}
TType::String => self.read_bytes().map(|_| ()){noformat}
 

 

This is safe because:

- The result is discarded immediately (*.map(|_| ())*)

- *read_string()* is *read_bytes()* + *String::from_utf8()* — both consume 
identical wire bytes

- No behavioral change for valid UTF-8 string data

- No effect on parsing of known fields, which use their own typed readers


> skip() call on unknown binary field fails deserialization instead of graceful 
> skipping over field
> -------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-5928
>                 URL: https://issues.apache.org/jira/browse/THRIFT-5928
>             Project: Thrift
>          Issue Type: Bug
>          Components: Rust - Library
>    Affects Versions: 0.17.0, 0.21.0
>            Reporter: Arthur Lazzaretti
>            Priority: Major
>
> *TInputProtocol::skip_till_depth()* fails when skipping a field that contains 
> non-UTF-8 binary data.
> In {*}src/protocol/mod.rs{*}, the *TType::String* branch calls 
> {*}read_string(){*}:
> {noformat}
> TType::String => self.read_string().map(|_| ()){noformat}
> *read_string()* internally does *read_bytes()* followed by 
> {*}String::from_utf8(){*}, which rejects any byte sequence that is not valid 
> UTF-8. Since *string* and *binary* share the same wire type ({*}TType::String 
> = 11{*}), *skip()* cannot distinguish between them. When it encounters a 
> binary field containing raw bytes. it produces:
> {noformat}
> ProtocolError { kind: InvalidData, message: "invalid utf-8 sequence of 1 
> bytes from index 1" }{noformat}
>  
>  *When the bug triggers:* Any time a newer schema adds a *binary* field to a 
> Thrift struct - if an older client tries to deserialize the new field it 
> doesn't recognize, it hits this from the *skip()* path in 
> *read_from_in_protocol* 
>  
> *Fix:* Inside {*}skip_till_depth(){*}, replace: 
> {noformat}
> TType::String => self.read_string().map(|_| ()){noformat}
> with
> {noformat}
> TType::String => self.read_bytes().map(|_| ()){noformat}
>  
> Safe because:
>  - The result is discarded immediately ({*}.map(|_| ()){*})
>  - *read_string()* is *read_bytes()* + *String::from_utf8()* — both consume 
> identical wire bytes
>  - No behavioral change for valid UTF-8 string data
>  - No effect on parsing of known fields, which use their own typed readers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to