[ 
https://issues.apache.org/jira/browse/THRIFT-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621936#comment-16621936
 ] 

James E. King III commented on THRIFT-4638:
-------------------------------------------

Excellent analysis and write-up!  I agree with your assumption that a file 
based serdes for thrift should be compatible across languages. I think it would 
be a very good idea to add file transport to the list of cross-tested 
transports.  The crosstest would have picked this up if it were there.  In the 
meantime, it would be good to look at Java's file transport (assuming there is 
one) and see if it implements the event concept or not.  

> When I use Thrift to serialize map in C++ to disk, and then de-serialize it 
> using Python, I do not get back the same object
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-4638
>                 URL: https://issues.apache.org/jira/browse/THRIFT-4638
>             Project: Thrift
>          Issue Type: Bug
>          Components: C++ - Library, Python - Library
>    Affects Versions: 1.0
>            Reporter: Bruno Rijsman
>            Priority: Major
>
> Summary: when I use Thrift to serialize map in C++ to disk, and then 
> de-serialize it using Python, I do not get back the same object.
>  
> See also 
> https://stackoverflow.com/questions/52377410/when-i-use-thrift-to-serialize-map-in-c-to-disk-and-then-de-serialize-it-usin
> I have the following Thrift model file, which (as you can see) contains a map 
> indexed by a struct:
> struct Coordinate {
>  1: required i32 x;
>  2: required i32 y;
>  }
> struct Terrain {
>  1: required map<Coordinate, i32> altitude_samples;
>  }
> I use the following C++ code to create an object with 3 coordinates in the 
> map (see the repo for complete code for all snippets below):
> Terrain terrain;
>  add_sample_to_terrain(terrain, 10, 10, 100);
>  add_sample_to_terrain(terrain, 20, 20, 200);
>  add_sample_to_terrain(terrain, 30, 30, 300);
> where:
> void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t 
> altitude)
>  {
>  Coordinate coordinate;
>  coordinate.x = x;
>  coordinate.y = y;
>  std::pair<Coordinate, int32_t> sample(coordinate, altitude);
>  terrain.altitude_samples.insert(sample);
>  }
> I use the following C++ code to serialize an object to disk:
> shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
>  shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
>  terrain.write(protocol.get());
> Important note: for this to work correctly, I had to implement the function 
> Coordinate::operator<. Thrift does generate the declaration for the 
> Coordinate::operator< but does not generate the implementation of 
> Coordinate::operator<. The reason for this is that Thrift does not understand 
> the semantics of the struct and hence cannot guess the correct implementation 
> of the comparison operator. This is discussed at 
> http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%[email protected]%3E
> // Thrift generates the declaration but not the implementation of operator< 
> because it has no way
>  // of knowning what the criteria for the comparison are. So, provide the 
> implementation here.
>  bool Coordinate::operator<(const Coordinate& other) const
>  {
>  if (x < other.x) {
>  return true;
>  } else if (x > other.x) {
>  return false;
>  } else if (y < other.y) {
>  return true;
>  } else {
>  return false;
>  }
>  }
> Then, finally, I use the following Python code to de-serialize the same 
> object from disk:
> file = open("terrain.dat", "rb")
>  transport = thrift.transport.TTransport.TFileObjectTransport(file)
>  protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
>  terrain = Terrain()
>  terrain.read(protocol)
>  print(terrain)
> This Python program outputs:
> Terrain(altitude_samples=None)
> In other words, the de-serialized Terrain contains no terrain_samples, 
> instead of the expected dictionary with 3 coordinates.
> I am 100% sure that the file terrain.dat contains valid data: I also 
> de-serialized the same data using C++ and in that case, I *do* get the 
> expected results (see repo for details)
> I suspect that this has something to do with the comparison operator.
> I gut feeling is that I should have done something similar in Python with 
> respect to the comparison operator as I did in C++. But I don't know what 
> that missing something would be.
> Additional information added on 19-Sep-2018:
> Here is a hexdump of the encoding produced by the C++ encoding program:
> Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 
>  00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
>  00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
>  00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
>  00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
>  00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
>  00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
>  00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
>  00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
>  00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
>  00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
>  000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
>  000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....
> The first 4 bytes are 01 00 00 00
> Using a debugger to step through the Python decoding function reveals that:
> * This is being decoded as a struct (which is expected)
> * The first byte 01 is interpreted as the field type. 01 means field type 
> VOID.
> * The next two bytes are interpreted as the field id. 00 00 means field ID 0.
> * For field type VOID, nothing else is read and we continue to the next field.
> * The next byte is interpreted as the field type. 00 means STOP.
> * We top reading data for the struct.
> * The final result is an empty struct.
> All off the above is consistent with the information at 
> https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
>  which describes the Thrift binary encoding format
> My conclusion thus far is that the C++ encoder appears to produce an 
> "incorrect" binary encoding (I put incorrect in quotes because certainly 
> something as blatant as that would have been discovered by lots of other 
> people, so I am sure that I am still missing something).
> Additional information added on 19-Sep-2018:
> It appears that the C++ implementation of TFileTransport has the concept of 
> "events" when writing to disk.
> The output which is written to disk is divided into a sequence of "events" 
> where each "event" is preceded by a 4-byte length field of the event, 
> followed by the contents of the event.
> Looking at the hexdump above, the first couple of events are:
> `0100 0000 0d` : Event length 1, event value `0d`
> `02 0000 0000 01` : Event length 2, event value `00 01`
> Etc.
> The Python implementation of TFileTransport does not understand this concept 
> of events when parsing the file.
> It appears that the problem is one of the following two:
> 1) Either the C++ code should not be inserting these event lengths into the 
> encoded file,
> 2) Or the Python code should understand these event lengths when decoding the 
> file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to