[jira] [Updated] (THRIFT-4638) When I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object

James E. King III (JIRA) Sat, 26 Jan 2019 06:09:03 -0800


     [ 
https://issues.apache.org/jira/browse/THRIFT-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


James E. King III updated THRIFT-4638:
--------------------------------------
    Affects Version/s:     (was: 1.0)

This issue was marked as affecting version 1.0 which is not possible.   If you 
know which version this issue was discovered in, please mark it as such.

> When I use Thrift to serialize map in C++ to disk, and then de-serialize it 
> using Python, I do not get back the same object
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-4638
>                 URL: https://issues.apache.org/jira/browse/THRIFT-4638
>             Project: Thrift
>          Issue Type: Bug
>          Components: C++ - Library, Python - Library
>            Reporter: Bruno Rijsman
>            Priority: Major
>
> Summary: when I use Thrift to serialize map in C++ to disk, and then 
> de-serialize it using Python, I do not get back the same object.
>  
> See also 
> https://stackoverflow.com/questions/52377410/when-i-use-thrift-to-serialize-map-in-c-to-disk-and-then-de-serialize-it-usin
> I have the following Thrift model file, which (as you can see) contains a map 
> indexed by a struct:
> struct Coordinate {
>  1: required i32 x;
>  2: required i32 y;
>  }
> struct Terrain {
>  1: required map<Coordinate, i32> altitude_samples;
>  }
> I use the following C++ code to create an object with 3 coordinates in the 
> map (see the repo for complete code for all snippets below):
> Terrain terrain;
>  add_sample_to_terrain(terrain, 10, 10, 100);
>  add_sample_to_terrain(terrain, 20, 20, 200);
>  add_sample_to_terrain(terrain, 30, 30, 300);
> where:
> void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t 
> altitude)
>  {
>  Coordinate coordinate;
>  coordinate.x = x;
>  coordinate.y = y;
>  std::pair<Coordinate, int32_t> sample(coordinate, altitude);
>  terrain.altitude_samples.insert(sample);
>  }
> I use the following C++ code to serialize an object to disk:
> shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
>  shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
>  terrain.write(protocol.get());
> Important note: for this to work correctly, I had to implement the function 
> Coordinate::operator<. Thrift does generate the declaration for the 
> Coordinate::operator< but does not generate the implementation of 
> Coordinate::operator<. The reason for this is that Thrift does not understand 
> the semantics of the struct and hence cannot guess the correct implementation 
> of the comparison operator. This is discussed at 
> http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%[email protected]%3E
> // Thrift generates the declaration but not the implementation of operator< 
> because it has no way
>  // of knowning what the criteria for the comparison are. So, provide the 
> implementation here.
>  bool Coordinate::operator<(const Coordinate& other) const
>  {
>  if (x < other.x) {
>  return true;
>  } else if (x > other.x) {
>  return false;
>  } else if (y < other.y) {
>  return true;
>  } else {
>  return false;
>  }
>  }
> Then, finally, I use the following Python code to de-serialize the same 
> object from disk:
> file = open("terrain.dat", "rb")
>  transport = thrift.transport.TTransport.TFileObjectTransport(file)
>  protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
>  terrain = Terrain()
>  terrain.read(protocol)
>  print(terrain)
> This Python program outputs:
> Terrain(altitude_samples=None)
> In other words, the de-serialized Terrain contains no terrain_samples, 
> instead of the expected dictionary with 3 coordinates.
> I am 100% sure that the file terrain.dat contains valid data: I also 
> de-serialized the same data using C++ and in that case, I *do* get the 
> expected results (see repo for details)
> I suspect that this has something to do with the comparison operator.
> I gut feeling is that I should have done something similar in Python with 
> respect to the comparison operator as I did in C++. But I don't know what 
> that missing something would be.
> Additional information added on 19-Sep-2018:
> Here is a hexdump of the encoding produced by the C++ encoding program:
> Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 
>  00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
>  00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
>  00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
>  00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
>  00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
>  00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
>  00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
>  00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
>  00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
>  00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
>  000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
>  000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....
> The first 4 bytes are 01 00 00 00
> Using a debugger to step through the Python decoding function reveals that:
> * This is being decoded as a struct (which is expected)
> * The first byte 01 is interpreted as the field type. 01 means field type 
> VOID.
> * The next two bytes are interpreted as the field id. 00 00 means field ID 0.
> * For field type VOID, nothing else is read and we continue to the next field.
> * The next byte is interpreted as the field type. 00 means STOP.
> * We top reading data for the struct.
> * The final result is an empty struct.
> All off the above is consistent with the information at 
> https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
>  which describes the Thrift binary encoding format
> My conclusion thus far is that the C++ encoder appears to produce an 
> "incorrect" binary encoding (I put incorrect in quotes because certainly 
> something as blatant as that would have been discovered by lots of other 
> people, so I am sure that I am still missing something).
> Additional information added on 19-Sep-2018:
> It appears that the C++ implementation of TFileTransport has the concept of 
> "events" when writing to disk.
> The output which is written to disk is divided into a sequence of "events" 
> where each "event" is preceded by a 4-byte length field of the event, 
> followed by the contents of the event.
> Looking at the hexdump above, the first couple of events are:
> `0100 0000 0d` : Event length 1, event value `0d`
> `02 0000 0000 01` : Event length 2, event value `00 01`
> Etc.
> The Python implementation of TFileTransport does not understand this concept 
> of events when parsing the file.
> It appears that the problem is one of the following two:
> 1) Either the C++ code should not be inserting these event lengths into the 
> encoded file,
> 2) Or the Python code should understand these event lengths when decoding the 
> file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (THRIFT-4638) When I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object

Reply via email to