[
https://issues.apache.org/jira/browse/THRIFT-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James E. King III updated THRIFT-4638:
--------------------------------------
Affects Version/s: (was: 1.0)
This issue was marked as affecting version 1.0 which is not possible. If you
know which version this issue was discovered in, please mark it as such.
> When I use Thrift to serialize map in C++ to disk, and then de-serialize it
> using Python, I do not get back the same object
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: THRIFT-4638
> URL: https://issues.apache.org/jira/browse/THRIFT-4638
> Project: Thrift
> Issue Type: Bug
> Components: C++ - Library, Python - Library
> Reporter: Bruno Rijsman
> Priority: Major
>
> Summary: when I use Thrift to serialize map in C++ to disk, and then
> de-serialize it using Python, I do not get back the same object.
>
> See also
> https://stackoverflow.com/questions/52377410/when-i-use-thrift-to-serialize-map-in-c-to-disk-and-then-de-serialize-it-usin
> I have the following Thrift model file, which (as you can see) contains a map
> indexed by a struct:
> struct Coordinate {
> 1: required i32 x;
> 2: required i32 y;
> }
> struct Terrain {
> 1: required map<Coordinate, i32> altitude_samples;
> }
> I use the following C++ code to create an object with 3 coordinates in the
> map (see the repo for complete code for all snippets below):
> Terrain terrain;
> add_sample_to_terrain(terrain, 10, 10, 100);
> add_sample_to_terrain(terrain, 20, 20, 200);
> add_sample_to_terrain(terrain, 30, 30, 300);
> where:
> void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t
> altitude)
> {
> Coordinate coordinate;
> coordinate.x = x;
> coordinate.y = y;
> std::pair<Coordinate, int32_t> sample(coordinate, altitude);
> terrain.altitude_samples.insert(sample);
> }
> I use the following C++ code to serialize an object to disk:
> shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
> shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
> terrain.write(protocol.get());
> Important note: for this to work correctly, I had to implement the function
> Coordinate::operator<. Thrift does generate the declaration for the
> Coordinate::operator< but does not generate the implementation of
> Coordinate::operator<. The reason for this is that Thrift does not understand
> the semantics of the struct and hence cannot guess the correct implementation
> of the comparison operator. This is discussed at
> http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%[email protected]%3E
> // Thrift generates the declaration but not the implementation of operator<
> because it has no way
> // of knowning what the criteria for the comparison are. So, provide the
> implementation here.
> bool Coordinate::operator<(const Coordinate& other) const
> {
> if (x < other.x) {
> return true;
> } else if (x > other.x) {
> return false;
> } else if (y < other.y) {
> return true;
> } else {
> return false;
> }
> }
> Then, finally, I use the following Python code to de-serialize the same
> object from disk:
> file = open("terrain.dat", "rb")
> transport = thrift.transport.TTransport.TFileObjectTransport(file)
> protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
> terrain = Terrain()
> terrain.read(protocol)
> print(terrain)
> This Python program outputs:
> Terrain(altitude_samples=None)
> In other words, the de-serialized Terrain contains no terrain_samples,
> instead of the expected dictionary with 3 coordinates.
> I am 100% sure that the file terrain.dat contains valid data: I also
> de-serialized the same data using C++ and in that case, I *do* get the
> expected results (see repo for details)
> I suspect that this has something to do with the comparison operator.
> I gut feeling is that I should have done something similar in Python with
> respect to the comparison operator as I did in C++. But I don't know what
> that missing something would be.
> Additional information added on 19-Sep-2018:
> Here is a hexdump of the encoding produced by the C++ encoding program:
> Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
> 00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
> 00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
> 00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
> 00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
> 00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
> 00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
> 00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
> 00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
> 00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
> 00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
> 000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
> 000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....
> The first 4 bytes are 01 00 00 00
> Using a debugger to step through the Python decoding function reveals that:
> * This is being decoded as a struct (which is expected)
> * The first byte 01 is interpreted as the field type. 01 means field type
> VOID.
> * The next two bytes are interpreted as the field id. 00 00 means field ID 0.
> * For field type VOID, nothing else is read and we continue to the next field.
> * The next byte is interpreted as the field type. 00 means STOP.
> * We top reading data for the struct.
> * The final result is an empty struct.
> All off the above is consistent with the information at
> https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
> which describes the Thrift binary encoding format
> My conclusion thus far is that the C++ encoder appears to produce an
> "incorrect" binary encoding (I put incorrect in quotes because certainly
> something as blatant as that would have been discovered by lots of other
> people, so I am sure that I am still missing something).
> Additional information added on 19-Sep-2018:
> It appears that the C++ implementation of TFileTransport has the concept of
> "events" when writing to disk.
> The output which is written to disk is divided into a sequence of "events"
> where each "event" is preceded by a 4-byte length field of the event,
> followed by the contents of the event.
> Looking at the hexdump above, the first couple of events are:
> `0100 0000 0d` : Event length 1, event value `0d`
> `02 0000 0000 01` : Event length 2, event value `00 01`
> Etc.
> The Python implementation of TFileTransport does not understand this concept
> of events when parsing the file.
> It appears that the problem is one of the following two:
> 1) Either the C++ code should not be inserting these event lengths into the
> encoded file,
> 2) Or the Python code should understand these event lengths when decoding the
> file.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)