Bruno Rijsman created THRIFT-4638:
-------------------------------------
Summary: When I use Thrift to serialize map in C++ to disk, and
then de-serialize it using Python, I do not get back the same object
Key: THRIFT-4638
URL: https://issues.apache.org/jira/browse/THRIFT-4638
Project: Thrift
Issue Type: Bug
Components: C++ - Library, Python - Library
Affects Versions: 1.0
Reporter: Bruno Rijsman
Summary: when I use Thrift to serialize map in C++ to disk, and then
de-serialize it using Python, I do not get back the same object.
See also
https://stackoverflow.com/questions/52377410/when-i-use-thrift-to-serialize-map-in-c-to-disk-and-then-de-serialize-it-usin
I have the following Thrift model file, which (as you can see) contains a map
indexed by a struct:
struct Coordinate {
1: required i32 x;
2: required i32 y;
}
struct Terrain {
1: required map<Coordinate, i32> altitude_samples;
}
I use the following C++ code to create an object with 3 coordinates in the map
(see the repo for complete code for all snippets below):
Terrain terrain;
add_sample_to_terrain(terrain, 10, 10, 100);
add_sample_to_terrain(terrain, 20, 20, 200);
add_sample_to_terrain(terrain, 30, 30, 300);
where:
void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t
altitude)
{
Coordinate coordinate;
coordinate.x = x;
coordinate.y = y;
std::pair<Coordinate, int32_t> sample(coordinate, altitude);
terrain.altitude_samples.insert(sample);
}
I use the following C++ code to serialize an object to disk:
shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
terrain.write(protocol.get());
Important note: for this to work correctly, I had to implement the function
Coordinate::operator<. Thrift does generate the declaration for the
Coordinate::operator< but does not generate the implementation of
Coordinate::operator<. The reason for this is that Thrift does not understand
the semantics of the struct and hence cannot guess the correct implementation
of the comparison operator. This is discussed at
http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%[email protected]%3E
// Thrift generates the declaration but not the implementation of operator<
because it has no way
// of knowning what the criteria for the comparison are. So, provide the
implementation here.
bool Coordinate::operator<(const Coordinate& other) const
{
if (x < other.x) {
return true;
} else if (x > other.x) {
return false;
} else if (y < other.y) {
return true;
} else {
return false;
}
}
Then, finally, I use the following Python code to de-serialize the same object
from disk:
file = open("terrain.dat", "rb")
transport = thrift.transport.TTransport.TFileObjectTransport(file)
protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
terrain = Terrain()
terrain.read(protocol)
print(terrain)
This Python program outputs:
Terrain(altitude_samples=None)
In other words, the de-serialized Terrain contains no terrain_samples, instead
of the expected dictionary with 3 coordinates.
I am 100% sure that the file terrain.dat contains valid data: I also
de-serialized the same data using C++ and in that case, I *do* get the expected
results (see repo for details)
I suspect that this has something to do with the comparison operator.
I gut feeling is that I should have done something similar in Python with
respect to the comparison operator as I did in C++. But I don't know what that
missing something would be.
Additional information added on 19-Sep-2018:
Here is a hexdump of the encoding produced by the C++ encoding program:
Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....
The first 4 bytes are 01 00 00 00
Using a debugger to step through the Python decoding function reveals that:
* This is being decoded as a struct (which is expected)
* The first byte 01 is interpreted as the field type. 01 means field type VOID.
* The next two bytes are interpreted as the field id. 00 00 means field ID 0.
* For field type VOID, nothing else is read and we continue to the next field.
* The next byte is interpreted as the field type. 00 means STOP.
* We top reading data for the struct.
* The final result is an empty struct.
All off the above is consistent with the information at
https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
which describes the Thrift binary encoding format
My conclusion thus far is that the C++ encoder appears to produce an
"incorrect" binary encoding (I put incorrect in quotes because certainly
something as blatant as that would have been discovered by lots of other
people, so I am sure that I am still missing something).
Additional information added on 19-Sep-2018:
It appears that the C++ implementation of TFileTransport has the concept of
"events" when writing to disk.
The output which is written to disk is divided into a sequence of "events"
where each "event" is preceded by a 4-byte length field of the event, followed
by the contents of the event.
Looking at the hexdump above, the first couple of events are:
`0100 0000 0d` : Event length 1, event value `0d`
`02 0000 0000 01` : Event length 2, event value `00 01`
Etc.
The Python implementation of TFileTransport does not understand this concept of
events when parsing the file.
It appears that the problem is one of the following two:
1) Either the C++ code should not be inserting these event lengths into the
encoded file,
2) Or the Python code should understand these event lengths when decoding the
file.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)