Hi Rob,

Your (de)serialization protocol is insecure if it must rely on EOF detection for correctness. Sorry to say this to you, but no matter amount of workarounds and hacks, such will break in amusing ways.

An important design difference here: Thrift statically knows, at compile / codegen time, how many fields of which width to expect next. If I understand correctly, the formats of Jena are defined as simply concatenations of Thrift Compact (or ProtoBuf) object serializations. This implies a protocol design flaw.

Even as a "Jena outsider" and having not checked the code, what I immediately know about Jena streams receiver logic: it has no idea whatsoever, even at runtime (!), how many more objects to expect. Thus you think you need the EOF checks so badly.

The classic ways to address this flaw are all about adding appropriate framing. Two basic options for the sender: When it has the luxury of knowing the array size in advance: send a "stream header" telling this size to the receiver. Examples: HTTP 'Content-Length' header, 'File Size' header field in BMP images.Otherwise, it must add signalling of "this is the last item" /in the layer of your protocol/. Examples: FIN flag in TCP, IEND tag in PNG images.
In Thrift IDL, option 1 might look like:

struct RDF_StreamHeader {
1: required i32 protoVersion
2: required i32 nStreamItems
}

And option 2 might look like:

union RDF_StreamRowVariant {
1: RDF_PrefixDecl   prefixDecl
2: RDF_Triple       triple
3: RDF_Quad         quad
}
struct RDF_StreamRow {
1: required RDF_StreamRowVariant datum
2: required bool isLastDatumInStream
}

HTH & Best Regards
Max


On Wed, June 19 2024 at 08:42:38 +0000, Rob @ DNR <rve...@dotnetrdf.org> wrote:
Hey All

Over in the Apache Jena [1] project one of the RDF serialisations we support is based upon a binary encoding [2] of RDF using Thrift [3]. We use the Thrift compact protocol and have various implementations of our high-level reader [4] and writer [5] interfaces wrapped around those.

However, it was recently noticed that when presented with completely junk data our readers would silently return. And more worryingly, taking an otherwise valid Thrift data stream and arbitrarily truncating it would produce some valid output and silently ignore the trailing malformed data thus losing some data.

Upon debugging this was determined to be due to our own EOF exception check at [6]. Historically this appears to have been added due to THRIFT-5022 which was fixed in your codebase back in Nov 2019 [7]. However, even with that fix calling protocol.getTransport().isOpen() isn’t a usable check for EOF as even if EOF has been reached this can still return true as the underlying stream might not have been closed yet. Similarly, protocol.getTransport().peek() doesn’t seem a reliable indicator of whether EOF has been reached (it just defers to isOpen() in the default implementation), and we don’t know whether the underlying InputStream is buffered or not (though we do try to enforce that in our wrapper code).

So, we’ve currently ended up with this draft PR [8] which feels like a complete hack as I’m basically examining the Exception stack trace to see where in the read attempt we were when we failed and use that as an indicator of malformed input vs true EOF.

I feel like we’re missing something obvious about how best to use the Thrift Java APIs, any guidance/suggestions on how to do differentiate between EOF and malformed inputs better would be much appreciated!

Or if it’s preferable to handle this as a JIRA Issue let me know and I can get that filed

Cheers,

Rob

[1]: https://jena.apache.org <https://jena.apache.org/>
[2]: <https://jena.apache.org/documentation/io/rdf-binary.html>
[3]: <https://github.com/apache/jena/blob/main/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift> [4]: <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L165-L185> [5]: <https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/thrift/StreamRDF2Thrift.java> [6]: <https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L177-L179> [7]: <https://lists.apache.org/thread/j03hf73yv1ljc0fk6ffohfc22yo3v7hd> [8]: <https://github.com/apache/jena/pull/2408/files#diff-2dce962e024e06557e133fe135c8e968e1a381aee892894f236b9a22046008d4>

Reply via email to