Hi Rob,
Your (de)serialization protocol is insecure if it must rely on EOF
detection for correctness. Sorry to say this to you, but no matter
amount of workarounds and hacks, such will break in amusing ways.
An important design difference here: Thrift statically knows, at
compile / codegen time, how many fields of which width to expect next.
If I understand correctly, the formats of Jena are defined as simply
concatenations of Thrift Compact (or ProtoBuf) object serializations.
This implies a protocol design flaw.
Even as a "Jena outsider" and having not checked the code, what I
immediately know about Jena streams receiver logic: it has no idea
whatsoever, even at runtime (!), how many more objects to expect. Thus
you think you need the EOF checks so badly.
The classic ways to address this flaw are all about adding appropriate
framing. Two basic options for the sender:
When it has the luxury of knowing the array size in advance: send a
"stream header" telling this size to the receiver. Examples: HTTP
'Content-Length' header, 'File Size' header field in BMP
images.Otherwise, it must add signalling of "this is the last item" /in
the layer of your protocol/. Examples: FIN flag in TCP, IEND tag in PNG
images.
In Thrift IDL, option 1 might look like:
struct RDF_StreamHeader {
1: required i32 protoVersion
2: required i32 nStreamItems
}
And option 2 might look like:
union RDF_StreamRowVariant {
1: RDF_PrefixDecl prefixDecl
2: RDF_Triple triple
3: RDF_Quad quad
}
struct RDF_StreamRow {
1: required RDF_StreamRowVariant datum
2: required bool isLastDatumInStream
}
HTH & Best Regards
Max
On Wed, June 19 2024 at 08:42:38 +0000, Rob @ DNR
<rve...@dotnetrdf.org> wrote:
Hey All
Over in the Apache Jena [1] project one of the RDF serialisations we
support is based upon a binary encoding [2] of RDF using Thrift [3].
We use the Thrift compact protocol and have various implementations
of our high-level reader [4] and writer [5] interfaces wrapped around
those.
However, it was recently noticed that when presented with completely
junk data our readers would silently return. And more worryingly,
taking an otherwise valid Thrift data stream and arbitrarily
truncating it would produce some valid output and silently ignore the
trailing malformed data thus losing some data.
Upon debugging this was determined to be due to our own EOF exception
check at [6]. Historically this appears to have been added due to
THRIFT-5022 which was fixed in your codebase back in Nov 2019 [7].
However, even with that fix calling protocol.getTransport().isOpen()
isn’t a usable check for EOF as even if EOF has been reached this
can still return true as the underlying stream might not have been
closed yet. Similarly, protocol.getTransport().peek() doesn’t seem
a reliable indicator of whether EOF has been reached (it just defers
to isOpen() in the default implementation), and we don’t know
whether the underlying InputStream is buffered or not (though we do
try to enforce that in our wrapper code).
So, we’ve currently ended up with this draft PR [8] which feels
like a complete hack as I’m basically examining the Exception stack
trace to see where in the read attempt we were when we failed and use
that as an indicator of malformed input vs true EOF.
I feel like we’re missing something obvious about how best to use
the Thrift Java APIs, any guidance/suggestions on how to do
differentiate between EOF and malformed inputs better would be much
appreciated!
Or if it’s preferable to handle this as a JIRA Issue let me know
and I can get that filed
Cheers,
Rob
[1]: https://jena.apache.org <https://jena.apache.org/>
[2]: <https://jena.apache.org/documentation/io/rdf-binary.html>
[3]:
<https://github.com/apache/jena/blob/main/jena-arq/Grammar/RDF-Thrift/BinaryRDF.thrift>
[4]:
<https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L165-L185>
[5]:
<https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/thrift/StreamRDF2Thrift.java>
[6]:
<https://github.com/apache/jena/blob/93335dd1b3b2502968ef3cdcf5b7d60540577e9b/jena-arq/src/main/java/org/apache/jena/riot/thrift/ThriftRDF.java#L177-L179>
[7]:
<https://lists.apache.org/thread/j03hf73yv1ljc0fk6ffohfc22yo3v7hd>
[8]:
<https://github.com/apache/jena/pull/2408/files#diff-2dce962e024e06557e133fe135c8e968e1a381aee892894f236b9a22046008d4>