Hey Ryan,
Thanks for the information. The test case you added covers the Thrift
"round-trip" case, which first writes a Thrift record to a Parquet file,
and then reads it back from the Parquet file. Since Thrift always has
schema information at hand, it's possible for Thrift to interpret the
`BINARY (UTF8)` field in the Parquet file correctly as a Thrift BINARY
field. However, when taking interoperability into account, say, writing
a Thrift record containing a BINARY field into Parquet and then reading
the Parquet file with Spark SQL, Spark SQL has no knowledge of the
Thrift schema and simply interprets the field as a UTF-8 string, even
though values in that field may not be valid UTF-8 strings.
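To make the concern concrete, here is a minimal Python sketch. It is only
an illustration: the byte values and the helper `is_valid_utf8` are made up
for this example, and it does not model how parquet-thrift or Spark SQL
actually handle the field. It shows that arbitrary bytes need not be valid
UTF-8, and that a reader which decodes them leniently cannot recover the
original bytes.

```python
# Hypothetical payload: arbitrary bytes such as a Thrift BINARY field might carry.
raw = bytes([0xFF, 0xFE, 0x00, 0x80])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A reader that trusts a UTF8 annotation and decodes leniently replaces
# every invalid sequence with U+FFFD, so re-encoding does not give back
# the original bytes.
lossy = raw.decode("utf-8", errors="replace")
round_tripped = lossy.encode("utf-8")

print(is_valid_utf8(raw))
print(round_tripped == raw)
```

A reader that keeps the underlying bytes untouched and never re-encodes
them avoids this loss, which would be consistent with what I observed in
testing.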
Fortunately, after some testing with Spark SQL, it seems that this
behavior doesn't cause data corruption. If a user does know that a field
is a Thrift BINARY, she can always apply a SQL type cast and everything
works fine.
So the only annoying part is that there can be temporary string
objects which both Parquet and Spark SQL expect to be UTF-8, but which
actually contain arbitrary bytes that may include invalid UTF-8 byte
sequences. For now I tend to just live with it...
Cheng
On 7/7/15 1:28 PM, Ryan Blue wrote:
We had a bug report about this a while back and I wrote some tests to
verify that the binary behavior was correct:
https://github.com/apache/parquet-mr/pull/145/files
I believe that Thrift binary is correctly stored as binary and Thrift
String is correctly stored as binary (UTF8). Is that not the case?
Could you write a Thrift schema conversion unit test that demonstrates
the case you're talking about here?
rb
On 07/07/2015 01:19 PM, Cheng Lian wrote:
Hi all,
I’m working on Spark SQL Parquet support. While doing compatibility
testing with parquet-thrift, I noticed that parquet-thrift always treats Thrift
BINARY types as UTF-8 strings. But this isn’t Parquet’s fault, because
Thrift doesn’t even have a genuine BINARY type at all. Below is quoted
from Thrift docs <https://thrift.apache.org/docs/types>:
Base Types
…
* string: A text string encoded using UTF-8 encoding
…
Special Types
* binary: a sequence of unencoded bytes
N.B.: This is currently a specialized form of the string type above,
added to provide better interoperability with Java. The current
plan-of-record is to elevate this to a base type at some point.
I think this implies we can’t treat Thrift STRING as `BINARY (UTF8)` in
parquet-thrift, since it may contain unencoded random bytes. My proposal
here is to always treat Thrift STRING as raw Parquet BINARY without any
annotation. We still need to think about backwards compatibility,
though. Thoughts?
I will try to provide a minimal test case for this later.
Cheng