Oh! That is easier. I had missed the detail that the b64-encoded data is being set in string_value (and not bytes_value). I was (incorrectly) thinking that we had to somehow differentiate between when bytes_value had b64 data and when it had "normal" data :)

F21 wrote:
I think the best way would be to:
1. Check to see if there are raw bytes in bytes_value. If so, use these
raw bytes.
2. If bytes_value is empty, check if there's something in string_value.
If the string_value is correctly decoded using base64, then use the
decoded raw bytes.
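The two-step fallback above could be sketched roughly like this in Go (typedValue here is an illustrative stand-in struct, not the generated Avatica protobuf type):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// typedValue is a hypothetical stand-in for Avatica's TypedValue
// message; the field names are assumptions for illustration only.
type typedValue struct {
	BytesValue  []byte
	StringValue string
}

// extractBytes implements the proposed fallback: prefer raw bytes in
// bytes_value; if that is empty, try to base64-decode string_value.
func extractBytes(tv typedValue) ([]byte, error) {
	if len(tv.BytesValue) > 0 {
		return tv.BytesValue, nil // step 1: raw bytes win
	}
	// step 2: fall back to the base64-decoded string_value
	return base64.StdEncoding.DecodeString(tv.StringValue)
}

func main() {
	// Simulate an old client that put base64 text into string_value.
	old := typedValue{StringValue: base64.StdEncoding.EncodeToString([]byte{0xDE, 0xAD})}
	b, err := extractBytes(old)
	fmt.Println(b, err) // [222 173] <nil>
}
```

Note that a string_value that happens to be valid base64 would also decode without error, which is why deprecating the old path later is still worthwhile.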

It would be nice to deprecate the wrong behavior in a future release,
but I think this solution should make the transition to the correct
behavior pretty smooth.

On 20/04/2016 11:27 PM, Josh Elser wrote:
Awesome work!

This definitely sounds like a bug. It must be happening inadvertently
in the translation. A JIRA issue for this would be great.

It'll be tricky making this "just work" for older versions, but
that'll be something to figure out when fixing it.

- Josh

F21 wrote:
Hey Josh,

I was able to discover the source of the problem. I found your Github
repo showing how to use the phoenix thin client here:
https://github.com/joshelser/phoenix-queryserver-jdbc-client

I hacked up CreateTables.java to create a table with a VARBINARY column
and to read a file into a byte array and upsert it into the column.

I was able to use Wireshark to capture the traffic to the phoenix query
server, export the HTTP body and decode it using: protoc --decode_raw <
java-request > java-decoded.txt

At the same time, I made my golang driver dump the request into a file
and used the protoc tool to decode it.

I then used a diff tool to compare both requests.

I noticed the following:
- The binary data must be base64 encoded.
- Even though the rep type is BYTE_STRING, the base64 encoded string
must be set in string_value and not bytes_value.

However, when I query the binary column, it comes back correctly as raw
bytes in bytes_value.
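The asymmetry I observed on the wire could be reproduced with a small sketch like this (again, typedValue and its fields are illustrative stand-ins, not the generated Avatica types):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// typedValue mimics the shape of Avatica's TypedValue message; the
// field names here are assumptions, not the real generated Go code.
type typedValue struct {
	Rep         string // e.g. "BYTE_STRING"
	StringValue string // where the base64 text currently has to go
	BytesValue  []byte // left empty by the Java client today
}

// encodeBinaryParam reproduces the observed wire behaviour: even with
// rep BYTE_STRING, the payload is base64 text in string_value.
func encodeBinaryParam(raw []byte) typedValue {
	return typedValue{
		Rep:         "BYTE_STRING",
		StringValue: base64.StdEncoding.EncodeToString(raw),
	}
}

func main() {
	tv := encodeBinaryParam([]byte{0x00, 0xFF})
	fmt.Println(tv.Rep, tv.StringValue) // BYTE_STRING AP8=
}
```

On the response side, by contrast, the server fills bytes_value with the raw bytes directly, which is what makes the upsert path look like a bug.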

Let me know if you want me to open an issue regarding this on JIRA :)

Here's a gist containing the java source code and decoded protocol
buffers for the java and golang requests:
https://gist.github.com/F21/381a62b11bfa2b4fe212dfa328bf7053


On 19/04/2016 9:12 AM, Josh Elser wrote:
F21 wrote:
Hey Josh and Julian,

Thanks for your input regarding this. I also spent hours and hours
over
the last few days trying to get to the bottom of this, and it honestly
wasn't the best use of my time either. I am sure being in different
timezones also makes it much harder to get this problem resolved quickly.

Because protobuf is a binary protocol, it can be somewhat difficult to
debug. I tried installing a no-longer-maintained wireshark dissector,
but I wasn't able to get it to compile. I also tried a few Fiddler2
plugins and the Charles debugger, but they also ended up being dead
ends.

There's a low-hanging fruit here of having PQS dump out "stringified"
messages in the console (similar to how I was generating the JSON for
you). This is something that we need to standardize in Avatica and
provide as a "feature". It should be trivial to enable RPC logging.

This is also step one and would, likely, be enough for you to diff the
output between sqlline-thin and your driver to figure out how they
vary.

While writing a script to demonstrate the problem might be helpful
for a
lot of cases, I don't think it would be helpful here. For example, we
could mandate that a reproduction be written in Java, but the problem
could actually be how the golang protobuf library
(https://github.com/golang/protobuf) is encoding the messages and the
test script would not be able to catch it. At the same time, having
people submitting scripts written in all sorts of different languages
isn't going to help either.

Go has a concept of using "golden files", which I think would be
extremely useful here. This basically runs some test input and
captures the output as a file (if the test and other checks pass). In
future runs, the tests run and their output is compared against the
golden files. Also, if the tests are updated, the golden files can be
automatically regenerated by just re-running the tests. This video
gives a pretty good overview of how they work:

Yep. Exactly what I was thinking about (I was actually reminded of Hive
-- they do this to verify their query parser's correctness. I think I've
stumbled across the same idea in some of Julian's other projects).

In the case of driver implementers, I think it would be extremely
useful
to have a bunch of "golden files" for us to test against. In our case,
this might just be dumps of encoded protobuf messages for a given
input.
These golden files would be automatically generated from the avatica
remote driver test suite. After a successful build, it can save the
protobufs encoded by the remote driver somewhere. Driver implementers
can then just take the input used, generate the output, and do a
byte-by-byte comparison against the golden files. If the driver does
not encode the message correctly, it should be pretty evident.

I would love to hear what ideas you guys have regarding presenting the
input in a language agnostic way. In a lot of cases, it's not simply
having a SQL statement. For example, parameters provided to a
statement might be an external file. Metadata like the connection id
and the statement id also needs to be dealt with.

Again, totally right. SQL only is part of the problem. More thought is
definitely required from my end :). I also would like to avoid getting
in the game of maintaining our own full-blown test-harness project if
at all avoidable.

Maybe something like toml-test
(https://github.com/BurntSushi/toml-test)
would be useful, but it lacks a java encoder/decoder.

Anyway, those are just some of my thoughts and I would love to hear
what
you guys think would be the best way to implement the test framework.
Hopefully, we can get the issue of inserting binary data resolved
soon :)

Cheers,
Francis

On 19/04/2016 3:24 AM, Julian Hyde wrote:
+1 on some kind of test framework

I can tell that Josh has put in a lot of effort trying to reproduce
this problem based on descriptions in emails. For the sake of the
sanity of our committers, these kinds of problems should be
accompanied by a pull request that contains a script that reproduces
the issue.
What language that script is written in is up for debate.

Julian


On Apr 18, 2016, at 7:20 AM, Josh Elser <josh.el...@gmail.com>
wrote:

Ok, I will try to circle around to this.

Meanwhile, let's start thinking about ways that you and I can better
collaborate :)

We want to make building drivers for Avatica in other languages as
painless as possible, but it seems like we have a big impedance
mismatch already -- not having a common language/test framework we
can both use. What kind of tooling would help you in testing your
driver and give us a common-base to analyze failures that you run
into?

* Dumping of protobufs to stdout by the Avatica server (PQS)?
* Known protobuf requests/responses for executing some SQL?
* ... something else entirely?


