Github user carlosfuertes commented on the pull request:
https://github.com/apache/spark/pull/1682#issuecomment-54774014
@JoshRosen I can imagine that avoiding duplication of keys can easily save
50% or more. If we are aiming at big data sizes, that can matter a lot. In the
JIRA post I explain how I was already seeing 15 MB of data with just 50,000
jobs. Gzip can make sense, but if you start with something that is already 50%
smaller, that's even better.
In any case, this is very simple to test and benchmark (with this PR, for
example), and at the end of the day it is just an optimization.
To be very concrete, something like
[ {"key1": "row1_value1", "key2": "row1_value2"},
  {"key1": "row2_value1", "key2": "row2_value2"},
  {"key1": "row3_value1", "key2": "row3_value2"} ]
versus using
{ "meta": { "keys": [ "key1", "key2" ] },
  "data": [ ["row1_value1", "row1_value2"],
            ["row2_value1", "row2_value2"],
            ["row3_value1", "row3_value2"] ] }
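A rough way to check the size claim is to serialize the same rows in both layouts and compare byte counts, with and without gzip. The sketch below is only an illustration: the column names and row values are made up for the benchmark and are not taken from the Spark WebUI, and the helper names (`as_row_objects`, `as_columnar`, `encoded_size`) are hypothetical.

```python
import gzip
import json


def as_row_objects(keys, rows):
    """One JSON object per row, so every row repeats every key."""
    return [dict(zip(keys, row)) for row in rows]


def as_columnar(keys, rows):
    """Keys listed once under meta.keys, rows stored as plain arrays."""
    return {"meta": {"keys": list(keys)}, "data": [list(row) for row in rows]}


def encoded_size(payload, compress=False):
    """Size in bytes of the compact JSON encoding, optionally gzipped."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return len(gzip.compress(raw)) if compress else len(raw)


# Hypothetical columns and values, chosen only to have something to measure.
keys = ["jobId", "status", "submissionTime", "numTasks"]
rows = [[i, "SUCCEEDED", 1400000000 + i, 8] for i in range(50_000)]

row_objects = as_row_objects(keys, rows)
columnar = as_columnar(keys, rows)

for label, payload in (("row objects", row_objects), ("columnar", columnar)):
    plain = encoded_size(payload)
    gzipped = encoded_size(payload, compress=True)
    print(f"{label}: {plain} bytes plain, {gzipped} bytes gzipped")
```

The actual ratio depends on how long the keys are relative to the values, so it is worth running this against real job data before drawing conclusions.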
I would stick to using JSON and not CSV or TSV, since JSON interacts
extremely well with JavaScript (it was designed to), and with pretty much
anything else, and you have the flexibility to add other meta information and
parse it.
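As a sketch of how a client might consume the keys-once layout, the snippet below expands it back into per-row dictionaries. The `to_rows` helper and the extra `total` field under `meta` are hypothetical; they only illustrate that additional meta information can sit next to the keys without affecting how the rows are parsed.

```python
import json


def to_rows(payload):
    """Expand a {"meta": {"keys": [...]}, "data": [[...], ...]} payload
    into a list of dictionaries, one per row."""
    keys = payload["meta"]["keys"]
    rows = []
    for values in payload["data"]:
        if len(values) != len(keys):
            raise ValueError("row length does not match meta.keys")
        rows.append(dict(zip(keys, values)))
    return rows


# "total" is a made-up extra meta field; to_rows simply ignores it.
text = """
{ "meta": { "keys": ["key1", "key2"], "total": 3 },
  "data": [ ["row1_value1", "row1_value2"],
            ["row2_value1", "row2_value2"],
            ["row3_value1", "row3_value2"] ] }
"""

payload = json.loads(text)
rows = to_rows(payload)
print(rows[0])
```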
I have seen some HTTP APIs that incorporate pagination and so forth (e.g. the
Steam web page), but I do not have a particular, freely available one in mind.
I think that talking about a "JSON API" is a bit confusing, and that it
would be better to refer to the HTTP API (which returns JSON for some calls). I
think it may make sense, in the doc that you are working on, to delineate the
full (RESTful) HTTP API for the WebUI rather than just the JSON part, so that
the global picture and design are clear.