Can give it a try end of may, sure. (holidays and work constraints will
make it hard before).

On Apr 27, 2018 18:26, "Anton Kedin" <ke...@google.com> wrote:

> Romain,
>
> I don't believe the JSON approach was investigated very thoroughly. I
> mentioned a few reasons which make it not the best choice in my opinion,
> but I may be wrong. Can you put together a design doc or a prototype?
>
> Thank you,
> Anton
>
>
> On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>>
>> On Apr 26, 2018 23:13, "Anton Kedin" <ke...@google.com> wrote:
>>
>> BeamRecord (Row) has very little in common with JsonObject (I assume
>> you're talking about javax.json), except maybe some similarities in the
>> API. A few reasons why JsonObject doesn't work:
>>
>>    - it is a Java EE API:
>>       - the Beam SDK is not limited to Java. There are probably similar
>>       APIs for other languages but they might not necessarily carry the
>>       same semantics / APIs;
>>
>>
>> Not a big deal I think. At least not a technical blocker.
>>
>>
>>    - It can change between Java versions;
>>
>> No, this is Java EE ;).
>>
>>
>>
>>    - the current Beam Java implementation is an experimental feature to
>>       identify what's needed from such an API; in the end we might end up
>>       with something similar to the JsonObject API, but likely not;
>>
>>
>> I don't see that point as a blocker.
>>
>>
>>    - it represents JSON, which is not an API but an object notation:
>>       - it is defined as a unicode string in a certain format. If you
>>       choose to adhere to ECMA-404, then it doesn't sound like JsonObject
>>       can represent an Avro object, if I'm reading it right;
>>
>>
>> That is in the generator impl; you can implement an Avro generator.
>>
>>
>>    - doesn't define a type system (JSON does, but it's lacking):
>>       - for example, JSON doesn't define semantics for numbers;
>>       - doesn't define date/time types;
>>       - doesn't allow extending JSON type system at all;
>>
>>
>> That is why you need a metadata object, or simpler, a schema alongside
>> the data. JSON or a Beam record doesn't help here, and you end up with
>> the same outcome if you think about it.
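The number-semantics gap raised above can be made concrete with a minimal, JDK-only sketch: ECMA-404 leaves number semantics to the parser, and a parser that maps JSON numbers to IEEE-754 doubles silently loses precision for large 64-bit integers, which is exactly the kind of thing a schema has to pin down.

```java
public class JsonNumberSemantics {
    public static void main(String[] args) {
        // Many JSON parsers map numbers to IEEE-754 doubles, which
        // cannot represent every 64-bit integer exactly.
        long big = 9007199254740993L;   // 2^53 + 1
        double asDouble = (double) big; // rounds to 2^53
        System.out.println(big == (long) asDouble); // prints false
    }
}
```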
>>
>>
>>    - lacks schemas;
>>
>> JSON Schema is standard, widely adopted, and well tooled compared to the
>> alternatives.
>>
>> You can definitely try to loosen the requirements and define everything
>> in JSON in userland, but the point of Row/Schema is to avoid that and
>> define everything in the Beam model, which can be extended and mapped to
>> JSON, Avro, BigQuery schemas, custom binary formats etc., with the same
>> semantics across Beam SDKs.
>>
>>
>> This is what JSON-P would allow, with the benefit of natural POJO
>> support through JSON-B.
>>
>>
>>
>> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Just to make it clear and let me understand: how is BeamRecord
>>> different from a JsonObject, which is an API without implementation
>>> (not even a JSON one OOTB)? Advantages of the JSON *API* are indeed
>>> natural mapping (JSON-B is based on JSON-P, so there is no new binding
>>> to reinvent) and simple serialization (json+gzip for example, or Avro
>>> if you want to be geeky).
>>>
>>> I fail to see the point of rebuilding an ecosystem ATM.
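The "json+gzip" serialization mentioned above needs nothing beyond the JDK; a minimal sketch of the round trip (an illustration, not the proposal's actual wire format):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

// A JSON string round-tripped through GZIP using only java.util.zip.
public class JsonGzipRoundTrip {
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    static String gunzip(byte[] bytes) throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{ \"type\" : \"foo\", \"size\" : 333 }";
        String roundTripped = gunzip(gzip(json));
        System.out.println(json.equals(roundTripped)); // prints true
    }
}
```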
>>>
>>> On Apr 26, 2018 19:12, "Reuven Lax" <re...@google.com> wrote:
>>>
>>>> Exactly what JB said. We will write a generic conversion from Avro (or
>>>> JSON) to Beam schemas, which will make them work transparently with
>>>> SQL. The plan is also to migrate Anton's work so that POJOs work
>>>> generically for any schema.
>>>>
>>>> Reuven
>>>>
>>>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> For now we have a generic schema interface. JSON-B can be one impl,
>>>>> Avro could be another one.
>>>>>
>>>>> Regards
>>>>> JB
>>>>> On Apr 26, 2018, at 12:08, Romain Manni-Bucau <rmannibu...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hmm,
>>>>>>
>>>>>> Avro still has the pitfall of an uncontrolled stack which brings way
>>>>>> too many dependencies to be part of any API; this is why I proposed a
>>>>>> JSON-P based API (JsonObject) with a custom Beam entry for some
>>>>>> metadata (headers "à la Camel").
>>>>>>
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> <http://rmannibucau.wordpress.com> |  Github
>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>
>>>>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>>>>>
>>>>>>> Hi Ismaël,
>>>>>>>
>>>>>>> You mean directly in Beam SQL?
>>>>>>>
>>>>>>> That will be part of the schema support: GenericRecord could be one
>>>>>>> of the payloads supported across schemas.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>> On Apr 26, 2018, at 11:39, "Ismaël Mejía" <ieme...@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello Anton,
>>>>>>>>
>>>>>>>> Thanks for the descriptive email and the really useful work. Any
>>>>>>>> plans to tackle PCollections of GenericRecord/IndexedRecord? It
>>>>>>>> seems Avro is a natural fit for this approach too.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Ismaël
>>>>>>>>
>>>>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>>  Hi,
>>>>>>>>>
>>>>>>>>>  I want to highlight a couple of improvements to Beam SQL we have
>>>>>>>>>  been working on recently which are targeted at making the Beam SQL
>>>>>>>>>  API easier to use. Specifically, these features simplify the
>>>>>>>>>  conversion of Java Beans and JSON strings to Rows.
>>>>>>>>>
>>>>>>>>>  Feel free to try this and send any bugs/comments/PRs my way.
>>>>>>>>>
>>>>>>>>>  **Caveat: this is still work in progress, and has known bugs and
>>>>>>>>>  incomplete features, see below for details.**
>>>>>>>>>
>>>>>>>>>  Background
>>>>>>>>>
>>>>>>>>>  Beam SQL queries can only be applied to a PCollection<Row>. This
>>>>>>>>>  means that users need to convert whatever PCollection elements
>>>>>>>>>  they have to Rows before querying them with SQL. This usually
>>>>>>>>>  requires manually creating a Schema and implementing a custom
>>>>>>>>>  conversion PTransform<PCollection<Element>, PCollection<Row>>
>>>>>>>>>  (see the Beam SQL Guide).
>>>>>>>>>
>>>>>>>>>  The improvements described here are an attempt to reduce this
>>>>>>>>>  overhead for a few common cases, as a start.
>>>>>>>>>
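The Schema-plus-Row pairing described in the Background can be pictured with a toy, JDK-only sketch (these are not Beam's actual classes; real Beam Rows also carry field types and coders):

```java
import java.util.*;

// Toy illustration of the Schema/Row idea: a Schema names the fields,
// a Row holds values positionally and looks them up through its Schema.
public class SchemaRowSketch {
    static class Schema {
        final List<String> fieldNames;
        Schema(List<String> fieldNames) { this.fieldNames = fieldNames; }
        int indexOf(String name) {
            int i = fieldNames.indexOf(name);
            if (i < 0) throw new NoSuchElementException(name);
            return i;
        }
    }

    static class Row {
        final Schema schema;
        final List<Object> values;
        Row(Schema schema, List<Object> values) {
            this.schema = schema;
            this.values = values;
        }
        Object get(String field) { return values.get(schema.indexOf(field)); }
    }

    public static void main(String[] args) {
        Schema schema = new Schema(Arrays.asList("type", "size"));
        Row row = new Row(schema, Arrays.asList("foo", 333));
        System.out.println(row.get("type") + " " + row.get("size")); // foo 333
    }
}
```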
>>>>>>>>>
>>>>>>>>>  Status
>>>>>>>>>
>>>>>>>>>   - Introduced an InferredRowCoder to automatically generate Rows
>>>>>>>>>   from beans. Removes the need to manually define a Schema and Row
>>>>>>>>>   conversion logic;
>>>>>>>>>   - Introduced a JsonToRow transform to automatically parse JSON
>>>>>>>>>   objects to Rows. Removes the need to manually implement
>>>>>>>>>   conversion logic;
>>>>>>>>>   - This is still experimental work in progress, APIs will likely
>>>>>>>>>   change;
>>>>>>>>>   - There are known bugs/unsolved problems.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Java Beans
>>>>>>>>>
>>>>>>>>>  Introduced a coder which facilitates Row generation from Java
>>>>>>>>>  Beans. Reduces the overhead to:
>>>>>>>>>
>>>>>>>>>>  /** Some user-defined Java Bean */
>>>>>>>>>>  class JavaBeanObject implements Serializable {
>>>>>>>>>>    String getName() { ... }
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  // Obtain the objects:
>>>>>>>>>>  PCollection<JavaBeanObject> javaBeans = ...;
>>>>>>>>>>
>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>  PCollection<Row> queryResult =
>>>>>>>>>>      javaBeans
>>>>>>>>>>          .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
>>>>>>>>>>          .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Notice there is no more manual Schema definition or custom
>>>>>>>>>  conversion logic.
>>>>>>>>>
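A coder like InferredRowCoder can discover field names through bean introspection; a toy JDK-only sketch of that idea (this is not Beam's actual implementation, which also infers field types and builds coders):

```java
import java.beans.*;
import java.util.*;

// Sketch of schema inference from a Java Bean via java.beans reflection.
public class InferSchemaSketch {
    /** A bean comparable to the JavaBeanObject in the example above. */
    public static class JavaBeanObject implements java.io.Serializable {
        private final String name = "row1";
        public String getName() { return name; }
    }

    // List the readable bean properties, skipping Object's "class" property.
    static List<String> inferFieldNames(Class<?> beanClass)
            throws IntrospectionException {
        List<String> names = new ArrayList<>();
        BeanInfo info = Introspector.getBeanInfo(beanClass);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (!"class".equals(pd.getName())) {
                names.add(pd.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(inferFieldNames(JavaBeanObject.class)); // [name]
    }
}
```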
>>>>>>>>>  Links
>>>>>>>>>
>>>>>>>>>   - example;
>>>>>>>>>   - InferredRowCoder;
>>>>>>>>>   - test;
>>>>>>>>>
>>>>>>>>>  JSON
>>>>>>>>>
>>>>>>>>>  Introduced a JsonToRow transform. It is possible to query a
>>>>>>>>>  PCollection<String> that contains JSON objects like this:
>>>>>>>>>
>>>>>>>>>>  // Assuming JSON objects look like this:
>>>>>>>>>>  // { "type" : "foo", "size" : 333 }
>>>>>>>>>>
>>>>>>>>>>  // Define a Schema:
>>>>>>>>>>  Schema jsonSchema =
>>>>>>>>>>      Schema
>>>>>>>>>>          .builder()
>>>>>>>>>>          .addStringField("type")
>>>>>>>>>>          .addInt32Field("size")
>>>>>>>>>>          .build();
>>>>>>>>>>
>>>>>>>>>>  // Obtain a PCollection of the objects in JSON format:
>>>>>>>>>>  PCollection<String> jsonObjects = ...
>>>>>>>>>>
>>>>>>>>>>  // Convert to Rows and apply a SQL query:
>>>>>>>>>>  PCollection<Row> queryResults =
>>>>>>>>>>      jsonObjects
>>>>>>>>>>          .apply(JsonToRow.withSchema(jsonSchema))
>>>>>>>>>>          .apply(BeamSql.query(
>>>>>>>>>>              "SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Notice that the JSON to Row conversion is done by the JsonToRow
>>>>>>>>>  transform. It is currently required to supply a Schema.
>>>>>>>>>
>>>>>>>>>  Links
>>>>>>>>>
>>>>>>>>>   - JsonToRow;
>>>>>>>>>   - test/example;
>>>>>>>>>
>>>>>>>>>  Going Forward
>>>>>>>>>
>>>>>>>>>   - fix bugs (BEAM-4163, BEAM-4161, ...);
>>>>>>>>>   - implement more features (BEAM-4167, more types of objects);
>>>>>>>>>   - wire this up with sources/sinks to further simplify the SQL API.
>>>>>>>>>
>>>>>>>>>  Thank you,
>>>>>>>>>  Anton
>>>>>>
>>
