We can do even better, btw: build a SchemaRegistry where automatic conversions between schemas and Java data types can be registered. With this the user won't even need a DoFn to do the conversion.
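Roughly what I have in mind, just as a sketch (none of these names or signatures are committed API, and the packages may still move):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.Row;

/** Sketch only: a registry of per-type Row conversions (names hypothetical). */
public interface SchemaRegistry {

  /** Register a schema plus the two-way Row conversion for a user type. */
  <T> void registerSchema(
      Class<T> type,
      Schema schema,
      SerializableFunction<T, Row> toRow,
      SerializableFunction<Row, T> fromRow);

  /** Look up the registered schema for a type, if any. */
  <T> Schema getSchema(Class<T> type);
}

Once a type is registered, a PCollection of that type could be handed straight to BeamSql.query(...): the SDK would look up the registered toRow/fromRow functions itself instead of the user writing a conversion DoFn.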
On Tue, May 22, 2018, 10:13 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Hi guys,
>
> Checked out what has been done on the schema model and think it is acceptable - regarding the json debate - if https://issues.apache.org/jira/browse/BEAM-4381 can be fixed.
>
> High level, it is about providing a mainstream and not too impacting model OOTB, and JSON seems the most valid option for now, at least for IO and some user transforms.
>
> Wdyt?
>
> On Fri, Apr 27, 2018 at 18:36, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> Can give it a try end of May, sure. (Holidays and work constraints will make it hard before.)
>>
>> On Apr 27, 2018 at 18:26, "Anton Kedin" <ke...@google.com> wrote:
>>
>>> Romain,
>>>
>>> I don't believe the JSON approach was investigated very thoroughly. I mentioned a few reasons which make it not the best choice in my opinion, but I may be wrong. Can you put together a design doc or a prototype?
>>>
>>> Thank you,
>>> Anton
>>>
>>> On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>
>>>> On Apr 26, 2018 at 23:13, "Anton Kedin" <ke...@google.com> wrote:
>>>>
>>>> BeamRecord (Row) has very little in common with JsonObject (I assume you're talking about javax.json), except maybe some similarities of the API. A few reasons why JsonObject doesn't work:
>>>>
>>>> - it is a Java EE API:
>>>>   - the Beam SDK is not limited to Java. There are probably similar APIs for other languages but they might not necessarily carry the same semantics / APIs;
>>>>
>>>> Not a big deal I think. At least not a technical blocker.
>>>>
>>>>   - it can change between Java versions;
>>>>
>>>> No, this is javaee ;).
>>>>
>>>>   - the current Beam java implementation is an experimental feature to identify what's needed from such an API; in the end we might end up with something similar to the JsonObject API, but likely not;
>>>>
>>>> I don't get that point as a blocker.
>>>>
>>>> - it represents JSON, which is not an API but an object notation:
>>>>   - it is defined as a unicode string in a certain format. If you choose to adhere to ECMA-404, then it doesn't sound like JsonObject can represent an Avro object, if I'm reading it right;
>>>>
>>>> It is in the generator impl, you can impl an avro generator.
>>>>
>>>> - it doesn't define a type system (JSON does, but it's lacking):
>>>>   - for example, JSON doesn't define semantics for numbers;
>>>>   - doesn't define date/time types;
>>>>   - doesn't allow extending the JSON type system at all;
>>>>
>>>> That is why you need a metadata object, or simpler, a schema with that data. Json or beam record doesn't help here and you end up with the same outcome if you think about it.
>>>>
>>>> - it lacks schemas;
>>>>
>>>> Jsonschema is standard, widely spread and tooled compared to the alternatives.
>>>>
>>>> You can definitely try to loosen the requirements and define everything in JSON in userland, but the point of Row/Schema is to avoid that and define everything in the Beam model, which can be extended, mapped to JSON, Avro, BigQuery schemas, custom binary formats etc., with the same semantics across beam SDKs.
>>>>
>>>> This is what jsonp would allow, with the benefit of natural pojo support through jsonb.
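For reference, the jsonp + jsonb route argued for just above would look roughly like this (a sketch only -- the POJO and class names are made up, and it assumes a JSON-P/JSON-B implementation such as Johnzon or Yasson on the classpath):

import javax.json.Json;
import javax.json.JsonObject;
import javax.json.bind.Jsonb;
import javax.json.bind.JsonbBuilder;

public class JsonpJsonbSketch {

  // A plain user POJO; JSON-B binds public fields/properties by default.
  public static class Event {
    public String type;
    public int size;
  }

  public static void main(String[] args) throws Exception {
    // JSON-P: a generic, implementation-agnostic record (the "JsonObject as Row" idea).
    JsonObject jsonRecord = Json.createObjectBuilder()
        .add("type", "foo")
        .add("size", 333)
        .build();

    // JSON-B: natural POJO mapping on top of JSON-P, no hand-written binding.
    try (Jsonb jsonb = JsonbBuilder.create()) {
      Event event = jsonb.fromJson(jsonRecord.toString(), Event.class);
      String roundTripped = jsonb.toJson(event);
      System.out.println(roundTripped); // e.g. {"size":333,"type":"foo"}
    }
  }
}

The appeal is that the generic JsonObject and the user POJO share one binding layer; the open question in this thread is whether that also gives a type system and schema story as rich as Row/Schema.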
>>>> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>
>>>>> Just to make it clear and let me understand: how is BeamRecord different from a JsonObject, which is an API without implementation (not even a json one OOTB)? Advantages of a json *api* are indeed natural mapping (jsonb is based on jsonp so no new binding to reinvent) and simple serialization (json+gzip for ex, or avro if you want to be geeky).
>>>>>
>>>>> I fail to see the point of rebuilding an ecosystem ATM.
>>>>>
>>>>> On Apr 26, 2018 at 19:12, "Reuven Lax" <re...@google.com> wrote:
>>>>>
>>>>>> Exactly what JB said. We will write a generic conversion from Avro (or json) to Beam schemas, which will make them work transparently with SQL. The plan is also to migrate Anton's work so that POJOs work generically for any schema.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> For now we have a generic schema interface. Json-b can be an impl, avro could be another one.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Apr 26, 2018, at 12:08, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hmm,
>>>>>>>>
>>>>>>>> avro still has the pitfall of an uncontrolled stack which brings way too many dependencies to be part of any API; this is why I proposed a JSON-P based API (JsonObject) with a custom beam entry for some metadata (headers "à la Camel").
>>>>>>>>
>>>>>>>> Romain Manni-Bucau
>>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>>
>>>>>>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>>>>>>>
>>>>>>>>> Hi Ismael
>>>>>>>>>
>>>>>>>>> You mean directly in Beam SQL ?
>>>>>>>>>
>>>>>>>>> That will be part of schema support: a generic record could be one of the payloads used with a schema.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On Apr 26, 2018, at 11:39, "Ismaël Mejía" <ieme...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Anton,
>>>>>>>>>>
>>>>>>>>>> Thanks for the descriptive email and the really useful work. Any plans to tackle PCollections of GenericRecord/IndexedRecords? It seems Avro is a natural fit for this approach too.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Ismaël
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I want to highlight a couple of improvements to Beam SQL we have been working on recently which are targeted at making the Beam SQL API easier to use. Specifically, these features simplify conversion of Java Beans and JSON strings to Rows.
>>>>>>>>>>>
>>>>>>>>>>> Feel free to try this and send any bugs/comments/PRs my way.
>>>>>>>>>>> **Caveat: this is still work in progress, and has known bugs and incomplete features, see below for details.**
>>>>>>>>>>>
>>>>>>>>>>> Background
>>>>>>>>>>>
>>>>>>>>>>> Beam SQL queries can only be applied to a PCollection<Row>. This means that users need to convert whatever PCollection elements they have to Rows before querying them with SQL. This usually requires manually creating a Schema and implementing a custom conversion PTransform<PCollection<Element>, PCollection<Row>> (see the Beam SQL Guide).
>>>>>>>>>>>
>>>>>>>>>>> The improvements described here are an attempt to reduce this overhead for a few common cases, as a start.
>>>>>>>>>>>
>>>>>>>>>>> Status
>>>>>>>>>>>
>>>>>>>>>>> Introduced an InferredRowCoder to automatically generate rows from beans. Removes the need to manually define a Schema and Row conversion logic;
>>>>>>>>>>> Introduced a JsonToRow transform to automatically parse JSON objects to Rows. Removes the need to manually implement conversion logic;
>>>>>>>>>>> This is still experimental work in progress, APIs will likely change;
>>>>>>>>>>> There are known bugs/unsolved problems.
>>>>>>>>>>>
>>>>>>>>>>> Java Beans
>>>>>>>>>>>
>>>>>>>>>>> Introduced a coder which facilitates Row generation from Java Beans. Reduces the overhead to:
>>>>>>>>>>>
>>>>>>>>>>>> /** Some user-defined Java Bean */
>>>>>>>>>>>> class JavaBeanObject implements Serializable {
>>>>>>>>>>>>   String getName() { ... }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> // Obtain the objects:
>>>>>>>>>>>> PCollection<JavaBeanObject> javaBeans = ...;
>>>>>>>>>>>>
>>>>>>>>>>>> // Convert to Rows and apply a SQL query:
>>>>>>>>>>>> PCollection<Row> queryResult =
>>>>>>>>>>>>     javaBeans
>>>>>>>>>>>>         .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
>>>>>>>>>>>>         .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>>>>>>>>
>>>>>>>>>>> Notice, there is no more manual Schema definition or custom conversion logic.
>>>>>>>>>>>
>>>>>>>>>>> Links
>>>>>>>>>>>
>>>>>>>>>>> example;
>>>>>>>>>>> InferredRowCoder;
>>>>>>>>>>> test;
>>>>>>>>>>>
>>>>>>>>>>> JSON
>>>>>>>>>>>
>>>>>>>>>>> Introduced the JsonToRow transform. It is possible to query a PCollection<String> that contains JSON objects like this:
>>>>>>>>>>>
>>>>>>>>>>>> // Assuming JSON objects look like this:
>>>>>>>>>>>> // { "type" : "foo", "size" : 333 }
>>>>>>>>>>>>
>>>>>>>>>>>> // Define a Schema:
>>>>>>>>>>>> Schema jsonSchema =
>>>>>>>>>>>>     Schema
>>>>>>>>>>>>         .builder()
>>>>>>>>>>>>         .addStringField("type")
>>>>>>>>>>>>         .addInt32Field("size")
>>>>>>>>>>>>         .build();
>>>>>>>>>>>>
>>>>>>>>>>>> // Obtain a PCollection of the objects in JSON format:
>>>>>>>>>>>> PCollection<String> jsonObjects = ...
>>>>>>>>>>>>
>>>>>>>>>>>> // Convert to Rows and apply a SQL query:
>>>>>>>>>>>> PCollection<Row> queryResults =
>>>>>>>>>>>>     jsonObjects
>>>>>>>>>>>>         .apply(JsonToRow.withSchema(jsonSchema))
>>>>>>>>>>>>         .apply(BeamSql.query("SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));
>>>>>>>>>>>
>>>>>>>>>>> Notice, JSON to Row conversion is done by the JsonToRow transform. It is currently required to supply a Schema.
>>>>>>>>>>>
>>>>>>>>>>> Links
>>>>>>>>>>>
>>>>>>>>>>> JsonToRow;
>>>>>>>>>>> test/example;
>>>>>>>>>>>
>>>>>>>>>>> Going Forward
>>>>>>>>>>>
>>>>>>>>>>> fix bugs (BEAM-4163, BEAM-4161 ...);
>>>>>>>>>>> implement more features (BEAM-4167, more types of objects);
>>>>>>>>>>> wire this up with sources/sinks to further simplify the SQL API;
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Anton