Just to be clear so I understand: how is BeamRecord different from a JsonObject, which is an API without implementation (not even a JSON one OOTB)? The advantages of a JSON *API* are indeed the natural mapping (JSON-B is based on JSON-P, so there is no new binding to reinvent) and simple serialization (JSON+gzip for example, or Avro if you want to be geeky).
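[Editor's note: a minimal pure-JDK sketch of the "JSON+gzip" serialization Romain mentions, using only java.util.zip; the record content is hypothetical and this is an illustration, not anything from Beam or JSON-P.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class JsonGzipRoundTrip {

    /** Compress a JSON payload with gzip, as a simple wire format. */
    static byte[] compress(String json) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    /** Decompress the gzip bytes back into the JSON string. */
    static String decompress(byte[] bytes) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"type\":\"foo\",\"size\":333}"; // hypothetical record
        byte[] wire = compress(json);
        System.out.println(decompress(wire).equals(json)); // round-trip check
    }
}
```

The point being illustrated: a plain JSON string plus a stock codec already gives a workable serialization story without a new record type.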
I fail to see the point of rebuilding an ecosystem ATM.

On 26 Apr 2018 19:12, "Reuven Lax" <re...@google.com> wrote:

> Exactly what JB said. We will write a generic conversion from Avro (or
> json) to Beam schemas, which will make them work transparently with SQL.
> The plan is also to migrate Anton's work so that POJOs work generically
> for any schema.
>
> Reuven
>
> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> For now we have a generic schema interface. Json-b can be an impl, avro
>> could be another one.
>>
>> Regards
>> JB
>>
>> On 26 Apr 2018, at 12:08, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>
>>> Hmm,
>>>
>>> Avro still has the pitfall of an uncontrolled stack which brings
>>> way too many dependencies to be part of any API;
>>> this is why I proposed a JSON-P based API (JsonObject) with a custom
>>> Beam entry for some metadata (headers "à la Camel").
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>
>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>>
>>>> Hi Ismaël,
>>>>
>>>> You mean directly in Beam SQL?
>>>>
>>>> That will be part of schema support: generic record could be one of the
>>>> payloads across schemas.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 26 Apr 2018, at 11:39, "Ismaël Mejía" <ieme...@gmail.com> wrote:
>>>>>
>>>>> Hello Anton,
>>>>>
>>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>>> to tackle PCollections of GenericRecord/IndexedRecords? It seems Avro
>>>>> is a natural fit for this approach too.
>>>>>
>>>>> Regards,
>>>>> Ismaël
>>>>>
>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I want to highlight a couple of improvements to Beam SQL we have been
>>>>>> working on recently which are targeted at making the Beam SQL API
>>>>>> easier to use. Specifically, these features simplify the conversion of
>>>>>> Java Beans and JSON strings to Rows.
>>>>>>
>>>>>> Feel free to try this and send any bugs/comments/PRs my way.
>>>>>>
>>>>>> **Caveat: this is still work in progress, and has known bugs and
>>>>>> incomplete features, see below for details.**
>>>>>>
>>>>>> Background
>>>>>>
>>>>>> Beam SQL queries can only be applied to PCollection<Row>. This means
>>>>>> that users need to convert whatever PCollection elements they have to
>>>>>> Rows before querying them with SQL. This usually requires manually
>>>>>> creating a Schema and implementing a custom conversion
>>>>>> PTransform<PCollection<Element>, PCollection<Row>> (see the Beam SQL
>>>>>> Guide).
>>>>>>
>>>>>> The improvements described here are an attempt to reduce this overhead
>>>>>> for a few common cases, as a start.
>>>>>>
>>>>>> Status
>>>>>>
>>>>>> Introduced an InferredRowCoder to automatically generate Rows from
>>>>>> beans; removes the need to manually define a Schema and Row conversion
>>>>>> logic.
>>>>>> Introduced a JsonToRow transform to automatically parse JSON objects
>>>>>> to Rows.
>>>>>> Removes the need to manually implement conversion logic.
>>>>>> This is still experimental work in progress; APIs will likely change.
>>>>>> There are known bugs/unsolved problems.
>>>>>>
>>>>>> Java Beans
>>>>>>
>>>>>> Introduced a coder which facilitates Row generation from Java Beans.
>>>>>> Reduces the overhead to:
>>>>>>
>>>>>>> /** Some user-defined Java Bean */
>>>>>>> class JavaBeanObject implements Serializable {
>>>>>>>   String getName() { ... }
>>>>>>> }
>>>>>>>
>>>>>>> // Obtain the objects:
>>>>>>> PCollection<JavaBeanObject> javaBeans = ...;
>>>>>>>
>>>>>>> // Convert to Rows and apply a SQL query:
>>>>>>> PCollection<Row> queryResult =
>>>>>>>     javaBeans
>>>>>>>         .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
>>>>>>>         .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>>>
>>>>>> Notice, there is no more manual Schema definition or custom conversion
>>>>>> logic.
>>>>>>
>>>>>> Links
>>>>>>
>>>>>> example;
>>>>>> InferredRowCoder;
>>>>>> test;
>>>>>>
>>>>>> JSON
>>>>>>
>>>>>> Introduced a JsonToRow transform.
>>>>>> It is possible to query a
>>>>>> PCollection<String> that contains JSON objects like this:
>>>>>>
>>>>>>> // Assuming JSON objects look like this:
>>>>>>> // { "type" : "foo", "size" : 333 }
>>>>>>>
>>>>>>> // Define a Schema:
>>>>>>> Schema jsonSchema =
>>>>>>>     Schema
>>>>>>>         .builder()
>>>>>>>         .addStringField("type")
>>>>>>>         .addInt32Field("size")
>>>>>>>         .build();
>>>>>>>
>>>>>>> // Obtain a PCollection of the objects in JSON format:
>>>>>>> PCollection<String> jsonObjects = ...
>>>>>>>
>>>>>>> // Convert to Rows and apply a SQL query:
>>>>>>> PCollection<Row> queryResults =
>>>>>>>     jsonObjects
>>>>>>>         .apply(JsonToRow.withSchema(jsonSchema))
>>>>>>>         .apply(BeamSql.query(
>>>>>>>             "SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));
>>>>>>
>>>>>> Notice, the JSON to Row conversion is done by the JsonToRow transform.
>>>>>> It is currently required to supply a Schema.
>>>>>>
>>>>>> Links
>>>>>>
>>>>>> JsonToRow;
>>>>>> test/example;
>>>>>>
>>>>>> Going Forward
>>>>>>
>>>>>> fix bugs (BEAM-4163, BEAM-4161, ...);
>>>>>> implement more features (BEAM-4167, more types of objects);
>>>>>> wire this up with sources/sinks to further simplify the SQL API;
>>>>>>
>>>>>> Thank you,
>>>>>> Anton
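[Editor's note: conceptually, schema inference from a bean works by reading its getters via reflection. The following is a pure-JDK sketch of that idea only, not the actual InferredRowCoder implementation; the class names and the Map-based "row" are stand-ins for illustration.]

```java
import java.io.Serializable;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

public class BeanRowSketch {

    /** A user-defined bean, mirroring the JavaBeanObject in Anton's example. */
    public static class JavaBeanObject implements Serializable {
        private final String name;
        public JavaBeanObject(String name) { this.name = name; }
        public String getName() { return name; }
    }

    /**
     * Derive a field-name -> value map from the bean's getters,
     * roughly what schema inference does before building a real Row.
     */
    static Map<String, Object> toRow(Object bean) throws Exception {
        Map<String, Object> row = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().startsWith("get")
                    && m.getParameterCount() == 0
                    && !m.getName().equals("getClass")) {
                // "getName" -> field "name"
                String field = Character.toLowerCase(m.getName().charAt(3))
                        + m.getName().substring(4);
                row.put(field, m.invoke(bean));
            }
        }
        return row;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toRow(new JavaBeanObject("foo"))); // {name=foo}
    }
}
```

With this shape in mind, the appeal of InferredRowCoder in the thread is clear: the user declares only the bean, and the field names and types fall out of reflection instead of a hand-written Schema.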