Nice, this should bring a great performance improvement for SQL. Thanks for
your work!

On Thu, Nov 28, 2019 at 6:33 AM Kenneth Knowles <[email protected]> wrote:

> Nice! Thanks for the very thorough summary. I think this will be a really
> good thing for Beam. Most of the IO sources are very highly optimized for
> querying and will do it more efficiently than the Beam runner when the
> structure of the query matches. I'm really excited to see the performance
> measurements.
>
> A have a thought: your update did not mention a few extensions that we
> might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be
> about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically
> generalized schema-aware files as I understand it). Are these things you
> are thinking about doing, or would these be Jiras that could potentially be
> tagged "starter"? They seem complex but maybe your framework will make it
> feasible for someone with slightly less experience to implement new
> versions of what you have already finished?
>
> Kenn
>
> On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <[email protected]>
> wrote:
>
>> Hello everyone!
>>
>> I have been working on a push-down feature and would like to give a brief
>> update on what is done and is still under works.
>>
>> *Things that are done*:
>> General API for SQL IOs to provide information about what
>> filters/projects they support [1]:
>> - *Filter* can be unsupported, supported with field reordering, and
>> supported without field reordering.
>> - *Predicate* is broken down into a conjunctive normal form (CNF) and
>> passed to a validator class to check what parts are supported or
>> unsupported by an IO.
>>
>> A Calcite rule [2] that checks for push-down support, constructs a new IO
>> source Rel [3] with pushed-down projects and filters when applicable, and
>> preserves unsupported filters/projects.
>>
>> BigQuery should perform push-down when running queries in DIRECT_READ
>> method [4].
>>
>> MongoDB project push-down support is in a PR [5] and predicate support
>> will be added soon.
>>
>>
>> *Things that are in progress:*
>> Documenting how developers can enable push-down for IOs that support it.
>>
>> Documenting certain limitation for BigQuery push-down (ex: comparing
>> values of 2 columns is not supported at the moment, so it is being
>> preserved in a Calc).
>>
>> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
>> have a gRPC message limit set to ~11MB, which may cause some pipelies to
>> break when reading from a table with rows larger than the limit.
>>
>> Adding some sort of performance tests to run continuously to
>> measure speed-up and detect regressions.
>>
>> Deciding how cost should be computed for the IO source Rel with push-down
>> [6]. Right now the following formula is used: cost of an IO without
>> push-down minus the normalized (between 0.0 and 1.0) benefit of a performed
>> push-down.
>> The challenge here is to make the change to the cost small enough to not
>> break join reordering, but large enough to make the optimizer favor
>> pushed-down IO.
>>
>>
>> If you have any suggestions/questions/concerns I would love to hear them.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
>> [3]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
>> [4]
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
>> [5] https://github.com/apache/beam/pull/10095
>> [6] https://github.com/apache/beam/pull/10060
>>
>> --
>> Kirill
>>
>

Reply via email to