Great!
It seems this pattern (COPY + parallel file read) is becoming a standard for
data warehouses. We are using something similar in the AWS Redshift PR (WIP);
for details: https://github.com/apache/beam/pull/10206

It may be worth it for all of us to check and see if we can converge the
implementations as much as possible to provide users a consistent experience.
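
For anyone who has not seen the pattern: one worker issues the COPY/UNLOAD
that exports the table to files on object storage, and the pipeline then
reads those files in parallel. A minimal sketch of the read side (the
export step and bucket path are illustrative, this is none of the actual
IOs):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.values.PCollection;

  public class CopyThenReadSketch {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create();

      // Step 1 (not shown): a single worker runs the export over JDBC,
      // e.g. COPY INTO 'gs://my-bucket/export/' FROM my_table, so the
      // warehouse writes the table out as files on object storage.

      // Step 2: read the exported files in parallel. File-based reads
      // split per file (and within files), so they are not limited to
      // one worker the way a plain JDBC read is.
      PCollection<String> rows =
          p.apply(TextIO.read().from("gs://my-bucket/export/*"));

      p.run().waitUntilFinish();
    }
  }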


On Tue, Mar 24, 2020 at 10:02 AM Elias Djurfeldt <elias.djurfe...@mirado.com>
wrote:

> Awesome job! I'm very interested in the cross-language support.
>
> Cheers,
>
> On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <chamik...@google.com>
> wrote:
>
>> Sounds great. Looks like the operation of the Snowflake source will be
>> similar to the BigQuery source (export files to GCS and read the files).
>> This will allow you to better parallelize reading (the current JDBC source
>> is limited to a single worker when reading).
>>
>> Seems like you already support initial splitting using files -
>> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
>> Probably also consider supporting dynamic work rebalancing when runners
>> support this through SDF.
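>>
>> For context, a generic file-based read gets this initial splitting for
>> free; given a Pipeline p, the sketch below (not the Snowflake code
>> itself) produces independently schedulable work per matched file, and
>> the file-reading step could be rebalanced dynamically once runners
>> support this through SDF:
>>
>>   import org.apache.beam.sdk.io.FileIO;
>>   import org.apache.beam.sdk.io.TextIO;
>>   import org.apache.beam.sdk.values.PCollection;
>>
>>   // Expand the pattern into individual files, then read each file;
>>   // each matched file becomes a separate unit of work.
>>   PCollection<String> lines = p
>>       .apply(FileIO.match().filepattern("gs://bucket/export/*"))
>>       .apply(FileIO.readMatches())
>>       .apply(TextIO.readFiles());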
>>
>> Thanks,
>> Cham
>>
>>
>>
>>
>> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <
>> aromanenko....@gmail.com> wrote:
>>
>>> Great! It is always welcome to have more IOs in Beam. I’d be happy to
>>> take a look at your PR once it is created.
>>>
>>> Just a couple of questions for now.
>>>
>>> 1) AFAIK, you can connect to Snowflake using the standard JDBC driver.
>>> Do you plan to compare the performance of this SnowflakeIO against Beam
>>> JdbcIO? (The baseline I have in mind is sketched after the questions.)
>>> 2) Are you going to support staging in other locations, like S3 and
>>> Azure?
>>> 3) Does “withSchema()” allow inferring the Snowflake schema into a Beam
>>> schema?
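>>>
>>> For question 1, the baseline I have in mind is the plain single-query
>>> JdbcIO read, roughly as below, given a Pipeline named pipeline (the
>>> driver settings, query and mapper are illustrative):
>>>
>>>   import org.apache.beam.sdk.coders.StringUtf8Coder;
>>>   import org.apache.beam.sdk.io.jdbc.JdbcIO;
>>>
>>>   pipeline.apply(JdbcIO.<String>read()
>>>       .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
>>>           "net.snowflake.client.jdbc.SnowflakeDriver",
>>>           "jdbc:snowflake://myaccount.snowflakecomputing.com"))
>>>       .withQuery("SELECT name FROM my_table")
>>>       .withRowMapper(resultSet -> resultSet.getString(1))
>>>       .withCoder(StringUtf8Coder.of()));
>>>
>>> Since this runs the whole query on a single worker, it would be
>>> interesting to see how much the COPY-based approach gains.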
>>>
>>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka.kucharc...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> My colleagues and I have developed a new Java connector for Snowflake
>>> that we would like to add to Beam.
>>>
>>> Snowflake is an analytic data warehouse provided as
>>> Software-as-a-Service (SaaS). It uses a new SQL database engine with a
>>> unique architecture designed for the cloud. For more details, please
>>> check [1] and [2].
>>>
>>> The proposed Snowflake IOs use the Snowflake JDBC library [3]. The IOs
>>> are batch write and batch read, and both use the Snowflake COPY [4]
>>> operation underneath. In both cases ParDos move files through a stage,
>>> from which data is loaded into (or unloaded from) the Snowflake table
>>> of choice using the COPY API. The currently supported stage is Google
>>> Cloud Storage [5].
>>>
>>> The diagram of how the Snowflake read IO works (the write operation
>>> works similarly, but in the opposite direction): [image not shown]
>>> Here is an Apache Beam fork [6] with the current work on the Snowflake
>>> IO.
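>>>
>>> To give a feel for the API, a batch read in the fork currently looks
>>> roughly like the sketch below (method names are approximate and may
>>> still change; please refer to the fork [6] for the actual code):
>>>
>>>   PCollection<String> rows = pipeline.apply(
>>>       SnowflakeIO.<String>read()
>>>           .withDataSourceConfiguration(dataSourceConfiguration)
>>>           .fromTable("MY_TABLE")
>>>           .withStagingBucketName("my-gcs-bucket")
>>>           .withIntegrationName("my_gcs_integration")
>>>           .withCsvMapper(parts -> String.join(",", parts))
>>>           .withCoder(StringUtf8Coder.of()));
>>>
>>> Underneath, COPY exports the table to files on the GCS stage [5], and
>>> those files are then read in parallel.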
>>>
>>> In the near future we would also like to add an IO for writing streams,
>>> which will use Snowpipe, Snowflake's mechanism for continuous loading
>>> [7]. We would also like to use cross-language transforms to provide
>>> Python connectors as well.
>>>
>>> We are open to all opinions and suggestions. If you have any questions
>>> or comments, please do not hesitate to post them.
>>>
>>> If there are no objections, I will create Jira tickets and share them
>>> in this thread.
>>>
>>> Cheers,
>>> Kasia
>>>
>>> [1] https://www.snowflake.com
>>> [2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>>> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>>> [5] https://cloud.google.com/storage
>>> [6] https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>> [7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>>>
>>>
>>>
>
> --
> Elias Djurfeldt
> Mirado Consulting
>
