[
https://issues.apache.org/jira/browse/BEAM-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683748#comment-16683748
]
Gleb Kanterov commented on BEAM-5964:
-------------------------------------
[~iemejia] Makes sense. I agree with starting with writes; reads could
probably be covered by JdbcIO because the data sizes are small.
I wrote an integration test suite using Testcontainers. I'm not sure how it
will fit into CI, but it runs ClickHouse and ZooKeeper to test various failure
scenarios and ensure that inserts are atomic and idempotent when retried.
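
To make the retry property concrete, here is a minimal in-memory sketch of what
such a test checks. It does not use ClickHouse, ZooKeeper, or Testcontainers;
the class IdempotentInsertSketch and its methods are hypothetical names, and
deduplicating by a hash of the whole block is only an assumed simplification of
how replicated tables treat repeated inserts.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in for a table that deduplicates identical insert blocks. */
final class IdempotentInsertSketch {
    private final Set<String> seenBlockHashes = new HashSet<>();
    private final List<String> storedRows = new ArrayList<>();

    /** Stores the whole block or nothing; returns false if an identical block was already stored. */
    boolean insertBlock(List<String> rows) {
        String hash = hashOf(rows);
        if (!seenBlockHashes.add(hash)) {
            return false; // identical block seen before, so the retry has no effect
        }
        storedRows.addAll(rows);
        return true;
    }

    int rowCount() {
        return storedRows.size();
    }

    /** Hashes rows in order, so the same rows in the same order give the same hash. */
    static String hashOf(List<String> rows) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String row : rows) {
                digest.update(row.getBytes(StandardCharsets.UTF_8));
                digest.update((byte) '\n'); // separator between rows
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        IdempotentInsertSketch table = new IdempotentInsertSketch();
        List<String> batch = Arrays.asList("1,a", "2,b", "3,c");
        table.insertBlock(batch);
        table.insertBlock(batch); // simulated retry after a lost acknowledgement
        System.out.println(table.rowCount()); // prints 3
    }
}
```

The real suite exercises the same property against actual containers; the
sketch only shows why a retried batch must be byte-for-byte identical to the
original for deduplication to apply.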
> Add ClickHouseIO.Write
> ----------------------
>
> Key: BEAM-5964
> URL: https://issues.apache.org/jira/browse/BEAM-5964
> Project: Beam
> Issue Type: New Feature
> Components: io-ideas
> Reporter: Gleb Kanterov
> Assignee: Gleb Kanterov
> Priority: Major
>
> h3. Motivation
> ClickHouse is an open-source columnar DBMS for OLAP. It allows analysis of
> data that is updated in real time. The project was released as open-source
> software under the Apache License 2.0 in June 2016.
> h3. Design and implementation
> 1. Support only writes; reads aren't useful because ClickHouse is designed
> for OLAP queries
> 2. Write in batches and rely on the idempotent and atomic inserts supported
> by replicated tables in ClickHouse
> 3. Implement ClickHouseIO.Write as PTransform<PCollection<Row>, PDone>
> 4. Rely on the logic for casting rows between schemas from BEAM-5918 rather
> than putting it in ClickHouseIO.Write
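
As an illustration of point 2, the sketch below splits rows into fixed-size
batches in a stable order, so that re-running it over the same input yields the
same batches. BatchingSketch and toBatches are hypothetical names, not Beam or
ClickHouse API, and whether a runner replays identical input on retry is an
assumption outside this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that groups rows into deterministic, order-preserving batches. */
final class BatchingSketch {

    /** Splits rows into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> toBatches(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // Copy the slice so each batch is independent of the input list.
            batches.add(new ArrayList<>(rows.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(toBatches(rows, 2)); // prints [[1, 2], [3, 4], [5]]
    }
}
```

Each batch would then be sent as a single insert, which is the unit that the
idempotency and atomicity guarantees in point 2 apply to.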
> h3. References
> [1] http://highscalability.com/blog/2017/9/18/evolution-of-data-structures-in-yandexmetrica.html
> [2] https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/
> [3] https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/