jnturton commented on issue #2421: URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751937
Cong Luo wrote:

@paul-rogers Thanks for your knowledge base. On vectorization and Arrow, I may need to keep pouring cold water on everyone: we need to figure out what we actually need here, not just see Arrow as powerful and then repeat that it is powerful.

**SQL, NoSQL and NewSQL**

In my work I have been through all three of these database architectures. Initially we had performance bottlenecks, so we chose NoSQL databases, e.g. MongoDB, Elasticsearch and Redis. After entering the production environment, most developers were not good at using the APIs (clients) designed for these databases. The key issue is that developers do not have the patience to learn the advanced syntax of each NoSQL store. So we came back to SQL-style databases: NewSQL. But we found that NewSQL also has many bottlenecks (even when you think no further optimization is needed), and once you deploy a private cluster, you get complex operation and maintenance conditions and user complaints. The lesson here is that choosing the right tool is better than choosing one blindly.

**OLTP, OLAP and HTAP**

The row format is good for OLTP, because OLTP needs to insert / update / delete and fetch entire rows as quickly as possible. The columnar format is good for OLAP, because OLAP needs to analyze large amounts of data; to keep latency low, it must cut the volume of data read from disk and reduce network IO bottlenecks. What does HTAP look like? Consider the internal implementation of TiDB:

<img src="https://download.pingcap.com/images/docs/tiflash/tiflash-architecture.png" width="50%" height="50%">

```
TiFlash provides the columnar storage. It conducts real-time replication of data from the TiKV nodes at a low cost that does not block writes in TiKV. Meanwhile, it provides the same read consistency as TiKV and ensures that the latest data is read.
```

As is well known, database vendors prefer the row format for OLTP and the columnar format to support OLAP.
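To make the row-versus-columnar trade-off concrete, here is a minimal, hypothetical Java sketch (the class and field names are invented for illustration; they are not Drill, TiDB, or Arrow classes). The same three-column table is stored once as an array of records (the N-ary Storage Model, friendly to OLTP row access) and once as one primitive array per column (the Decomposition Storage Model, friendly to OLAP scans). An aggregate such as `SUM(price)` only has to touch one contiguous array in the columnar layout.

```java
// Hypothetical sketch: the same table in a row layout (NSM) and a
// columnar layout (DSM). Names are illustrative, not Drill classes.
public class LayoutSketch {

    // Row format: each record keeps all of its fields together,
    // which suits OLTP-style "fetch / update the whole row" access.
    static final class Row {
        final long id;
        final double price;
        final int qty;
        Row(long id, double price, int qty) {
            this.id = id; this.price = price; this.qty = qty;
        }
    }

    // Columnar format: one contiguous array per column,
    // which suits OLAP-style "scan one column of every row" access.
    static final class Columns {
        final long[] id;
        final double[] price;
        final int[] qty;
        Columns(int n) {
            id = new long[n]; price = new double[n]; qty = new int[n];
        }
    }

    // SUM(price) over the columnar layout reads only the price array;
    // the id and qty arrays are never pulled through the cache.
    static double sumPrice(Columns c) {
        double sum = 0;
        for (double p : c.price) {
            sum += p;
        }
        return sum;
    }

    public static void main(String[] args) {
        Columns c = new Columns(3);
        c.price[0] = 1.5; c.price[1] = 2.5; c.price[2] = 4.0;
        System.out.println(sumPrice(c)); // prints 8.0
    }
}
```

The same `SUM(price)` over the `Row[]` layout would drag every `id` and `qty` field through the cache alongside each `price`, which is exactly the cache-miss behavior discussed below.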
However, they are well aware that there is no perfect data format. So I agree with Paul that the row format is better for Drill.

**CPU cache misses**

- N-ary Storage Model (row-oriented)

<img src="https://user-images.githubusercontent.com/50079619/148023833-dcd92d2c-38ae-41fa-a80f-97496f50e647.png" width="50%" height="50%">

- Cache misses

<img src="https://user-images.githubusercontent.com/50079619/148022521-f623b105-9057-4286-8ee4-3a2231b8bc68.png" width="50%" height="50%">

The figure shows what happens in the CPU caches: a lot of irrelevant data is pulled into the cache, crowding out data that would otherwise have been reusable.

- Decomposition Storage Model (column-oriented)

<img src="https://user-images.githubusercontent.com/50079619/148023968-6f0db55b-63d1-448a-9165-2001a59aea01.png" width="50%" height="50%">

Reducing cache misses is one of the advantages of the columnar format, but note that Paul has explained that the columnar format also incurs cache misses (for computations driven by the SQL syntax). So we cannot expect maximum performance from Arrow either.

**Write amplification**

Drill should avoid becoming a database system, but it still has several key points on its read-write path:

- Drill supports spilling vector values to disk.
- Drill is an engine, not a database, so it cannot unify the initial source format (which is decided by the database vendor).
- The read-write path is complex:
  - (1) Read from a variety of data sources.
  - (2) Write to vector values, or spill to disk.
  - (3) Combine into row sets.
  - (4) Send to the client side.

As noted above, it is impossible to avoid reading and writing data multiple times within a query, and the write amplification increases latency. So there is a lot to optimize, not only the columnar format.
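The write-amplification argument can be put into rough numbers. Below is a hedged back-of-envelope sketch: every stage size is invented purely for illustration (none of these figures are measured from Drill, and `WriteAmplification` is a hypothetical class, not part of Drill's code). It only shows how copying data at each stage of a read-write path multiplies the total bytes written for one query.

```java
// Hypothetical back-of-envelope sketch of write amplification along a
// multi-stage read-write path. All stage sizes are invented examples.
public class WriteAmplification {

    // Total bytes written across the stages, divided by the input size,
    // gives a rough amplification factor for one query.
    static double amplification(long inputMb, long spilledMb, long clientMb) {
        long vectorsMb = inputMb;   // stage (2): source data rewritten into vectors
        long rowSetsMb = inputMb;   // stage (3): vectors combined into row sets
        long totalWrittenMb = vectorsMb + spilledMb + rowSetsMb + clientMb;
        return (double) totalWrittenMb / inputMb;
    }

    public static void main(String[] args) {
        // Invented example: 100 MB read from the source, half of the
        // vectors spilled to disk, 10 MB of results sent to the client.
        System.out.println(amplification(100, 50, 10)); // prints 2.6
    }
}
```

Even with these made-up numbers, the point stands: each extra copy on the path adds latency, so the path itself is worth optimizing regardless of whether the in-memory format is row-based or columnar.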