jnturton commented on issue #2421: URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751937
Cong Luo wrote:

@paul-rogers Thanks for your knowledge base. On vectorization and Arrow, I may need to keep pouring cold water on everyone: we need to figure out what we actually need here, not just see Arrow as powerful and then repeat that it is powerful.

**SQL, NoSQL and NewSQL**

In my work I have been through all three of these database architectures. Initially we had performance bottlenecks, so we chose NoSQL databases, e.g. MongoDB, Elasticsearch and Redis. After entering the production environment, most developers were not good at using the APIs (clients) designed for these databases. The key issue is that developers do not have the patience to learn the advanced syntax of each NoSQL store. So we came back to SQL-style databases: NewSQL. But we found that NewSQL also has many bottlenecks (even when you think no further optimization is needed), and once you deploy a private cluster, you get complex operation and maintenance conditions and user complaints. The lesson here is that choosing the right tool is better than choosing one blindly.

**OLTP, OLAP and HTAP**

The row format is good for OLTP, because OLTP needs to insert / update / delete and fetch entire rows as quickly as possible. The columnar format is good for OLAP, because OLAP needs to analyze large amounts of data; to keep latency low, it must cut the volume of data read from disk and reduce network IO bottlenecks. What does HTAP look like? Consider the internal implementation of TiDB:

<img src="https://download.pingcap.com/images/docs/tiflash/tiflash-architecture.png" width="50%" height="50%">

```
TiFlash provides the columnar storage. It conducts real-time replication of data from the TiKV nodes at a low cost that does not block writes in TiKV. Meanwhile, it provides the same read consistency as TiKV and ensures that the latest data is read.
```

As is well known, database vendors prefer the row format for OLTP and the columnar format to support OLAP.
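To make the row-versus-columnar trade-off concrete, here is a minimal, hypothetical Java sketch (the class and field names are invented for illustration; they are not Drill, TiDB, or Arrow classes). The same three-column table is stored once as an array of records (the N-ary Storage Model, friendly to OLTP row access) and once as one primitive array per column (the Decomposition Storage Model, friendly to OLAP scans). An aggregate such as `SUM(price)` only has to touch one contiguous array in the columnar layout.

```java
// Hypothetical sketch: the same table in a row layout (NSM) and a
// columnar layout (DSM). Names are illustrative, not Drill classes.
public class LayoutSketch {

    // Row format: each record keeps all of its fields together,
    // which suits OLTP-style "fetch / update the whole row" access.
    static final class Row {
        final long id;
        final double price;
        final int qty;
        Row(long id, double price, int qty) {
            this.id = id; this.price = price; this.qty = qty;
        }
    }

    // Columnar format: one contiguous array per column,
    // which suits OLAP-style "scan one column of every row" access.
    static final class Columns {
        final long[] id;
        final double[] price;
        final int[] qty;
        Columns(int n) {
            id = new long[n]; price = new double[n]; qty = new int[n];
        }
    }

    // SUM(price) over the columnar layout reads only the price array;
    // the id and qty arrays are never pulled through the cache.
    static double sumPrice(Columns c) {
        double sum = 0;
        for (double p : c.price) {
            sum += p;
        }
        return sum;
    }

    public static void main(String[] args) {
        Columns c = new Columns(3);
        c.price[0] = 1.5; c.price[1] = 2.5; c.price[2] = 4.0;
        System.out.println(sumPrice(c)); // prints 8.0
    }
}
```

The same `SUM(price)` over the `Row[]` layout would drag every `id` and `qty` field through the cache alongside each `price`, which is exactly the cache-miss behavior discussed below.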
However, they are well aware that there is no perfect data format. So I agree with Paul that the row format is better for Drill.

**CPU cache misses**

- N-ary Storage Model (row-oriented)

<img src="https://user-images.githubusercontent.com/50079619/148023833-dcd92d2c-38ae-41fa-a80f-97496f50e647.png" width="50%" height="50%">

- Cache misses

<img src="https://user-images.githubusercontent.com/50079619/148022521-f623b105-9057-4286-8ee4-3a2231b8bc68.png" width="50%" height="50%">

The figure shows what happens in the CPU caches: a lot of irrelevant data is pulled into the cache, crowding out data that would otherwise have been reusable.

- Decomposition Storage Model (column-oriented)

<img src="https://user-images.githubusercontent.com/50079619/148023968-6f0db55b-63d1-448a-9165-2001a59aea01.png" width="50%" height="50%">

Reducing cache misses is one of the advantages of the columnar format, but note that Paul has explained that the columnar format also incurs cache misses (for computations driven by the SQL syntax). So we cannot expect maximum performance from Arrow either.

**Write amplification**

Drill should avoid becoming a database system, but it still has several key points on its read-write path:

- Drill supports spilling vector values to disk.
- Drill is an engine, not a database, so it cannot unify the initial source format (which is decided by the database vendor).
- The read-write path is complex:
  - (1) Read from a variety of data sources.
  - (2) Write to vector values, or spill to disk.
  - (3) Combine into row sets.
  - (4) Send to the client side.

As noted above, it is impossible to avoid reading and writing data multiple times within a query, and the write amplification increases latency. So there is a lot to optimize, not only the columnar format.
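The write-amplification argument can be put into rough numbers. Below is a hedged back-of-envelope sketch: every stage size is invented purely for illustration (none of these figures are measured from Drill, and `WriteAmplification` is a hypothetical class, not part of Drill's code). It only shows how copying data at each stage of a read-write path multiplies the total bytes written for one query.

```java
// Hypothetical back-of-envelope sketch of write amplification along a
// multi-stage read-write path. All stage sizes are invented examples.
public class WriteAmplification {

    // Total bytes written across the stages, divided by the input size,
    // gives a rough amplification factor for one query.
    static double amplification(long inputMb, long spilledMb, long clientMb) {
        long vectorsMb = inputMb;   // stage (2): source data rewritten into vectors
        long rowSetsMb = inputMb;   // stage (3): vectors combined into row sets
        long totalWrittenMb = vectorsMb + spilledMb + rowSetsMb + clientMb;
        return (double) totalWrittenMb / inputMb;
    }

    public static void main(String[] args) {
        // Invented example: 100 MB read from the source, half of the
        // vectors spilled to disk, 10 MB of results sent to the client.
        System.out.println(amplification(100, 50, 10)); // prints 2.6
    }
}
```

Even with these made-up numbers, the point stands: each extra copy on the path adds latency, so the path itself is worth optimizing regardless of whether the in-memory format is row-based or columnar.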