jnturton commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1004751937


   Cong Luo wrote:
   
   @paul-rogers Thanks for your knowledge base. For the vectorization and 
Arrow, I may need to keep pouring cold water for everyone. I think we need to 
figure out what we need here is, not just see it as powerful, then say it's 
really powerful.
   
   **SQL, NoSQL and NewSQL**
   In my work, I went through these three databases architecture. Initially, we 
had performance bottlenecks, so choose the NoSQL database, eg : mongo, es and 
redis. After entering the production environment, most developers were not good 
at using the API(clients) designed for these databases. The key is that 
developers do not have the patience to learn the advanced syntax of each 
database(NoSQL). So we come back to use the SQL-style database : NewSQL. But, 
we found that NewSQL also has many bottlenecks(once you don't think need any 
optimization), especially once you comes to deploy a private cluster, you'll 
get complex operation and maintenance conditions and user complaints.
   
   Here, the lesson is that choosing the right one is better than choosing it 
blindly.
   
   **OLTP, OLAP and HTAP**
   The row format is good for OLTP, because need to insert / update / delete 
and batch fetch entire rows as soon as possible. The columnar format is good 
for OLAP, because need to analyze large amounts of data. To keep low latency, 
need to filter data size from the disk and reduce network IO bottlenecks. What 
does HTAP look like? let's see that Internal implementation of TiDB :
   
   <img 
src="https://download.pingcap.com/images/docs/tiflash/tiflash-architecture.png"; 
width="50%" height="50%">
   
   ```
   TiFlash provides the columnar storage, it conducts real-time replication of 
data in the TiKV nodes 
   at a low cost that does not block writes in TiKV. Meanwhile, it provides the 
same read consistency 
   as in TiKV and ensures that the latest data is read.
   ```
   
   As is well known, database vendors prefer to use row format to apply in OLTP 
and columnar format to support OLAP. However, they are well aware that there is 
no perfect data format. So that, I agree with Paul that the row format is 
better for Drill.
   
   **CPU Cache miss**
   - N-ary Storage Model
   <img 
src="https://user-images.githubusercontent.com/50079619/148023833-dcd92d2c-38ae-41fa-a80f-97496f50e647.png";
 width="50%" height="50%">
   
   - Cache miss
   <img 
src="https://user-images.githubusercontent.com/50079619/148022521-f623b105-9057-4286-8ee4-3a2231b8bc68.png";
 width="50%" height="50%">
   
   The figure shows the processing of CPU caches, and we can see that a lot of 
invalid data is being filled into the cache, crowding out data that would 
otherwise have been reusable.
   
   - Decomposition Storage Mode
   <img 
src="https://user-images.githubusercontent.com/50079619/148023968-6f0db55b-63d1-448a-9165-2001a59aea01.png";
 width="50%" height="50%">
   
   Reduce cache-miss is one of the advantages of the columnar format, but note 
that Paul has explained that there will also be cache-miss(calculation based in 
SQL syntax) in the columnar format. So that, we cannot expect the maximum 
performance with Arrow.
   
   **Write amplification**
   Avoid becoming a database system, but Drill has several key points on the 
read-write path.
   - Drill support split vector value to disk.
   - Drill is an engine, isn't a database, unable to unify the initial source 
format(made in the database vendor).
   - Complex read-write path :
     - (1) Read from a variety of data sources.
     - (2) Write to vector value or split to disk.
     - (3) Combined to row sets.
     - (4) Client side.
   
   As noted above, Impossible to avoid read and write multiple times in a 
query, and the write amplification will increase latency. So that, there's a 
lot to optimize, not only the columnar format.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to