Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207436035
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Integrating Presto with CarbonData involves the following steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up the cluster to use CarbonData as a catalog, alongside the other catalogs provided by Presto.
    +
    +In this technical note, we first cover these two steps and then describe how to tune Presto for performance.
    +
    +## **Step 1: Presto Cluster Setup**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and rename it to `presto`:
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install `uuid` to generate a `node.id`:
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN`, and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The option `coordinator=true` indicates that this node is the coordinator, and `node-scheduler.include-coordinator=false` tells the coordinator not to do any computation work itself and to delegate it to the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM max heap size (`-Xmx` in `jvm.config`), though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +The two properties above should also satisfy the following relation:
    +If `query.max-memory-per-node=30GB`,
    +then `query.max-memory=<30GB * number of nodes>`.
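    +
    +For example, on a hypothetical cluster of 5 worker nodes, each allowed 30GB, the two properties would be:
    +
    +  ```
    +  query.max-memory-per-node=30GB
    +  query.max-memory=150GB
    +  ```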
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: The `jvm.config` and `node.properties` files are the same for all nodes (workers and the coordinator), except that each node must have a different `node.id` (generated by the `uuid` command).
    +
    +### **This completes the Presto cluster setup. The following additional steps are required to integrate with CarbonData:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in the `etc` directory of Presto on all nodes of the cluster, including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in the `plugin` directory of Presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +To run it as a background process:
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +
    +To run it in the foreground:
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to the carbondata catalog, use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to verify that the workers are connected:
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note**: Tables must be created and data must be loaded before executing queries, as CarbonData tables cannot be created from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When data can be stored as either Int or String (for example, table keys), using Int gives better performance.
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns with low cardinality should be created as dictionary columns. This improves query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- GROUP BY can become somewhat faster if the fields within the GROUP BY clause are carefully ordered from highest to lowest cardinality.
    +
    +- Aggregate a series of LIKE clauses into one single `regexp_like` clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    +```
    +
    +can be optimized by replacing the 4 LIKE clauses with a single `regexp_like` clause:
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    +```
    +
    +- Specify large tables first in the join clause.
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe an ‘Exceeded max memory xxGB’ error, it usually means the right-hand side table is too large. Presto does not perform automatic join reordering, so please make sure your large table precedes small tables in any join clause.
    +
    +**Note**: If you still see the memory issue, try a distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys, so the distributed join works even if the right-hand side table is large, but performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
    +
    +```
    +PrestoCli> -- set session distributed_join = 'true'
    +SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
    +```
    +
    +**Performance optimizations by using certain Configuration properties:**
    +- **Presto Properties (location: presto/etc/config.properties)**
    +
    +```
    +query.max-memory=210GB
    +```
    +Set this property according to the total RAM available in the cluster (i.e., the sum of all worker nodes' RAM).
    --- End diff --
    
    Total RAM available in the cluster(Sum of all nodes RAM memory)

