This is an automated email from the ASF dual-hosted git repository.

jerrypeng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pulsar.wiki.git


The following commit(s) were added to refs/heads/master by this push:
     new 4d73a26  Updated PIP 18: Pulsar SQL (markdown)
4d73a26 is described below

commit 4d73a26e0dfc3970cd84efa0616008d8455b275a
Author: Boyang Jerry Peng <[email protected]>
AuthorDate: Thu Jul 19 12:38:01 2018 -0700

    Updated PIP 18: Pulsar SQL (markdown)
---
 PIP-18:-Pulsar-SQL.md | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/PIP-18:-Pulsar-SQL.md b/PIP-18:-Pulsar-SQL.md
index 53691a3..50333c7 100644
--- a/PIP-18:-Pulsar-SQL.md
+++ b/PIP-18:-Pulsar-SQL.md
@@ -3,3 +3,93 @@
 * **Mailing List discussion**: 
 N/A
 * **Prototype**: https://github.com/jerrypeng/presto/tree/pulsar_connector
+
+## What are we trying to do?
+
+We are trying to create a method by which users can explore, in a natural manner, the data already stored within Pulsar topics. We believe the best way to accomplish this is to expose a SQL interface that allows users to query existing data within a Pulsar cluster.
+
+Just to be absolutely clear, the SQL we are proposing is for querying data already in Pulsar; we are not currently proposing the implementation of any sort of SQL on data streams.
+
+## Why are we doing this?
+
+Many users are interested in such a feature. For example, many users store large amounts of historical data in Pulsar for various purposes. Giving them the capability to query that data provides huge value. Today, users typically need to stream the data out of Pulsar and into another platform to do any sort of analysis, but with Pulsar SQL they can just use one platform.
+
+## How are we going to do it?
+
+With the implementation of a schema registry in Pulsar, data can be structured so that it can be easily mapped to tables that can be queried with SQL. We plan on using Presto (https://prestodb.io/) as the backbone of Pulsar SQL. A connector can be implemented using the Presto connector SPI that allows Presto to ingest data from Pulsar and query it using Presto's existing SQL framework.
+
+The schema registry will be used to generate the structure of the tables used in Presto. Presto workers will load data directly from bookies through a read-only managed-ledger interface, so that we get many-to-many read throughput and avoid impacting brokers with read activity.
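+
+To make that read path concrete, here is a minimal sketch of how a worker might read entries through a read-only managed-ledger interface. It assumes a read-only cursor API on the managed-ledger factory (`ManagedLedgerFactory#openReadOnlyCursor`); the topic name and ZooKeeper address are illustrative, and this is not the final connector code.
+
+```java
+import java.util.List;
+
+import org.apache.bookkeeper.conf.ClientConfiguration;
+import org.apache.bookkeeper.mledger.Entry;
+import org.apache.bookkeeper.mledger.ManagedLedgerConfig;
+import org.apache.bookkeeper.mledger.ManagedLedgerFactory;
+import org.apache.bookkeeper.mledger.ReadOnlyCursor;
+import org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl;
+import org.apache.bookkeeper.mledger.impl.PositionImpl;
+
+public class ReadOnlySegmentReader {
+    public static void main(String[] args) throws Exception {
+        // Connect to the same ZooKeeper/BookKeeper ensemble the brokers use;
+        // no broker is involved, so consumer traffic is unaffected.
+        ClientConfiguration bkConf = new ClientConfiguration().setZkServers("localhost:2181");
+        ManagedLedgerFactory factory = new ManagedLedgerFactoryImpl(bkConf);
+
+        // A topic's managed ledger is named tenant/namespace/domain/topic.
+        ReadOnlyCursor cursor = factory.openReadOnlyCursor(
+                "public/default/persistent/my-topic",
+                PositionImpl.earliest,
+                new ManagedLedgerConfig());
+
+        while (cursor.hasMoreEntries()) {
+            // A real worker would read only the entry range assigned to its split.
+            List<Entry> entries = cursor.readEntries(100);
+            for (Entry entry : entries) {
+                // Parse message metadata and deserialize the payload here.
+                entry.release();
+            }
+        }
+        cursor.close();
+        factory.shutdown();
+    }
+}
+```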
+
+Thus, Pulsar will be queried for metadata concerning topics and schemas, and from that metadata we will go directly to the bookies to load and deserialize the data.
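+
+As a hedged illustration of that metadata step, the sketch below uses the Pulsar Java admin client to fetch a topic's schema and per-ledger entry counts; the connector could equally call the underlying REST endpoints directly, and the topic name is illustrative.
+
+```java
+import org.apache.pulsar.client.admin.PulsarAdmin;
+import org.apache.pulsar.common.schema.SchemaInfo;
+
+public class TopicMetadataLookup {
+    public static void main(String[] args) throws Exception {
+        // Admin client pointed at a broker's HTTP service.
+        PulsarAdmin admin = PulsarAdmin.builder()
+                .serviceHttpUrl("http://localhost:8080")
+                .build();
+
+        String topic = "persistent://public/default/my-topic";
+
+        // Schema registry lookup: this drives the generated table's columns.
+        SchemaInfo schema = admin.schemas().getSchemaInfo(topic);
+        System.out.println("schema type: " + schema.getType());
+
+        // Internal stats expose per-ledger entry counts, which a planner can
+        // use to size splits before reading directly from the bookies.
+        admin.topics().getInternalStats(topic).ledgers
+                .forEach(l -> System.out.println("ledger " + l.ledgerId + ": " + l.entries + " entries"));
+
+        admin.close();
+    }
+}
+```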
+
+## Goals
+
+* Allow users to submit SQL queries using a Pulsar CLI
+* Throttling
+       * Maintain SLAs for consumer reads alongside SQL reads
+* Resource Isolation
+       * Read priorities at the BookKeeper level - bookies should be able to support different classes of read requests and prioritize regular consumers vs. read requests coming from SQL queries
+       * Queries from one user or group should not overly impact queries from other users or groups
+* We would like to target two types of deployments:
+       1. Embedded with the Pulsar cluster
+               * Run the Presto coordinator and workers as part of the existing function worker service.
+       2. Allow existing Presto clusters to be used to query a Pulsar cluster (see the query sketch below)
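+
+For the second deployment target, a user could point any standard Presto client at the cluster. Below is a minimal, hypothetical example using Presto's JDBC driver; the `pulsar` catalog name, the namespace-as-schema layout, the port, and the topic/table names are assumptions for illustration, not a finalized interface.
+
+```java
+import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.ResultSet;
+import java.sql.Statement;
+
+public class PulsarSqlQuery {
+    public static void main(String[] args) throws Exception {
+        // Presto coordinator HTTP endpoint; the "pulsar" catalog name is assumed.
+        String url = "jdbc:presto://localhost:8081/pulsar";
+
+        try (Connection conn = DriverManager.getConnection(url, "test-user", null);
+             Statement stmt = conn.createStatement();
+             // Topics would appear as tables inside their namespace's schema.
+             ResultSet rs = stmt.executeQuery(
+                     "SELECT * FROM pulsar.\"public/default\".\"my-topic\" LIMIT 10")) {
+            while (rs.next()) {
+                System.out.println(rs.getString(1));
+            }
+        }
+    }
+}
+```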
+
+## Implementation
+
+### Presto-Pulsar connector
+
+Presto has an SPI system for adding connectors. We need to create a connector able to perform the following tasks:
+* Discover schema for a particular topic
+       * This will be done through the Pulsar Schema REST API
+* Discover the number of entries stored in a topic and break down the entries across multiple workers (see the split sketch after this list)
+* On each worker, we need to fetch the entries from bookies
+       * Use the managed ledger in read-only mode
+       * Each worker positions itself at a particular entry and reads a determined number of entries
+       * Parse the Pulsar message metadata and extract messages from the BookKeeper entries
+       * Deserialize entries, based on the schema definition, and feed the objects to Presto
+
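+To illustrate the split step called out above, here is a simplified, self-contained sketch of how a topic's entries might be broken into contiguous ranges that Presto can schedule across workers. `PulsarSplit` is a hypothetical descriptor, not the prototype's actual class.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Hypothetical split descriptor: a contiguous range of entries in one topic.
+class PulsarSplit {
+    final String topic;
+    final long firstEntry;  // inclusive start of the range
+    final long entryCount;  // how many entries this split covers
+
+    PulsarSplit(String topic, long firstEntry, long entryCount) {
+        this.topic = topic;
+        this.firstEntry = firstEntry;
+        this.entryCount = entryCount;
+    }
+}
+
+public class SplitPlanner {
+    // Break a topic's stored entries into roughly equal contiguous ranges,
+    // one per split, so workers can read disjoint slices in parallel.
+    static List<PulsarSplit> planSplits(String topic, long totalEntries, int numSplits) {
+        List<PulsarSplit> splits = new ArrayList<>();
+        long base = totalEntries / numSplits;
+        long remainder = totalEntries % numSplits;
+        long start = 0;
+        for (int i = 0; i < numSplits; i++) {
+            long count = base + (i < remainder ? 1 : 0);
+            if (count == 0) {
+                break; // more splits requested than entries available
+            }
+            splits.add(new PulsarSplit(topic, start, count));
+            start += count;
+        }
+        return splits;
+    }
+
+    public static void main(String[] args) {
+        // e.g. 10 entries over 4 workers -> ranges of 3, 3, 2, 2
+        planSplits("my-topic", 10, 4)
+                .forEach(s -> System.out.println(s.firstEntry + " +" + s.entryCount));
+    }
+}
+```
+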
+### Presto deployment
+
+Presto is composed of a coordinator and multiple worker nodes (https://prestodb.io/overview.html). We plan to have Presto run in embedded mode with the regular Pulsar scripts, for uniform operations.
+
+One of the nodes will be elected as coordinator (in the same way function workers already elect a leader). Since all requests to the Presto coordinator are made through HTTP/HTTPS, it would be possible to proxy/redirect requests made through the regular Pulsar service URL.
+
+Regarding worker nodes, one possibility is to co-locate them with Pulsar function worker nodes, which can in turn be co-located with brokers when deploying on a small single-tenant cluster.
+
+## Execution Plan
+
+Let’s break the implementation into multiple phases:
+
+### Phase 1 (target Pulsar 2.2)
+1. Implement a Pulsar connector for Presto
+       * Have basic queries working
+2. Work with all schemas
+       * JSON, Protobuf, Avro, String, etc.
+3. Basic integration tests
+4. Standalone Pulsar and standalone Presto
+5. Deployment
+       * Deploy alongside Pulsar / embedded with the Pulsar cluster
+6. Metrics
+       * Prometheus
+       * How long queries are taking, etc.
+       * Number of requests per second, etc.
+7. System columns
+       * Automatically generate columns such as publish time, event time, partitionId, etc. for each table
+8. Integrate with the Pulsar CLI
+9. Integrate with tiered storage
+       * Pass credentials, e.g. for S3
+
+
+### Phase 2
+
+1. Resource Isolation/Management
+2. Throttling
+3. Priorities
+4. Time-boxed queries
+5. When doing a query over a subset of the data based on publish time, we should be able to scan only the relevant data instead of everything stored in the topic
+6. Performance testing and optimization
+
+
+More to come...
\ No newline at end of file
