Hi all,

This is to inform you that we have made a small change to the existing schema-setting behavior in DAS when using the CarbonAnalytics connector in Spark SQL.
Let me clarify the approaches here.

*Previous approach*

Assume that there is a table corresponding to a stream 'abcd' with the schema 'a int, b int, c int, d int'. The following queries were available:

1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd");
--> Infers the schema from the DAL (data access layer).

2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int");
--> This schema and the existing schema *will be merged and set in the DAL and in Spark*.

3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *e int*");
--> Because of the schema merge, this is also supported (to define a new field).

4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *_timestamp long*");
--> Allows the timestamp to be used in queries.

Implications: Because of the merge approach in query #3, the final order of the merged schema (which was set in the DAL) was not definite.
Ex: (a, b, c, d) merge (a, b, c, d, e) --> (a, b, c, d, e)
BUT (a, b, c, d) merge (e, d, c) --> (a, b, e, d, c)

This resulted in an issue where we had to put aliases on each field in INSERT statements.
Ex: INSERT INTO TABLE test SELECT 1, 2, 3, 4, 5; could result in a=1, b=2, ..., d=5 OR a=1, b=2, e=3, d=4, c=5, depending on the merge.
So, we had to use aliases: INSERT INTO TABLE test SELECT 1 as a, 2 as b, 3 as c, 4 as d, 5 as e;

Because of this undefined ordering of the merged schema, we also had to fix the position of the special field "_timestamp": it was always placed as the last element in the merged schema.

*New approach*

In the new approach, we have separated the schema in Spark from the schema in the DAL. Now, when a user explicitly mentions a schema, the merged schema will be set in the DAL while the given schema will be used in Spark. Using the same example as before:
1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd");
--> No change. Infers the schema from the DAL (data access layer).

2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int");
--> This schema and the existing schema will be merged and set in the DAL *only*. The given schema will be used in Spark.

3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *e int*");
--> The merged schema will be set in the DAL. The given schema will be used in Spark.

4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *_timestamp long*");
--> Allows the timestamp to be used in queries.

So, there is now no ambiguity in the schema setting. If you set a schema in Spark SQL as 'a int, b int, c int, d int', that will be the final schema in the Spark runtime.

This change should not conflict with the current samples and analytics4x implementations. Just wanted to keep you all informed.

Best
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
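P.S. For anyone who wants to see the ordering problem concretely, here is a minimal Python sketch. This is hypothetical illustration code, not the actual DAS merge implementation: it just shows that two equally plausible merge strategies yield different column orders, which is why positional INSERTs were unsafe under the previous approach.

```python
# Hypothetical sketch (not DAS code): why an order-ambiguous schema merge
# forced aliases in INSERT statements under the previous approach.

def merge_existing_first(existing, given):
    # One plausible merge: keep the existing field order, append new fields.
    return list(existing) + [f for f in given if f not in existing]

def merge_given_first(existing, given):
    # Another plausible merge: keep the given field order, append the rest.
    return list(given) + [f for f in existing if f not in given]

existing = ["a", "b", "c", "d"]   # schema already stored in the DAL
given = ["e", "d", "c"]           # schema supplied in the OPTIONS clause

print(merge_existing_first(existing, given))  # ['a', 'b', 'c', 'd', 'e']
print(merge_given_first(existing, given))     # ['e', 'd', 'c', 'a', 'b']

# Since the merged order was not defined, a positional
#   INSERT INTO TABLE test SELECT 1, 2, 3, 4, 5;
# could bind values to different columns depending on the merge, hence the
# mandatory aliases (SELECT 1 as a, 2 as b, ...). Under the new approach,
# Spark uses the given schema verbatim, so the column order is definite.
```

The new approach sidesteps this entirely: the merge still happens, but its result only ever reaches the DAL, never the Spark runtime.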
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
