Hi all,

This is to inform you that we have made a small change to the existing schema-setting behavior in DAS when using the CarbonAnalytics connector in Spark SQL.
Let me clarify the approaches here.

*Previous approach*

Assume that there is a table corresponding to a stream 'abcd' with the schema 'a int, b int, c int, d int'. The following queries were available:

1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd");
--> Infers the schema from the DAL (data access layer).

2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int");
--> This schema and the existing schema *will be merged and set in the DAL and in Spark*.

3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *e int*");
--> Because of the schema merge, this is also supported (to define a new field).

4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *_timestamp long*");
--> Allows the timestamp to be used in queries.

Implications: Because of the merge approach in query #3, the final order of the merged schema (which was set in the DAL) was not definite.
Ex: (a, b, c, d) merge (a, b, c, d, e) --> (a, b, c, d, e)
BUT (a, b, c, d) merge (e, d, c) --> (a, b, e, d, c)

This resulted in an issue where we had to put aliases on each field in INSERT statements.
Ex: INSERT INTO TABLE test SELECT 1, 2, 3, 4, 5; could result in a=1, b=2, ..., d=5 OR a=1, b=2, e=3, d=4, c=5, depending on the merge.
So, we had to use aliases: INSERT INTO TABLE test SELECT 1 as a, 2 as b, 3 as c, 4 as d, 5 as e;

Because of this undefined ordering of the merged schema, we also had to fix the position of the special field "_timestamp": it was always placed as the last element in the merged schema.

*New approach*

In the new approach, we have separated the schema in Spark from the schema in the DAL. Now, when a user explicitly mentions a schema, the merged schema will be set in the DAL while the given schema will be used in Spark. Using the same example as before:
1. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd");
--> No change. Infers the schema from the DAL (data access layer).

2. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int");
--> This schema and the existing schema will be merged and set in the DAL *only*. The given schema will be used in Spark.

3. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *e int*");
--> The merged schema will be set in the DAL. The given schema will be used in Spark.

4. CREATE TEMPORARY TABLE test USING CarbonAnalytics OPTIONS (tableName "abcd", schema "a int, b int, c int, d int, *_timestamp long*");
--> Allows the timestamp to be used in queries.

So, there is now no ambiguity in the schema setting. If you set a schema in Spark SQL as 'a int, b int, c int, d int', that will be the final schema in the Spark runtime.

This change should not conflict with the current samples and analytics4x implementations. Just wanted to keep you all informed.

Best
--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/
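P.S. For anyone who wants to see the ordering problem concretely, here is a minimal Python sketch. This is hypothetical illustration code, not the actual DAS merge implementation: it just shows that two equally plausible merge strategies yield different column orders, which is why positional INSERTs were unsafe under the previous approach.

```python
# Hypothetical sketch (not DAS code): why an order-ambiguous schema merge
# forced aliases in INSERT statements under the previous approach.

def merge_existing_first(existing, given):
    # One plausible merge: keep the existing field order, append new fields.
    return list(existing) + [f for f in given if f not in existing]

def merge_given_first(existing, given):
    # Another plausible merge: keep the given field order, append the rest.
    return list(given) + [f for f in existing if f not in given]

existing = ["a", "b", "c", "d"]   # schema already stored in the DAL
given = ["e", "d", "c"]           # schema supplied in the OPTIONS clause

print(merge_existing_first(existing, given))  # ['a', 'b', 'c', 'd', 'e']
print(merge_given_first(existing, given))     # ['e', 'd', 'c', 'a', 'b']

# Since the merged order was not defined, a positional
#   INSERT INTO TABLE test SELECT 1, 2, 3, 4, 5;
# could bind values to different columns depending on the merge, hence the
# mandatory aliases (SELECT 1 as a, 2 as b, ...). Under the new approach,
# Spark uses the given schema verbatim, so the column order is definite.
```

The new approach sidesteps this entirely: the merge still happens, but its result only ever reaches the DAL, never the Spark runtime.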
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
