Hi Niranda,

INSERT INTO syntax is available, but I can't insert arbitrary values without using a SELECT. This is for testing purposes.
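For anyone following along, a usage sketch of what is being discussed (the `schema` and `primaryKeys` options and the `-i` index keyword are confirmed below in the thread; the datasource name, table name, field names, and the other option keys are my assumptions, not confirmed API):

```sql
-- Illustrative only: "MY_DATASOURCE", "SENSORS", and the field names are
-- hypothetical; option keys other than "schema" and "primaryKeys" are
-- assumptions about the connector's API.
CREATE TEMPORARY TABLE sensors
USING CarbonJDBC
OPTIONS (
  dataSource  "MY_DATASOURCE",
  tableName   "SENSORS",
  schema      "id STRING -i, temperature DOUBLE",
  primaryKeys "id"
);

-- Spark 1.x SQL has no INSERT ... VALUES, so literal test rows have to be
-- driven through a SELECT over some existing single-row source:
INSERT INTO TABLE sensors
SELECT "sensor-1", 27.5 FROM someExistingTable LIMIT 1;
```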
Cheers~

On Fri, Jun 17, 2016 at 2:03 AM, Niranda Perera <[email protected]> wrote:

> Hi Dulitha,
>
> This is a new connector only. It does not affect Spark SQL queries (apart
> from the options you specify in the CREATE TEMPORARY TABLE queries). INSERT
> INTO was available in the previous CarbonJDBC connector as well.
>
> Best
>
> On Fri, Jun 17, 2016 at 12:47 AM, Dulitha Wijewantha <[email protected]> wrote:
>
>> Hi Gokul,
>> Will this allow us to perform INSERT INTO queries with sample data (not
>> from a table)? This would be useful in the dev phase.
>>
>> Cheers~
>>
>> On Tue, Jun 14, 2016 at 7:38 AM, Gokul Balakrishnan <[email protected]> wrote:
>>
>>> Hi product analytics leads,
>>>
>>> Please make sure that the configuration file spark-jdbc-config.xml is
>>> added to the product-analytics packs, especially if you're using the
>>> CarbonJDBC provider. An example commit may be found at [1].
>>>
>>> [1] https://github.com/wso2/product-das/commit/4bdbf68833bd2bc8a20549eaf726873cacde468f
>>>
>>> Thanks,
>>>
>>> On 13 June 2016 at 17:37, Gokul Balakrishnan <[email protected]> wrote:
>>>
>>>> Hi Anjana, Nirmal,
>>>>
>>>> The schema being mandatory is an architectural decision we've had to
>>>> take. To go into a bit more detail as to the reasons: Spark requires its
>>>> own catalyst schema to be constructed when a relation is created. In the
>>>> previous implementation, this was achieved by dropping the target RDBMS
>>>> table and recreating it in a format Spark understands. In the current
>>>> implementation, however, we have removed the need for any DDL operation
>>>> during table creation, unless specifically requested.
>>>>
>>>> The issue with making this parameter optional is that we would have to
>>>> fall back to the earlier behaviour of inferring the schema from the table
>>>> metadata when it is not specified. This would mean maintaining a list of
>>>> reverse mappings, which would pollute the implementation.
>>>> Moreover, we would have inconsistencies when certain table schemata are
>>>> inferred while others are explicitly specified.
>>>>
>>>> Please note that this is neither an API change nor a change in
>>>> deployable artefacts: the user merely has to edit his/her DAS extensions
>>>> (i.e. Spark scripts), if applicable. We will clearly point out the changes
>>>> that need to be made in the DAS 3.1.0 migration guide.
>>>>
>>>> Thanks,
>>>>
>>>> On 13 June 2016 at 16:44, Anjana Fernando <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On Mon, Jun 13, 2016 at 12:23 PM, Nirmal Fernando <[email protected]> wrote:
>>>>>
>>>>>>> The "schema" option is required, and is used to specify the schema to
>>>>>>> be used throughout the temporary table's lifetime. Here, the field
>>>>>>> types used for the schema match what we have for the CarbonAnalytics
>>>>>>> provider (i.e. not JDBC nor databridge), and correspond to Spark
>>>>>>> catalyst types. Moreover, the optional "-i" keyword, if specified for
>>>>>>> a field, will create an RDBMS index for that field.
>>>>>>
>>>>>> Can't we make 'schema' optional as it was earlier? Otherwise this
>>>>>> introduces a backward-incompatible change.
>>>>>
>>>>> The schema was optional before because the user was mandated to create
>>>>> the table beforehand, which was not desirable; for subsequent INSERT
>>>>> OVERWRITE statements, the connector dropped the table and tried to
>>>>> re-create it, and didn't do a good job of it. So this approach was taken
>>>>> to make the way we create tables more consistent. Also, in the new
>>>>> implementation, we need the schema to be known beforehand, to know about
>>>>> its primary keys etc. and carry out the operations properly.
>>>>> That said, for the sake of backward compatibility, we can do a
>>>>> best-effort implementation: look up the table metadata via JDBC and try
>>>>> to figure out the table schema, mainly the primary keys, which is the
>>>>> critical information we need. But this is not something we can always
>>>>> rely on, since some DBMSs may not expose it properly through JDBC. So it
>>>>> is highly recommended to move to the new approach when you're using
>>>>> CarbonJDBC, but we will attempt a best-effort implementation to retain
>>>>> backward compatibility. @Gokul, please check on this.
>>>>>
>>>>> Cheers,
>>>>> Anjana.
>>>>>
>>>>>>> The "primaryKeys" option is not mandatory, and may be used to denote
>>>>>>> unique key fields in the underlying RDBMS table. It is based on this
>>>>>>> option that INSERT or UPSERT queries will be chosen when executing
>>>>>>> Spark INSERT INTO queries, as explained above.
>>>>>>>
>>>>>>> We're in the process of documenting the usage patterns of this
>>>>>>> provider so that they can be better understood.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On 10 June 2016 at 15:16, Inosh Goonewardena <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Gokul,
>>>>>>>>
>>>>>>>> On Fri, Jun 10, 2016 at 2:08 PM, Gokul Balakrishnan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> In DAS 3.0.x, for interacting with relational databases directly
>>>>>>>>> from Spark (i.e. bypassing the data access layer), we have hitherto
>>>>>>>>> been using the JDBC connector that comes with Apache Spark (with
>>>>>>>>> added support for Carbon datasources).
>>>>>>>>>
>>>>>>>>> This connector has contained many issues that have been
>>>>>>>>> detrimental to a proper user experience, including:
>>>>>>>>>
>>>>>>>>> - Having to create tables on the RDBMS beforehand, prior to query
>>>>>>>>> execution
>>>>>>>>> - Tables getting dropped and re-created with a Spark-dictated
>>>>>>>>> schema during initialisation
>>>>>>>>> - No support for RDBMS unique keys
>>>>>>>>> - Not being able to perform INSERT INTO queries on RDBMS tables
>>>>>>>>> which have unique keys set; as a result, the user has to depend on
>>>>>>>>> INSERT OVERWRITE, which clears the table and thus loses historical
>>>>>>>>> data
>>>>>>>>>
>>>>>>>>> I have been working on overhauling this connector over the past
>>>>>>>>> couple of weeks to address the above flaws and bring it up to
>>>>>>>>> scratch. A new config file, which contains the relevant information
>>>>>>>>> for a particular RDBMS flavour (such as parameterised query formats,
>>>>>>>>> datatypes etc.), has also been introduced. An overview of the
>>>>>>>>> improvements is as follows:
>>>>>>>>>
>>>>>>>>> - RDBMS tables will be created dynamically (based on the schema
>>>>>>>>> provided by the user) if they don't already exist
>>>>>>>>
>>>>>>>> What are the data types to be used for fields in the schema? Are
>>>>>>>> they SQL types or data bridge data types? Could you please provide a
>>>>>>>> sample CREATE TABLE query?
>>>>>>>>
>>>>>>>>> - Pre-existing tables will be used as they are, without being
>>>>>>>>> dropped/recreated
>>>>>>>>> - Primary keys are recognised, and INSERT/UPSERT modes are switched
>>>>>>>>> between automatically during Spark's INSERT INTO calls
>>>>>>>>> - Support for creating DB indices, based on an additional input
>>>>>>>>> parameter
>>>>>>>>> - Spark INSERT OVERWRITE calls can be used to clear the existing
>>>>>>>>> table without affecting existing schema/index definitions.
>>>>>>>>>
>>>>>>>>> This initial implementation can be found at [1]. It's written
>>>>>>>>> mostly in Scala.
>>>>>>>>>
>>>>>>>>> We've tested the connector against MySQL as part of the first cut,
>>>>>>>>> and we will be testing it against all DBs supported by DAS over the
>>>>>>>>> following days. The connector is expected to ship with the DAS 3.1.0
>>>>>>>>> release.
>>>>>>>>>
>>>>>>>>> Thoughts welcome.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/wso2/carbon-analytics/pull/187
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Gokul Balakrishnan
>>>>>>>>> Senior Software Engineer,
>>>>>>>>> WSO2, Inc. http://wso2.com
>>>>>>>>> M +94 77 5935 789 | +44 7563 570502
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Architecture mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>>
>>>>>>>> Inosh Goonewardena
>>>>>>>> Associate Technical Lead - WSO2 Inc.
>>>>>>>> Mobile: +94779966317
>>>>>>
>>>>>> --
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Team Lead - WSO2 Machine Learner
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>> --
>>>>> Anjana Fernando
>>>>> Senior Technical Lead
>>>>> WSO2 Inc. | http://wso2.com
>>>>> lean . enterprise . middleware
>>
>> --
>> Dulitha Wijewantha (Chan)
>> Software Engineer - Mobile Development
>> WSO2 Inc
>> Lean.Enterprise.Middleware
>> ~Email: [email protected]
>> ~Mobile: +94712112165
>> ~Website: http://dulitha.me
>> ~Twitter: @dulitharw (https://twitter.com/dulitharw)
>> ~Github: @dulichan (https://github.com/dulichan)
>> ~SO: @chan (http://stackoverflow.com/users/813471/chan)
>
> --
> Niranda Perera
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 (https://twitter.com/N1R44)
> https://pythagoreanscript.wordpress.com/
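[Editor's note: to make the INSERT/UPSERT switching discussed in the thread above concrete, assuming a MySQL backend (the table and column names here are hypothetical), the generated RDBMS queries would presumably differ as follows depending on whether the `primaryKeys` option was set:]

```sql
-- Without "primaryKeys": a plain parameterised insert.
INSERT INTO SENSORS (id, temperature) VALUES (?, ?);

-- With primaryKeys "id": a MySQL-style upsert, so rows whose key already
-- exists are updated in place instead of failing the batch.
INSERT INTO SENSORS (id, temperature) VALUES (?, ?)
  ON DUPLICATE KEY UPDATE temperature = VALUES(temperature);
```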
