Hi product analytics leads,

Please make sure that the configuration file spark-jdbc-config.xml is added
to the product-analytics packs, especially if you're using the CarbonJDBC
provider. An example commit can be found at [1].

[1]
https://github.com/wso2/product-das/commit/4bdbf68833bd2bc8a20549eaf726873cacde468f
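
For context, this file carries the per-RDBMS information the connector needs,
such as parameterised query formats and data type mappings. A rough sketch of
its shape is below; the element and attribute names here are illustrative
only, so please refer to the committed file in [1] for the actual structure:

```xml
<!-- Illustrative sketch only: element and attribute names are hypothetical. -->
<spark-jdbc-configuration>
    <!-- One entry per supported RDBMS flavour (MySQL, Oracle, etc.) -->
    <database name="mysql">
        <!-- parameterised query formats, data type mappings, etc. -->
    </database>
</spark-jdbc-configuration>
```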

Thanks,

On 13 June 2016 at 17:37, Gokul Balakrishnan <[email protected]> wrote:

> Hi Anjana, Nirmal,
>
> The schema being mandatory is an architectural decision we've had to take.
> If I go into a bit more detail as to the reasons, Spark requires its own
> catalyst schema to be constructed when a relation is being created. In the
> previous implementation, this was achieved through dropping the target
> RDBMS table and recreating it in a format Spark understands. However, in
> the current implementation, we have removed the need for any DML operation
> during table creation, unless specifically requested.
>
> The issue with making this parameter optional is that we will have to
> again fall back to the earlier behaviour of the schema being inferred from
> the table metadata, if not specified. This will mean having to maintain a
> list of reverse mappings which will pollute the implementation. Moreover,
> we will have inconsistencies when certain table schemata are inferred while
> others are specified.
>
> Please note that this is not an API change nor is it a change in
> deployable artefacts: the user merely has to edit his/her DAS extensions
> (i.e. Spark scripts) if applicable. We will clearly point out the changes
> that need to be done in the DAS 3.1.0 migration guide.
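>
> For instance, a table definition in a Spark script would now declare its
> schema explicitly. A rough sketch (the datasource and table names here are
> placeholders, and the option names follow the descriptions further down
> this thread):
>
> ```sql
> -- Sketch only: "ANALYTICS_DB" and "sensor_readings" are hypothetical names.
> CREATE TEMPORARY TABLE sensorData
> USING CarbonJDBC
> OPTIONS (
>     dataSource "ANALYTICS_DB",
>     tableName "sensor_readings",
>     schema "sensor_id STRING -i, reading DOUBLE, ts LONG",
>     primaryKeys "sensor_id, ts"
> );
> ```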
>
> Thanks,
>
> On 13 June 2016 at 16:44, Anjana Fernando <[email protected]> wrote:
>
>> Hi,
>>
>> On Mon, Jun 13, 2016 at 12:23 PM, Nirmal Fernando <[email protected]>
>> wrote:
>>
>>>
>>> The "schema" option is required, and is used to specify the schema to be
>>>> utilised throughout the temporary table's lifetime. Here, the field types
>>>> used for the schema match what we have for the CarbonAnalytics provider
>>>> (i.e. not JDBC nor databridge), and correspond to Spark catalyst types.
>>>> Moreover, the optional "-i" keyword for a field if specified will create an
>>>> RDBMS index for that field.
>>>>
>>>
>>> Can't we make 'schema' optional as it was earlier? This introduces a
>>> backward incompatible change otherwise.
>>>
>>
>> The schema was optional before because, earlier, the user was required to
>> create the table beforehand, which was not desirable; for subsequent
>> "insert overwrite" statements, we would drop the table and try to
>> re-create it, and we didn't do a good job of it. So the current approach
>> was taken to make the way we create tables more consistent. Also, in the
>> new implementation, we need the schema to be known beforehand, to know
>> about its primary keys etc. and to carry out the operations properly. But
>> yeah, for the sake of backward compatibility, we can do a best-effort
>> implementation by looking up the table metadata through JDBC and trying to
>> figure out the table schema, mainly the primary keys, which is the
>> critical information we need. But this is not something we can always
>> expect from JDBC, since some DBMSs may not expose it properly. So it is
>> highly recommended to move to the new approach when you're using
>> CarbonJDBC, but we will try to do a best-effort implementation to retain
>> backward compatibility. @Gokul, please check on this.
>>
>> Cheers,
>> Anjana.
>>
>>
>>>
>>>> The "primaryKeys" option is not mandatory, and may be used to denote
>>>> unique key fields in the underlying RDBMS table. It is based on this option
>>>> that INSERT or UPSERT queries will be chosen when doing Spark INSERT INTO
>>>> queries, as explained above.
>>>>
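>>>> With primary keys declared, the two Spark statements behave differently,
>>>> roughly as follows (hypothetical table names):
>>>>
>>>> ```sql
>>>> -- Runs as UPSERT queries where rows conflict on the unique key fields:
>>>> INSERT INTO TABLE sensorData SELECT * FROM stagingData;
>>>> -- Clears the table contents (schema and indices are unaffected):
>>>> INSERT OVERWRITE TABLE sensorData SELECT * FROM stagingData;
>>>> ```
>>>>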
>>>> We're in the process of documenting the usage patterns of this provider
>>>> so that they can be better understood.
>>>>
>>>> Thanks,
>>>>
>>>> On 10 June 2016 at 15:16, Inosh Goonewardena <[email protected]> wrote:
>>>>
>>>>> Hi Gokul,
>>>>>
>>>>>
>>>>> On Fri, Jun 10, 2016 at 2:08 PM, Gokul Balakrishnan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In DAS 3.0.x, for interacting with relational databases directly from
>>>>>> Spark (i.e. bypassing the data access layer), we have hitherto been using
>>>>>> the JDBC connector that comes directly with Apache Spark (with added
>>>>>> support for Carbon datasources).
>>>>>>
>>>>>> This connector has had a number of issues that have been detrimental
>>>>>> to the user experience, including:
>>>>>>
>>>>>> - Having to create tables on the RDBMS prior to query execution
>>>>>> - Tables getting dropped and re-created with a Spark-dictated schema
>>>>>> during initialization
>>>>>> - No support for RDBMS unique keys
>>>>>> - Not being able to perform INSERT INTO queries on RDBMS tables which
>>>>>> have unique keys set; as a result, the user has had to depend upon
>>>>>> INSERT OVERWRITE, which clears the table, resulting in the loss of
>>>>>> historical data
>>>>>>
>>>>>> I have been working on overhauling this connector over the past
>>>>>> couple of weeks to address the above flaws and bring it up to scratch.
>>>>>> A new config file, which contains the relevant information for a
>>>>>> particular RDBMS flavour (such as parameterised query formats,
>>>>>> datatypes, etc.), has also been introduced. An overview of all the
>>>>>> improvements is as follows:
>>>>>>
>>>>>> - RDBMS tables will be created dynamically (based on the schema
>>>>>> provided by the user) if they don't exist already
>>>>>>
>>>>>
>>>>> What is the data type to be used for fields in the schema? Is it SQL
>>>>> types or data bridge data types? Could you please provide a sample
>>>>> create table query?
>>>>>
>>>>>
>>>>>> - Pre-existing tables will be appropriated for use without
>>>>>> dropping/recreating
>>>>>> - Recognition of primary keys and switching between INSERT/UPSERT
>>>>>> modes automatically during Spark's INSERT INTO calls
>>>>>> - Support for creating DB indices, based on an additional input
>>>>>> parameter
>>>>>> - Spark INSERT OVERWRITE calls can be used to clear the existing
>>>>>> table without its schema/index definitions being affected
>>>>>>
>>>>>> This initial implementation can be found at [1]. It's written mostly
>>>>>> in Scala.
>>>>>>
>>>>>> Initially, we've tested the connector against MySQL as part of the
>>>>>> first cut, and we will be testing against all DBs supported by DAS over 
>>>>>> the
>>>>>> following days. The connector is expected to be shipped with the DAS 
>>>>>> 3.1.0
>>>>>> release.
>>>>>>
>>>>>> Thoughts welcome.
>>>>>>
>>>>>> [1] https://github.com/wso2/carbon-analytics/pull/187
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Gokul Balakrishnan
>>>>>> Senior Software Engineer,
>>>>>> WSO2, Inc. http://wso2.com
>>>>>> M +94 77 5935 789 | +44 7563 570502
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Architecture mailing list
>>>>>> [email protected]
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>>
>>>>> Inosh Goonewardena
>>>>> Associate Technical Lead- WSO2 Inc.
>>>>> Mobile: +94779966317
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Gokul Balakrishnan
>>>> Senior Software Engineer,
>>>> WSO2, Inc. http://wso2.com
>>>> M +94 77 5935 789 | +44 7563 570502
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Team Lead - WSO2 Machine Learner
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>
>
>
> --
> Gokul Balakrishnan
> Senior Software Engineer,
> WSO2, Inc. http://wso2.com
> M +94 77 5935 789 | +44 7563 570502
>
>


-- 
Gokul Balakrishnan
Senior Software Engineer,
WSO2, Inc. http://wso2.com
M +94 77 5935 789 | +44 7563 570502
