Hi,

On Mon, Jun 13, 2016 at 12:23 PM, Nirmal Fernando <[email protected]> wrote:

>
> The "schema" option is required, and is used to specify the schema to be
>> utilised throughout the temporary table's lifetime. Here, the field types
>> used for the schema match what we have for the CarbonAnalytics provider
>> (i.e. not JDBC nor databridge), and correspond to Spark catalyst types.
>> Moreover, specifying the optional "-i" keyword for a field will create an
>> RDBMS index for that field.
>>
>
> Can't we make 'schema' optional as it was earlier? This introduces a
> backward incompatible change otherwise.
>

The schema was optional before because the earlier connector required the
user to create the table beforehand, which was not desirable: for subsequent
"INSERT OVERWRITE" statements, it would drop the table and try to re-create
it, and it did not do a good job of that. The current approach was taken to
make table creation more consistent. Also, the new implementation needs to
know the schema beforehand, mainly its primary keys, to perform the
operations properly. That said, for the sake of backward compatibility, we
can do a best-effort implementation by looking up the table schema through
JDBC metadata and trying to work out the primary keys, which is the critical
piece of information we need. However, this is not something we can always
rely on, since some DBMSs may not expose this information properly through
JDBC. So it is highly recommended to move to the new approach when using
CarbonJDBC, but we will try to do a best-effort implementation to retain
backward compatibility. @Gokul, please check on this.
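
For reference, creating a table with the new provider would look something
along these lines (the "dataSource" and "tableName" option names here are
illustrative; "schema", "-i" and "primaryKeys" are as described above, and
the field types correspond to Spark catalyst types):

```sql
-- Sketch: a Spark temporary table backed by an RDBMS table via CarbonJDBC.
-- "product_id INTEGER -i" declares the field and asks for an RDBMS index
-- on it; "primaryKeys" marks it as the unique key for INSERT/UPSERT logic.
CREATE TEMPORARY TABLE products
USING CarbonJDBC
OPTIONS (
    dataSource "SAMPLE_DB",
    tableName "PRODUCTS",
    schema "product_id INTEGER -i, name STRING, price DOUBLE",
    primaryKeys "product_id"
);
```

If the underlying RDBMS table does not exist, it would be created from this
schema; if it does, it would be used as-is.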

Cheers,
Anjana.


>
>> The "primaryKeys" option is not mandatory, and may be used to denote
>> unique key fields in the underlying RDBMS table. It is based on this option
>> that INSERT or UPSERT queries will be chosen when doing Spark INSERT INTO
>> queries, as explained above.
>>
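As a sketch of the behaviour described above (table and field names are
illustrative):

```sql
-- With "primaryKeys" set on the temporary table, rows whose key already
-- exists in the RDBMS table are updated in place (UPSERT); new keys are
-- inserted as usual.
INSERT INTO TABLE products SELECT * FROM incoming_products;

-- INSERT OVERWRITE clears the underlying table and reloads it, without
-- dropping the table or affecting its schema/index definitions.
INSERT OVERWRITE TABLE products SELECT * FROM incoming_products;
```
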
>> We're in the process of documenting the usage patterns of this provider
>> so that they can be better understood.
>>
>> Thanks,
>>
>> On 10 June 2016 at 15:16, Inosh Goonewardena <[email protected]> wrote:
>>
>>> Hi Gokul,
>>>
>>>
>>> On Fri, Jun 10, 2016 at 2:08 PM, Gokul Balakrishnan <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> In DAS 3.0.x, for interacting with relational databases directly from
>>>> Spark (i.e. bypassing the data access layer), we have hitherto been using
>>>> the JDBC connector that comes directly with Apache Spark (with added
>>>> support for Carbon datasources).
>>>>
>>>> This connector has had several issues that were detrimental to the
>>>> user experience, including:
>>>>
>>>> - Having to create tables on the RDBMS beforehand, prior to query
>>>> execution
>>>> - Tables getting dropped and re-created with a Spark-dictated schema
>>>> during initialization
>>>> - No support for RDBMS unique keys
>>>> - Not being able to perform INSERT INTO queries on RDBMS tables which
>>>> have unique keys set, and as a result the user having to depend upon INSERT
>>>> OVERWRITE which clears the table. This would result in the loss of
>>>> historical data
>>>>
>>>> I have been working on overhauling this connector over the past couple
>>>> of weeks to address the above flaws and bring it up to scratch. A new
>>>> config file, which contains the relevant information for each RDBMS
>>>> flavour (such as parameterised query formats, datatypes, etc.), has also
>>>> been introduced. An overview of the improvements is as follows;
>>>>
>>>> - RDBMS tables will be created dynamically (based on the schema
>>>> provided by the user) if they don't exist already
>>>>
>>>
>>> What are the data types to be used for fields in the schema? Are they SQL
>>> types or data bridge data types? Could you please provide a sample CREATE
>>> TABLE query?
>>>
>>>
>>>> - Pre-existing tables will be used as-is, without being dropped and
>>>> re-created
>>>> - Recognition of primary keys and switching between INSERT/UPSERT modes
>>>> automatically during Spark's INSERT INTO calls
>>>> - Support for creating DB indices, based on an additional input
>>>> parameter
>>>> - Spark INSERT OVERWRITE calls can be used to clear the existing table
>>>> without existing schema/index definitions being affected.
>>>>
>>>> This initial implementation can be found at [1]. It's written mostly in
>>>> Scala.
>>>>
>>>> Initially, we've tested the connector against MySQL as part of the
>>>> first cut, and we will be testing against all DBs supported by DAS over the
>>>> following days. The connector is expected to be shipped with the DAS 3.1.0
>>>> release.
>>>>
>>>> Thoughts welcome.
>>>>
>>>> [1] https://github.com/wso2/carbon-analytics/pull/187
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Gokul Balakrishnan
>>>> Senior Software Engineer,
>>>> WSO2, Inc. http://wso2.com
>>>> M +94 77 5935 789 | +44 7563 570502
>>>>
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>>
>>> Inosh Goonewardena
>>> Associate Technical Lead- WSO2 Inc.
>>> Mobile: +94779966317
>>>
>>>
>>>
>>
>>
>> --
>> Gokul Balakrishnan
>> Senior Software Engineer,
>> WSO2, Inc. http://wso2.com
>> M +94 77 5935 789 | +44 7563 570502
>>
>>
>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
