Hi Anjana, Nirmal,

The schema being mandatory is an architectural decision we've had to take. To go into a bit more detail on the reasons: Spark requires its own catalyst schema to be constructed when a relation is being created. In the previous implementation, this was achieved by dropping the target RDBMS table and recreating it in a format Spark understands. In the current implementation, however, we have removed the need for any such DDL operation (i.e. dropping and recreating the target table) when the Spark table is created, unless specifically requested.
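To give a rough idea of what this means in practice, a table definition under the new provider could look something like the following. This is purely an illustrative sketch: the data source, table, and field names are hypothetical, and the exact option and type keywords should be verified against the documentation once it is published.

    -- Illustrative sketch only: register a Spark temporary table backed by an RDBMS table
    CREATE TEMPORARY TABLE sensorData
    USING CarbonJDBC
    OPTIONS (
        dataSource "WSO2_ANALYTICS_RDBMS",
        tableName "SENSOR_DATA",
        schema "sensorId STRING -i, temperature DOUBLE, recordedAt LONG",
        primaryKeys "sensorId, recordedAt"
    );

Here the field types in "schema" are Spark catalyst types (the same ones used with the CarbonAnalytics provider), and the "-i" against sensorId would create an RDBMS index on that field. Because "primaryKeys" is specified, a subsequent

    INSERT INTO TABLE sensorData SELECT ...

would be performed as an UPSERT against the underlying table, while

    INSERT OVERWRITE TABLE sensorData SELECT ...

would clear the existing data without affecting the schema or index definitions.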
The issue with making this parameter optional is that we would then have to fall back to the earlier behaviour of inferring the schema from the table metadata whenever it is not specified. This would mean having to maintain a list of reverse mappings, which would pollute the implementation. Moreover, we would end up with inconsistencies where certain table schemata are inferred while others are explicitly specified.

Please note that this is neither an API change nor a change in the deployable artefacts: the user merely has to edit his/her DAS extensions (i.e. Spark scripts) if applicable. We will clearly point out the changes that need to be done in the DAS 3.1.0 migration guide.

Thanks,

On 13 June 2016 at 16:44, Anjana Fernando <[email protected]> wrote:

> Hi,
>
> On Mon, Jun 13, 2016 at 12:23 PM, Nirmal Fernando <[email protected]> wrote:
>
>>> The "schema" option is required, and is used to specify the schema to be utilised throughout the temporary table's lifetime. Here, the field types used for the schema match what we have for the CarbonAnalytics provider (i.e. not JDBC nor databridge), and correspond to Spark catalyst types. Moreover, the optional "-i" keyword, if specified for a field, will create an RDBMS index for that field.
>>
>> Can't we make 'schema' optional as it was earlier? This introduces a backward incompatible change otherwise.
>
> The schema was optional before because the earlier implementation mandated that the user create the table beforehand, which was not desirable; for subsequent "insert overwrite" statements it dropped the table and tried to re-create it, and did not do a good job of it. So this approach was taken to make the way we create the tables more consistent. In the new implementation, we need the schema to be known beforehand, including its primary keys etc., to do the operations properly. But yes, for the sake of backward compatibility, we can do a somewhat best-effort implementation by looking up the table metadata through JDBC and trying to figure out the schema, mainly the primary keys, which is the critical information we need. This is not something we can always rely on, though, since some DBMSs may not expose it properly through JDBC. So it is highly recommended to move to the new approach when you're using CarbonJDBC, but we will try to do a best-effort implementation to retain backward compatibility. @Gokul, please check on this.
>
> Cheers,
> Anjana.
>
>>> The "primaryKeys" option is not mandatory, and may be used to denote unique key fields in the underlying RDBMS table. It is based on this option that INSERT or UPSERT queries will be chosen when performing Spark INSERT INTO queries, as explained above.
>>>
>>> We're in the process of documenting the usage patterns of this provider so that they can be better understood.
>>>
>>> Thanks,
>>>
>>> On 10 June 2016 at 15:16, Inosh Goonewardena <[email protected]> wrote:
>>>
>>>> Hi Gokul,
>>>>
>>>> On Fri, Jun 10, 2016 at 2:08 PM, Gokul Balakrishnan <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> In DAS 3.0.x, for interacting with relational databases directly from Spark (i.e. bypassing the data access layer), we have hitherto been using the JDBC connector that comes directly with Apache Spark (with added support for Carbon datasources).
>>>>>
>>>>> This connector has had a number of issues that have been detrimental to a proper user experience, including:
>>>>>
>>>>> - Having to create tables on the RDBMS manually, prior to query execution
>>>>> - Tables getting dropped and re-created with a Spark-dictated schema during initialization
>>>>> - No support for RDBMS unique keys
>>>>> - Not being able to perform INSERT INTO queries on RDBMS tables which have unique keys set, and as a result the user having to depend upon INSERT OVERWRITE, which clears the table and results in the loss of historical data
>>>>>
>>>>> I have been working on overhauling this connector over the past couple of weeks to address the above flaws and bring it up to scratch. A new config file which contains the relevant information for a particular RDBMS flavour (such as parameterised query formats, datatypes etc.) has also been introduced. An overview of all improvements is as follows:
>>>>>
>>>>> - RDBMS tables will be created dynamically (based on the schema provided by the user) if they don't already exist
>>>>
>>>> What is the data type to be used with fields in the schema? Is it SQL types or data bridge data types? Could you please provide a sample create table query?
>>>>
>>>>> - Pre-existing tables will be used as they are, without dropping/recreating
>>>>> - Recognition of primary keys, and automatic switching between INSERT/UPSERT modes during Spark's INSERT INTO calls
>>>>> - Support for creating DB indices, based on an additional input parameter
>>>>> - Spark INSERT OVERWRITE calls can be used to clear the existing table without the existing schema/index definitions being affected
>>>>>
>>>>> This initial implementation can be found at [1]. It's written mostly in Scala.
>>>>>
>>>>> Initially, we've tested the connector against MySQL as part of the first cut, and we will be testing against all DBs supported by DAS over the following days. The connector is expected to be shipped with the DAS 3.1.0 release.
>>>>>
>>>>> Thoughts welcome.
>>>>>
>>>>> [1] https://github.com/wso2/carbon-analytics/pull/187
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Gokul Balakrishnan
>>>>> Senior Software Engineer,
>>>>> WSO2, Inc. http://wso2.com
>>>>> M +94 77 5935 789 | +44 7563 570502
>>>>
>>>> --
>>>> Thanks & Regards,
>>>>
>>>> Inosh Goonewardena
>>>> Associate Technical Lead - WSO2 Inc.
>>>> Mobile: +94779966317
>>
>> --
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>
> --
> *Anjana Fernando*
> Senior Technical Lead
> WSO2 Inc. | http://wso2.com
> lean . enterprise . middleware

--
Gokul Balakrishnan
Senior Software Engineer,
WSO2, Inc. http://wso2.com
M +94 77 5935 789 | +44 7563 570502
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
