Hi Gokul,

Will this allow us to perform INSERT INTO queries with sample data (not
from a table)? This is useful in the DEV phase.
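A hedged sketch of what such a development-time query might look like with this connector; the table name and values below are hypothetical, and the Spark SQL version bundled with DAS may require a SELECT of literals rather than a VALUES clause:

```sql
-- Hypothetical dev-phase query: insert literal sample rows into a
-- CarbonJDBC-backed temporary table. If the bundled Spark SQL version
-- does not accept a VALUES clause, selecting literals is a common
-- workaround.
INSERT INTO TABLE devOrders
SELECT "order-001" AS order_id, 42.5 AS amount;
```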
Cheers~

On Tue, Jun 14, 2016 at 7:38 AM, Gokul Balakrishnan <[email protected]> wrote:

> Hi product analytics leads,
>
> Please make sure that the configuration file spark-jdbc-config.xml is
> added to the product-analytics packs, especially if you're using the
> CarbonJDBC provider. An example commit may be found at [1].
>
> [1] https://github.com/wso2/product-das/commit/4bdbf68833bd2bc8a20549eaf726873cacde468f
>
> Thanks,
>
> On 13 June 2016 at 17:37, Gokul Balakrishnan <[email protected]> wrote:
>
>> Hi Anjana, Nirmal,
>>
>> The schema being mandatory is an architectural decision we've had to
>> take. To go into a bit more detail as to the reasons: Spark requires its
>> own catalyst schema to be constructed when a relation is being created.
>> In the previous implementation, this was achieved by dropping the target
>> RDBMS table and re-creating it in a format Spark understands. In the
>> current implementation, however, we have removed the need for any DML
>> operation during table creation, unless specifically requested.
>>
>> The issue with making this parameter optional is that we would have to
>> fall back to the earlier behaviour of inferring the schema from the
>> table metadata whenever it is not specified. This would mean maintaining
>> a list of reverse mappings, which would pollute the implementation.
>> Moreover, we would have inconsistencies when certain table schemata are
>> inferred while others are specified.
>>
>> Please note that this is neither an API change nor a change in
>> deployable artefacts: the user merely has to edit his/her DAS extensions
>> (i.e. Spark scripts) if applicable. We will clearly point out the
>> changes that need to be done in the DAS 3.1.0 migration guide.
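For reference, a sketch of what a per-flavour entry in spark-jdbc-config.xml might contain, based on the description of the file elsewhere in this thread (parameterised query formats, datatypes); the element names and placeholders below are illustrative assumptions, not the actual file schema:

```xml
<!-- Illustrative sketch only: element names and placeholders are
     assumptions, not the real spark-jdbc-config.xml schema. -->
<database name="mysql">
  <!-- Mapping from Spark catalyst types to native column types -->
  <typeMapping>
    <string>VARCHAR(254)</string>
    <integer>INTEGER</integer>
    <double>DOUBLE</double>
  </typeMapping>
  <!-- Parameterised query format used for UPSERT on this flavour -->
  <upsertQuery>
    INSERT INTO {{TABLE}} ({{COLUMNS}}) VALUES ({{VALUES}})
    ON DUPLICATE KEY UPDATE {{KEY_VALUE_PAIRS}}
  </upsertQuery>
</database>
```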
>>
>> Thanks,
>>
>> On 13 June 2016 at 16:44, Anjana Fernando <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> On Mon, Jun 13, 2016 at 12:23 PM, Nirmal Fernando <[email protected]> wrote:
>>>
>>>>> The "schema" option is required, and is used to specify the schema
>>>>> to be utilised throughout the temporary table's lifetime. Here, the
>>>>> field types used for the schema match what we have for the
>>>>> CarbonAnalytics provider (i.e. not JDBC nor databridge), and
>>>>> correspond to Spark catalyst types. Moreover, the optional "-i"
>>>>> keyword, if specified for a field, will create an RDBMS index for
>>>>> that field.
>>>>
>>>> Can't we make 'schema' optional as it was earlier? Otherwise this
>>>> introduces a backward-incompatible change.
>>>
>>> The schema was optional before because the earlier implementation
>>> mandated that the user create the table beforehand, which was not
>>> desirable; for subsequent INSERT OVERWRITE statements, it dropped the
>>> table and tried to re-create it, and didn't do a good job of it. So
>>> this approach was taken to make the way we create tables more
>>> consistent. Also, in the new implementation, we need the schema to be
>>> known beforehand, including its primary keys etc., to do the
>>> operations properly. But yes, for the sake of backward compatibility,
>>> we can do a somewhat best-effort implementation by looking up the
>>> table metadata through JDBC and trying to figure out the table schema,
>>> mainly the primary keys, which is the critical information we need.
>>> This is not something we can always expect from JDBC, though, since
>>> some DBMSs may not expose it properly. So in any case, it is highly
>>> recommended to move to the new approach when you're using CarbonJDBC,
>>> but we will try to do a best-effort implementation to retain backward
>>> compatibility. @Gokul, please check on this.
>>>
>>> Cheers,
>>> Anjana.
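To make the "schema" option and the "-i" keyword concrete, a best-guess sketch of a CarbonJDBC table definition; the datasource, table, and field names are hypothetical, and the exact option keys are assumptions rather than confirmed syntax:

```sql
-- Sketch: a CarbonJDBC temporary table with an explicit schema.
-- Field types are Spark catalyst types (the CarbonAnalytics
-- convention); "-i" after a field requests an RDBMS index on it.
CREATE TEMPORARY TABLE orders
USING CarbonJDBC
OPTIONS (
  dataSource "ORDERS_DB",   -- Carbon datasource name (hypothetical)
  tableName  "ORDERS",
  schema     "order_id STRING, customer STRING -i, amount DOUBLE"
);
```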
>>>>
>>>>> The "primaryKeys" option is not mandatory, and may be used to
>>>>> denote unique key fields in the underlying RDBMS table. It is based
>>>>> on this option that INSERT or UPSERT queries will be chosen when
>>>>> performing Spark INSERT INTO queries, as explained above.
>>>>>
>>>>> We're in the process of documenting the usage patterns of this
>>>>> provider so that they can be better understood.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> On 10 June 2016 at 15:16, Inosh Goonewardena <[email protected]> wrote:
>>>>>
>>>>>> Hi Gokul,
>>>>>>
>>>>>> On Fri, Jun 10, 2016 at 2:08 PM, Gokul Balakrishnan <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> In DAS 3.0.x, for interacting with relational databases directly
>>>>>>> from Spark (i.e. bypassing the data access layer), we have
>>>>>>> hitherto been using the JDBC connector that comes directly with
>>>>>>> Apache Spark (with added support for Carbon datasources).
>>>>>>>
>>>>>>> This connector has contained many issues that have been
>>>>>>> detrimental to proper user experience, including:
>>>>>>>
>>>>>>> - Having to create tables on the RDBMS beforehand, prior to query
>>>>>>> execution
>>>>>>> - Tables getting dropped and re-created with a Spark-dictated
>>>>>>> schema during initialization
>>>>>>> - No support for RDBMS unique keys
>>>>>>> - Not being able to perform INSERT INTO queries on RDBMS tables
>>>>>>> which have unique keys set; as a result, the user has had to
>>>>>>> depend upon INSERT OVERWRITE, which clears the table, resulting in
>>>>>>> the loss of historical data
>>>>>>>
>>>>>>> I have been working on overhauling this connector over the past
>>>>>>> couple of weeks to address the above flaws and bring it up to
>>>>>>> scratch. A new config file, which contains the relevant
>>>>>>> information for a particular RDBMS flavour (such as parameterised
>>>>>>> query formats, datatypes, etc.), has also been introduced.
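The "primaryKeys" behaviour described above can be sketched as follows; as before, the option keys, datasource, and table names are illustrative assumptions rather than confirmed syntax:

```sql
-- Sketch: with "primaryKeys" set, a Spark INSERT INTO is expected to
-- translate to an UPSERT on the underlying RDBMS table; without it,
-- a plain INSERT. Names below are hypothetical.
CREATE TEMPORARY TABLE orders
USING CarbonJDBC
OPTIONS (
  dataSource  "ORDERS_DB",
  tableName   "ORDERS",
  schema      "order_id STRING, amount DOUBLE",
  primaryKeys "order_id"
);

-- Re-running this updates rows that already exist for a given
-- order_id instead of failing on the unique key, so historical data
-- is preserved.
INSERT INTO TABLE orders SELECT order_id, amount FROM stagingOrders;
```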
An overview of all improvements is as follows:
>>>>>>>
>>>>>>> - RDBMS tables will be created dynamically (based on the schema
>>>>>>> provided by the user) if they don't already exist
>>>>>>
>>>>>> What is the data type to be used with fields in the schema? Is it
>>>>>> SQL types or data bridge data types? Could you please provide a
>>>>>> sample CREATE TABLE query?
>>>>>>
>>>>>>> - Pre-existing tables will be used as-is, without being dropped
>>>>>>> and re-created
>>>>>>> - Primary keys are recognised, with automatic switching between
>>>>>>> INSERT and UPSERT modes during Spark's INSERT INTO calls
>>>>>>> - Support for creating DB indices, based on an additional input
>>>>>>> parameter
>>>>>>> - Spark INSERT OVERWRITE calls can be used to clear the existing
>>>>>>> table without affecting the existing schema/index definitions
>>>>>>>
>>>>>>> This initial implementation can be found at [1]. It's written
>>>>>>> mostly in Scala.
>>>>>>>
>>>>>>> We've tested the connector against MySQL as a first cut, and we
>>>>>>> will be testing against all DBs supported by DAS over the
>>>>>>> following days. The connector is expected to be shipped with the
>>>>>>> DAS 3.1.0 release.
>>>>>>>
>>>>>>> Thoughts welcome.
>>>>>>>
>>>>>>> [1] https://github.com/wso2/carbon-analytics/pull/187
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Gokul
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Inosh Goonewardena
>>>>>> Associate Technical Lead - WSO2 Inc.
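The last improvement in the list above might look like this in practice; the table and source names are hypothetical:

```sql
-- Sketch: under the new connector, INSERT OVERWRITE clears the rows
-- of the underlying RDBMS table but leaves its definition and indices
-- intact (previously the table itself was dropped and re-created).
INSERT OVERWRITE TABLE orders
SELECT order_id, amount FROM freshOrders;
```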
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>> Team Lead - WSO2 Machine Learner
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>
>>> --
>>> Anjana Fernando
>>> Senior Technical Lead
>>> WSO2 Inc. | http://wso2.com
>>
>> --
>> Gokul Balakrishnan
>> Senior Software Engineer,
>> WSO2, Inc. | http://wso2.com

--
Dulitha Wijewantha (Chan)
Software Engineer - Mobile Development
WSO2 Inc. | Lean . Enterprise . Middleware
Email: [email protected] | Mobile: +94712112165
Website: http://dulitha.me | Twitter: @dulitharw
GitHub: @dulichan | SO: @chan
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
