>> This JDBC integration is just a Spark data source, which means that Spark
>> will fetch data in its local memory first, and only then apply filters,
>> aggregations, etc.

It seems there is a backdoor exposed via the standard SQL syntax. You can
execute so-called "pushdown" queries [1] that are sent by Spark straight to
the JDBC database, with the result wrapped into a DataFrame. I could do this
trick using Ignite as a JDBC-compliant data source, executing the query
below over the data stored in the cluster:

SELECT p.name as person, c.name as city
FROM person p, city c WHERE p.city_id = c.id

There are some limitations though, because the actual query issued by Spark
will be:

SELECT * FROM (SELECT p.name as person, c.name as city
FROM person p, city c WHERE p.city_id = c.id) as res

Here [2] is a complete example.
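For a sense of what such an example boils down to, here is a minimal Scala
sketch (not the code from [2]; the thin JDBC driver class and the localhost
URL assume a default local Ignite 2.1 node):

import org.apache.spark.sql.SparkSession

object IgnitePushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignite-pushdown")
      .master("local[*]")
      .getOrCreate()

    // Passing a parenthesized, aliased subquery as "dbtable" makes Spark
    // ship the whole join to Ignite and merely wrap the result set.
    val df = spark.read
      .format("jdbc")
      .option("driver", "org.apache.ignite.IgniteJdbcThinDriver") // assumes the thin driver is on the classpath
      .option("url", "jdbc:ignite:thin://127.0.0.1/")             // assumes a node on localhost
      .option("dbtable",
        "(SELECT p.name as person, c.name as city " +
        "FROM person p, city c WHERE p.city_id = c.id) as res")
      .load()

    df.show()
    spark.stop()
  }
}

The outer SELECT * FROM (...) wrapping is also why the subquery has to carry
its own alias (res): many databases require an alias on a derived table.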
[1] https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#pushdown-query-to-database-engine
[2] https://github.com/dmagda/ignite-dataframes

—
Denis

> On Aug 4, 2017, at 3:41 PM, Dmitriy Setrakyan <d...@gridgain.com> wrote:
>
> On Thu, Aug 3, 2017 at 9:04 PM, Valentin Kulichenko
> <valentin.kuliche...@gmail.com> wrote:
>
>> This JDBC integration is just a Spark data source, which means that Spark
>> will fetch data in its local memory first, and only then apply filters,
>> aggregations, etc. This is obviously slow and doesn't use all the
>> advantages Ignite provides.
>>
>> To create a useful and valuable integration, we should create a custom
>> Strategy that will convert Spark's logical plan into a SQL query and
>> execute it directly on Ignite.
>
> I get it, but we have been talking about Data Frame support for longer
> than a year. I think we should advise our users to switch to JDBC until
> the community gets someone to implement it.
>
>> -Val
>>
>> On Thu, Aug 3, 2017 at 12:12 AM, Dmitriy Setrakyan
>> <dsetrak...@apache.org> wrote:
>>
>>> On Thu, Aug 3, 2017 at 9:04 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> I think the development effort would still be higher. Everything would
>>>> have to be put into Ignite via JDBC, then checkpointing would have to
>>>> be done via JDBC (again, additional development effort), plus a lot of
>>>> conversion from Spark's internal format to JDBC and back to Ignite's
>>>> internal format. I do not see pagination as a useful feature for
>>>> managing large data volumes from databases - on the contrary, it is
>>>> very inefficient (and one would have to implement logic to fetch all
>>>> pages). Pagination was also never meant for fetching large data
>>>> volumes, but for web pages showing a small result set over several
>>>> pages, where the user can manually click through to the next page
>>>> (which they mostly do not do anyway).
>>>>
>>>> While it might be a quick solution, I think a deeper integration than
>>>> JDBC would be more beneficial.
>>>
>>> Jorn, I completely agree. However, we have not been able to find a
>>> contributor for this feature. You sound like you have sufficient domain
>>> expertise in Spark and Ignite. Would you be willing to help out?
>>>
>>>>> On 3. Aug 2017, at 08:57, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>
>>>>>> On Thu, Aug 3, 2017 at 8:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>
>>>>>> I think the JDBC one is more inefficient and slower, and requires too
>>>>>> much development effort. You can also check the integration of
>>>>>> Alluxio with Spark.
>>>>>
>>>>> As far as I know, Alluxio is a file system, so it cannot use JDBC.
>>>>> Ignite, on the other hand, is an SQL system and works well with JDBC.
>>>>> As for the development effort, we are dealing with SQL, so I am not
>>>>> sure why JDBC would be harder.
>>>>>
>>>>> Generally speaking, until Ignite provides native data frame
>>>>> integration, having JDBC-based integration out of the box is minimally
>>>>> acceptable.
>>>>>
>>>>>> Then, in general, I think JDBC was never designed for large data
>>>>>> volumes. It is for executing queries and getting a small or
>>>>>> aggregated result set back, or alternatively for inserting / updating
>>>>>> single rows.
>>>>>
>>>>> Agree in general. However, Ignite JDBC is designed to work with larger
>>>>> data volumes and supports data pagination automatically.
>>>>>
>>>>>>> On 3. Aug 2017, at 08:17, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>>>
>>>>>>> Jorn, thanks for your feedback!
>>>>>>>
>>>>>>> Can you explain how the direct support would be different from the
>>>>>>> JDBC support?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> D.
>>>>>>>
>>>>>>>> On Thu, Aug 3, 2017 at 7:40 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> These are two different things. Spark applications themselves do
>>>>>>>> not use JDBC - it is more for non-Spark applications to access
>>>>>>>> Spark DataFrames.
>>>>>>>>
>>>>>>>> Direct support by Ignite would make more sense. Although you have
>>>>>>>> IGFS in theory, that only helps if the user is using HDFS, which
>>>>>>>> might not be the case. It is now also very common to use object
>>>>>>>> stores, such as S3. Direct support could be leveraged for
>>>>>>>> interactive analysis or for different Spark applications sharing
>>>>>>>> data.
>>>>>>>>
>>>>>>>>> On 3. Aug 2017, at 05:12, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Igniters,
>>>>>>>>>
>>>>>>>>> We have had the integration with Spark Data Frames on our roadmap
>>>>>>>>> for a while:
>>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-3084
>>>>>>>>>
>>>>>>>>> However, while browsing the Spark documentation, I came across the
>>>>>>>>> generic JDBC data frame support in Spark:
>>>>>>>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
>>>>>>>>>
>>>>>>>>> Given that Ignite has a JDBC driver, does it mean that it
>>>>>>>>> transitively also supports Spark data frames? If yes, we should
>>>>>>>>> document it.
>>>>>>>>>
>>>>>>>>> D.
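
P.S. For whoever picks up the deeper integration discussed above: the
Spark-side entry point for Valentin's suggestion is a custom planning
Strategy registered via spark.experimental.extraStrategies. Below is a
bare, hypothetical skeleton (the object names are made up, and the actual
plan-to-SQL translation plus the physical node that runs the query on
Ignite are deliberately left out):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Matching on PhysicalOperation extracts the projections and filters that
// would have to be translated into an Ignite SQL query.
object IgniteSQLStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters, relation) =>
      // TODO: if `relation` is backed by Ignite, build a SQL string from
      // `projects` and `filters` and return a physical scan node that
      // executes it on the cluster.
      Nil // returning Nil falls back to Spark's built-in strategies
    case _ => Nil
  }
}

object IgniteStrategyRegistration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignite-strategy-sketch")
      .master("local[*]")
      .getOrCreate()
    // Extra strategies are consulted before Spark's own during planning.
    spark.experimental.extraStrategies = IgniteSQLStrategy :: Nil
    spark.stop()
  }
}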