Denis,

This only allows you to limit the dataset fetched from the database into
Spark. It is useful, but it does not replace the custom Strategy
integration: after you create the DataFrame, you will use its API to do
additional filtering, mapping, aggregation, etc., and all of that will
happen within Spark. With a custom Strategy, the whole processing would be
done on the Ignite side.
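To make the difference concrete, here is a minimal sketch of what a plain data-source integration does, using sqlite3 as a stand-in for the JDBC source and plain Python as a stand-in for Spark (all table and variable names are hypothetical):

```python
import sqlite3

# sqlite3 plays the role of the database (Ignite); plain Python plays Spark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person(id INTEGER, name TEXT, age INTEGER);
    INSERT INTO person VALUES (1, 'Ann', 25), (2, 'Bob', 35), (3, 'Cat', 45);
""")

# A pushdown query limits what the database ships over the wire...
fetched = conn.execute("SELECT name, age FROM person WHERE age > 30").fetchall()

# ...but any further filtering/aggregation happens on the "Spark" side, over
# rows already pulled into local memory. A custom Strategy would instead
# translate this step into SQL and execute it inside Ignite.
locally_filtered = [name for name, age in fetched if age > 40]
print(locally_filtered)  # ['Cat']
```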

-Val

On Thu, Aug 10, 2017 at 3:07 PM, Denis Magda <dma...@apache.org> wrote:

> >> This JDBC integration is just a Spark data source, which means that Spark
> >> will fetch data in its local memory first, and only then apply filters,
> >> aggregations, etc.
>
> Seems that there is a backdoor exposed via the standard SQL syntax. You
> can execute so-called “pushdown” queries [1] that are sent by Spark to a
> JDBC database right away, and the result is wrapped in the form of a
> DataFrame.
>
> I could do this trick using Ignite as a JDBC-compliant data source,
> executing the query below over the data stored in the cluster:
>
> SELECT p.name as person, c.name as city
>     FROM person p, city c WHERE p.city_id = c.id
>
> There are some limitations, though, because the actual query issued by
> Spark will be:
>
> SELECT * FROM (SELECT p.name as person, c.name as city
>     FROM person p, city c WHERE p.city_id = c.id) as res
>
> Here [2] is a complete example.
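The wrapping Spark applies to a pushdown query is harmless as long as the inner query can appear as a subselect. A small sketch, with sqlite3 standing in for Ignite's JDBC endpoint and a hypothetical schema matching the example above:

```python
import sqlite3

# Stand-in schema for the person/city example; sqlite3 is used only to show
# that the inner query and Spark's wrapped form return identical rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city(id INTEGER, name TEXT);
    CREATE TABLE person(id INTEGER, name TEXT, city_id INTEGER);
    INSERT INTO city VALUES (1, 'Denver'), (2, 'St. Petersburg');
    INSERT INTO person VALUES (1, 'John Doe', 2);
""")

inner = ("SELECT p.name as person, c.name as city "
         "FROM person p, city c WHERE p.city_id = c.id")

# The form Spark actually issues for a pushdown dbtable expression:
wrapped = "SELECT * FROM ({}) as res".format(inner)

assert conn.execute(inner).fetchall() == conn.execute(wrapped).fetchall()
print(conn.execute(wrapped).fetchall())  # [('John Doe', 'St. Petersburg')]
```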
>
>
> [1] https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#pushdown-query-to-database-engine
> [2] https://github.com/dmagda/ignite-dataframes
>
> —
> Denis
>
> > On Aug 4, 2017, at 3:41 PM, Dmitriy Setrakyan <d...@gridgain.com> wrote:
> >
> > On Thu, Aug 3, 2017 at 9:04 PM, Valentin Kulichenko <
> > valentin.kuliche...@gmail.com> wrote:
> >
> >> This JDBC integration is just a Spark data source, which means that Spark
> >> will fetch data in its local memory first, and only then apply filters,
> >> aggregations, etc. This is obviously slow and doesn't use all the
> >> advantages Ignite provides.
> >>
> >> To create useful and valuable integration, we should create a custom
> >> Strategy that will convert Spark's logical plan into a SQL query and
> >> execute it directly on Ignite.
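The Strategy approach described here boils down to folding Spark's logical plan into a single SQL string that Ignite can execute. A toy illustration of the idea, with hypothetical plan classes standing in for Spark Catalyst operators (this is not Spark's actual API):

```python
# Toy logical plan nodes, hypothetical stand-ins for Catalyst operators.
class Scan:
    def __init__(self, table): self.table = table

class Filter:
    def __init__(self, condition, child): self.condition, self.child = condition, child

class Project:
    def __init__(self, columns, child): self.columns, self.child = columns, child

def to_sql(plan):
    """Fold a plan bottom-up into one SQL string the database can run."""
    if isinstance(plan, Scan):
        return "SELECT * FROM {}".format(plan.table)
    if isinstance(plan, Filter):
        return "SELECT * FROM ({}) f WHERE {}".format(to_sql(plan.child), plan.condition)
    if isinstance(plan, Project):
        return "SELECT {} FROM ({}) p".format(", ".join(plan.columns), to_sql(plan.child))
    raise TypeError(plan)

plan = Project(["name"], Filter("age > 30", Scan("person")))
print(to_sql(plan))
# SELECT name FROM (SELECT * FROM (SELECT * FROM person) f WHERE age > 30) p
```

A real Strategy would of course also handle joins, aggregates, and expression translation, but the shape of the problem is the same.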
> >>
> >
> > I get it, but we have been talking about Data Frame support for longer
> > than a year. I think we should advise our users to switch to JDBC until
> > the community gets someone to implement it.
> >
> >
> >>
> >> -Val
> >>
> >> On Thu, Aug 3, 2017 at 12:12 AM, Dmitriy Setrakyan <dsetrak...@apache.org>
> >> wrote:
> >>
> >>> On Thu, Aug 3, 2017 at 9:04 AM, Jörn Franke <jornfra...@gmail.com>
> >> wrote:
> >>>
> >>>> I think the development effort would still be higher. Everything would
> >>>> have to be put into Ignite via JDBC, then checkpointing would have to be
> >>>> done via JDBC (again, additional development effort), plus a lot of
> >>>> conversion from Spark's internal format to JDBC and back to Ignite's
> >>>> internal format. I do not see pagination as a useful feature for managing
> >>>> large data volumes from databases - on the contrary, it is very
> >>>> inefficient (and one would have to implement logic to fetch all pages).
> >>>> Pagination was also never intended for fetching large data volumes, but
> >>>> for web pages showing a small result set over several pages, where the
> >>>> user can click manually for the next page (which they mostly do not do
> >>>> anyway).
> >>>>
> >>>> While it might be a quick solution, I think a deeper integration than
> >>>> JDBC would be more beneficial.
> >>>>
> >>>
> >>> Jorn, I completely agree. However, we have not been able to find a
> >>> contributor for this feature. You sound like you have sufficient domain
> >>> expertise in Spark and Ignite. Would you be willing to help out?
> >>>
> >>>
> >>>>> On 3. Aug 2017, at 08:57, Dmitriy Setrakyan <dsetrak...@apache.org>
> >>>> wrote:
> >>>>>
> >>>>>> On Thu, Aug 3, 2017 at 8:45 AM, Jörn Franke <jornfra...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>> I think the JDBC one is more inefficient and slower, and requires too
> >>>>>> much development effort. You can also check the integration of Alluxio
> >>>>>> with Spark.
> >>>>>>
> >>>>>
> >>>>> As far as I know, Alluxio is a file system, so it cannot use JDBC.
> >>>>> Ignite, on the other hand, is an SQL system and works well with JDBC. As
> >>>>> far as the development effort goes, we are dealing with SQL, so I am not
> >>>>> sure why JDBC would be harder.
> >>>>>
> >>>>> Generally speaking, until Ignite provides native data frame integration,
> >>>>> having JDBC-based integration out of the box is minimally acceptable.
> >>>>>
> >>>>>
> >>>>>> Then, in general, I think JDBC was never designed for large data
> >>>>>> volumes. It is for executing queries and getting a small or aggregated
> >>>>>> result set back, or alternatively for inserting/updating single rows.
> >>>>>>
> >>>>>
> >>>>> Agree in general. However, Ignite JDBC is designed to work with larger
> >>>>> data volumes and supports data pagination automatically.
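The pagination mentioned here follows the same batch-at-a-time pattern a JDBC driver uses with a fetch size: the client holds only one batch of rows in memory at any moment instead of materializing the whole result set. A minimal sketch of that pattern, using Python's sqlite3 as a stand-in (the page size of 4 is purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# Consume the result set one "page" at a time instead of fetchall(); a JDBC
# fetch size works on the same principle on the driver side.
cur = conn.execute("SELECT v FROM t ORDER BY v")
pages = []
while True:
    page = cur.fetchmany(4)  # illustrative page size
    if not page:
        break
    pages.append([v for (v,) in page])
print(pages)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```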
> >>>>>
> >>>>>
> >>>>>>> On 3. Aug 2017, at 08:17, Dmitriy Setrakyan <dsetrak...@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Jorn, thanks for your feedback!
> >>>>>>>
> >>>>>>> Can you explain how the direct support would be different from the
> >>> JDBC
> >>>>>>> support?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> D.
> >>>>>>>
> >>>>>>>> On Thu, Aug 3, 2017 at 7:40 AM, Jörn Franke <jornfra...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> These are two different things. Spark applications themselves do not
> >>>>>>>> use JDBC - it is more for non-Spark applications to access Spark
> >>>>>>>> DataFrames.
> >>>>>>>>
> >>>>>>>> Direct support by Ignite would make more sense. Although you have, in
> >>>>>>>> theory, IGFS if the user is using HDFS, that might not be the case; it
> >>>>>>>> is now also very common to use object stores such as S3. Direct
> >>>>>>>> support could be leveraged for interactive analysis or for different
> >>>>>>>> Spark applications sharing data.
> >>>>>>>>
> >>>>>>>>> On 3. Aug 2017, at 05:12, Dmitriy Setrakyan <dsetrak...@apache.org>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Igniters,
> >>>>>>>>>
> >>>>>>>>> We have had the integration with Spark Data Frames on our roadmap
> >>>>>>>>> for a while:
> >>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-3084
> >>>>>>>>>
> >>>>>>>>> However, while browsing Spark documentation, I came across the
> >>>>>>>>> generic JDBC data frame support in Spark:
> >>>>>>>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
> >>>>>>>>>
> >>>>>>>>> Given that Ignite has a JDBC driver, does it mean that it
> >>>>>>>>> transitively also supports Spark data frames? If yes, we should
> >>>>>>>>> document it.
> >>>>>>>>>
> >>>>>>>>> D.
> >>>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
>
>
