>> This JDBC integration is just a Spark data source, which means that Spark
>> will fetch data in its local memory first, and only then apply filters,
>> aggregations, etc.

It seems there is a backdoor exposed via the standard SQL syntax. You can
execute so-called "pushdown" queries [1] that are sent by Spark straight to
the JDBC database, with the result wrapped into a DataFrame. I could do this
trick using Ignite as a JDBC-compliant data source, executing the query
below over the data stored in the cluster:

SELECT p.name as person, c.name as city
FROM person p, city c WHERE p.city_id = c.id

There are some limitations though, because the actual query issued by Spark
will be:

SELECT * FROM (SELECT p.name as person, c.name as city
FROM person p, city c WHERE p.city_id = c.id) as res

Here [2] is a complete example.
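For a sense of what such an example boils down to, here is a minimal Scala
sketch (not the code from [2]; the thin JDBC driver class and the localhost
URL assume a default local Ignite 2.1 node):

import org.apache.spark.sql.SparkSession

object IgnitePushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignite-pushdown")
      .master("local[*]")
      .getOrCreate()

    // Passing a parenthesized, aliased subquery as "dbtable" makes Spark
    // ship the whole join to Ignite and merely wrap the result set.
    val df = spark.read
      .format("jdbc")
      .option("driver", "org.apache.ignite.IgniteJdbcThinDriver") // assumes the thin driver is on the classpath
      .option("url", "jdbc:ignite:thin://127.0.0.1/")             // assumes a node on localhost
      .option("dbtable",
        "(SELECT p.name as person, c.name as city " +
        "FROM person p, city c WHERE p.city_id = c.id) as res")
      .load()

    df.show()
    spark.stop()
  }
}

The outer SELECT * FROM (...) wrapping is also why the subquery has to carry
its own alias (res): many databases require an alias on a derived table.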
[1] https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#pushdown-query-to-database-engine
[2] https://github.com/dmagda/ignite-dataframes

—
Denis

> On Aug 4, 2017, at 3:41 PM, Dmitriy Setrakyan <d...@gridgain.com> wrote:
>
> On Thu, Aug 3, 2017 at 9:04 PM, Valentin Kulichenko
> <valentin.kuliche...@gmail.com> wrote:
>
>> This JDBC integration is just a Spark data source, which means that Spark
>> will fetch data in its local memory first, and only then apply filters,
>> aggregations, etc. This is obviously slow and doesn't use all the
>> advantages Ignite provides.
>>
>> To create a useful and valuable integration, we should create a custom
>> Strategy that will convert Spark's logical plan into a SQL query and
>> execute it directly on Ignite.
>
> I get it, but we have been talking about Data Frame support for longer
> than a year. I think we should advise our users to switch to JDBC until
> the community gets someone to implement it.
>
>> -Val
>>
>> On Thu, Aug 3, 2017 at 12:12 AM, Dmitriy Setrakyan
>> <dsetrak...@apache.org> wrote:
>>
>>> On Thu, Aug 3, 2017 at 9:04 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> I think the development effort would still be higher. Everything would
>>>> have to be put into Ignite via JDBC, then checkpointing would have to
>>>> be done via JDBC (again, additional development effort), plus a lot of
>>>> conversion from Spark's internal format to JDBC and back to Ignite's
>>>> internal format. I do not see pagination as a useful feature for
>>>> managing large data volumes from databases - on the contrary, it is
>>>> very inefficient (and one would have to implement logic to fetch all
>>>> pages). Pagination was also never meant for fetching large data
>>>> volumes, but for web pages showing a small result set over several
>>>> pages, where the user can manually click through to the next page
>>>> (which they mostly do not do anyway).
>>>>
>>>> While it might be a quick solution, I think a deeper integration than
>>>> JDBC would be more beneficial.
>>>
>>> Jorn, I completely agree. However, we have not been able to find a
>>> contributor for this feature. You sound like you have sufficient domain
>>> expertise in Spark and Ignite. Would you be willing to help out?
>>>
>>>>> On 3. Aug 2017, at 08:57, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>
>>>>>> On Thu, Aug 3, 2017 at 8:45 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>
>>>>>> I think the JDBC one is more inefficient and slower, and requires too
>>>>>> much development effort. You can also check the integration of
>>>>>> Alluxio with Spark.
>>>>>
>>>>> As far as I know, Alluxio is a file system, so it cannot use JDBC.
>>>>> Ignite, on the other hand, is an SQL system and works well with JDBC.
>>>>> As for the development effort, we are dealing with SQL, so I am not
>>>>> sure why JDBC would be harder.
>>>>>
>>>>> Generally speaking, until Ignite provides native data frame
>>>>> integration, having JDBC-based integration out of the box is minimally
>>>>> acceptable.
>>>>>
>>>>>> Then, in general, I think JDBC was never designed for large data
>>>>>> volumes. It is for executing queries and getting a small or
>>>>>> aggregated result set back, or alternatively for inserting / updating
>>>>>> single rows.
>>>>>
>>>>> Agree in general. However, Ignite JDBC is designed to work with larger
>>>>> data volumes and supports data pagination automatically.
>>>>>
>>>>>>> On 3. Aug 2017, at 08:17, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>>>
>>>>>>> Jorn, thanks for your feedback!
>>>>>>>
>>>>>>> Can you explain how the direct support would be different from the
>>>>>>> JDBC support?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> D.
>>>>>>>
>>>>>>>> On Thu, Aug 3, 2017 at 7:40 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> These are two different things. Spark applications themselves do
>>>>>>>> not use JDBC - it is more for non-Spark applications to access
>>>>>>>> Spark DataFrames.
>>>>>>>>
>>>>>>>> Direct support by Ignite would make more sense. Although you have
>>>>>>>> IGFS in theory, that only helps if the user is using HDFS, which
>>>>>>>> might not be the case. It is now also very common to use object
>>>>>>>> stores, such as S3. Direct support could be leveraged for
>>>>>>>> interactive analysis or for different Spark applications sharing
>>>>>>>> data.
>>>>>>>>
>>>>>>>>> On 3. Aug 2017, at 05:12, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Igniters,
>>>>>>>>>
>>>>>>>>> We have had the integration with Spark Data Frames on our roadmap
>>>>>>>>> for a while:
>>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-3084
>>>>>>>>>
>>>>>>>>> However, while browsing the Spark documentation, I came across the
>>>>>>>>> generic JDBC data frame support in Spark:
>>>>>>>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
>>>>>>>>>
>>>>>>>>> Given that Ignite has a JDBC driver, does it mean that it
>>>>>>>>> transitively also supports Spark data frames? If yes, we should
>>>>>>>>> document it.
>>>>>>>>>
>>>>>>>>> D.
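
P.S. For whoever picks up the deeper integration discussed above: the
Spark-side entry point for Valentin's suggestion is a custom planning
Strategy registered via spark.experimental.extraStrategies. Below is a
bare, hypothetical skeleton (the object names are made up, and the actual
plan-to-SQL translation plus the physical node that runs the query on
Ignite are deliberately left out):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Matching on PhysicalOperation extracts the projections and filters that
// would have to be translated into an Ignite SQL query.
object IgniteSQLStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters, relation) =>
      // TODO: if `relation` is backed by Ignite, build a SQL string from
      // `projects` and `filters` and return a physical scan node that
      // executes it on the cluster.
      Nil // returning Nil falls back to Spark's built-in strategies
    case _ => Nil
  }
}

object IgniteStrategyRegistration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignite-strategy-sketch")
      .master("local[*]")
      .getOrCreate()
    // Extra strategies are consulted before Spark's own during planning.
    spark.experimental.extraStrategies = IgniteSQLStrategy :: Nil
    spark.stop()
  }
}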