Re: Optimization of SQL queries from Spark Data Frame to Ignite

Valentin Kulichenko Wed, 29 Nov 2017 15:14:55 -0800

Nikolay,

Can you please create a separate ticket for the strategy implementation
then? Any idea on how long will it take?


As for querying a partition, both SqlQuery and SqlFieldQuery allow to
specify set of partitions to work with (see setPartitions method). I think
that should be enough.

-Val

On Wed, Nov 29, 2017 at 3:39 AM, Vladimir Ozerov <[email protected]>
wrote:

> Hi Nikolay,
>
> No, it is not possible to get this info from public API, neither we planned
> to expose it. See IGNITE-4509 and commit *fbf0e353* to get better
> understanding on how this was implemented.
>
> Vladimir.
>
> On Wed, Nov 29, 2017 at 2:01 PM, Николай Ижиков <[email protected]>
> wrote:
>
> > Hello, Vladimir.
> >
> > > partition pruning is already implemented in Ignite, so there is no need
> > to do this on your own.
> >
> > Spark work with partitioned data set.
> > It is required to provide data partition information to Spark from custom
> > Data Source(Ignite).
> >
> > Can I get information about pruned partitions throw some public API?
> > Is there a plan or ticket to implement such API?
> >
> >
> >
> > 2017-11-29 10:34 GMT+03:00 Vladimir Ozerov <[email protected]>:
> >
> > > Nikolay,
> > >
> > > Regarding p3. - partition pruning is already implemented in Ignite, so
> > > there is no need to do this on your own.
> > >
> > > On Wed, Nov 29, 2017 at 3:23 AM, Valentin Kulichenko <
> > > [email protected]> wrote:
> > >
> > > > Nikolay,
> > > >
> > > > Custom strategy allows to fully process the AST generated by Spark
> and
> > > > convert it to Ignite SQL, so there will be no execution on Spark side
> > at
> > > > all. This is what we are trying to achieve here. Basically, one will
> be
> > > > able to use DataFrame API to execute queries directly on Ignite. Does
> > it
> > > > make sense to you?
> > > >
> > > > I would recommend you to take a look at MemSQL implementation which
> > does
> > > > similar stuff: https://github.com/memsql/memsql-spark-connector
> > > >
> > > > Note that this approach will work only if all relations included in
> AST
> > > are
> > > > Ignite tables. Otherwise, strategy should return null so that Spark
> > falls
> > > > back to its regular mode. Ignite will be used as regular data source
> in
> > > > this case, and probably it's possible to implement some optimizations
> > > here
> > > > as well. However, I never investigated this and it seems like another
> > > > separate discussion.
> > > >
> > > > -Val
> > > >
> > > > On Tue, Nov 28, 2017 at 9:54 AM, Николай Ижиков <
> > [email protected]>
> > > > wrote:
> > > >
> > > > > Hello, guys.
> > > > >
> > > > > I have implemented basic support of Spark Data Frame API [1], [2]
> for
> > > > > Ignite.
> > > > > Spark provides API for a custom strategy to optimize queries from
> > spark
> > > > to
> > > > > underlying data source(Ignite).
> > > > >
> > > > > The goal of optimization(obvious, just to be on the same page):
> > > > > Minimize data transfer between Spark and Ignite.
> > > > > Speedup query execution.
> > > > >
> > > > > I see 3 ways to optimize queries:
> > > > >
> > > > >         1. *Join Reduce* If one make some query that join two or
> more
> > > > > Ignite tables, we have to pass all join info to Ignite and transfer
> > to
> > > > > Spark only result of table join.
> > > > >         To implement it we have to extend current implementation
> with
> > > new
> > > > > RelationProvider that can generate all kind of joins for two or
> more
> > > > tables.
> > > > >         We should add some tests, also.
> > > > >         The question is - how join result should be partitioned?
> > > > >
> > > > >
> > > > >         2. *Order by* If one make some query to Ignite table with
> > order
> > > > by
> > > > > clause we can execute sorting on Ignite side.
> > > > >         But it seems that currently Spark doesn’t have any way to
> > tell
> > > > > that partitions already sorted.
> > > > >
> > > > >
> > > > >         3. *Key filter* If one make query with `WHERE key = XXX` or
> > > > `WHERE
> > > > > key IN (X, Y, Z)`, we can reduce number of partitions.
> > > > >         And query only partitions that store certain key values.
> > > > >         Is this kind of optimization already built in Ignite or I
> > > should
> > > > > implement it by myself?
> > > > >
> > > > > May be, there is any other way to make queries run faster?
> > > > >
> > > > > [1] https://spark.apache.org/docs/latest/sql-programming-guide.
> html
> > > > > [2] https://github.com/apache/ignite/pull/2742
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Nikolay Izhikov
> > [email protected]
> >
>

Re: Optimization of SQL queries from Spark Data Frame to Ignite

Reply via email to