Re: Calcite based SQL query engine. Local queries

Dmitriy Pavlov Fri, 08 Nov 2019 05:13:37 -0800

Yes, I understand that it is a straightforward and maybe naive approach.
That is why I'm asking how to do map-reduce on cache C data in Ignite with
proper partition pinning.


About predefined/implemented aggregates: I'm not sure I agree that we can
predict everything. It is a real perk of Ignite that you can send any of your
code (which, BTW, can be developed during the lifetime of the system) to your
data.

So I propose that the map and reduce phases should allow user code to be
executed. If I knew of a better approach, I would document it (e.g. add it
to an upcoming training/workshop).
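
For illustration only, here is a minimal plain-Java sketch of what user code
in both phases could look like. It deliberately uses no Ignite API: the class
LocalMapReduceSketch, its methods mapReduce and average, and the map of
partition id to entries are all hypothetical stand-ins, and partition pinning
is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only: no Ignite classes are used, and partition
// pinning (which a real cluster would need) is not modelled here.
class LocalMapReduceSketch {

    /**
     * Runs the user-supplied mapper once per partition, then hands all
     * partial results to the user-supplied reducer.
     */
    static <E, P, R> R mapReduce(Map<Integer, List<E>> partitions,
                                 Function<List<E>, P> mapper,
                                 Function<Collection<P>, R> reducer) {
        List<P> partials = new ArrayList<>();
        for (List<E> partitionData : partitions.values()) {
            partials.add(mapper.apply(partitionData));
        }
        return reducer.apply(partials);
    }

    /** Example user code: a global average built from per-partition {sum, count} pairs. */
    static double average(Map<Integer, List<Integer>> partitions) {
        // Map phase: each partition yields its own sum and entry count.
        Function<List<Integer>, long[]> mapper = data -> {
            long sum = 0;
            for (int v : data) sum += v;
            return new long[] {sum, data.size()};
        };
        // Reduce phase: combine the partial sums and counts into one average.
        Function<Collection<long[]>, Double> reducer = partials -> {
            long sum = 0, count = 0;
            for (long[] p : partials) { sum += p[0]; count += p[1]; }
            return count == 0 ? 0.0 : (double) sum / count;
        };
        return mapReduce(partitions, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> partitions = new HashMap<>();
        partitions.put(0, Arrays.asList(10, 20));
        partitions.put(1, Arrays.asList(30));
        partitions.put(2, new ArrayList<>());
        System.out.println(average(partitions));
    }
}
```

The point of the sketch is that AVG cannot be reduced from per-partition
averages alone, so the user decides what the partial result looks like. That
is the kind of freedom a fixed set of predefined aggregates does not give.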

Sincerely,
Dmitriy Pavlov

On Fri, 8 Nov 2019 at 15:45, Ivan Pavlukhin <[email protected]> wrote:

> Dmitriy,
>
> First, what kind of cumulative metric can it be? A lot of cumulative
> metrics can be computed using SQL. MIN, MAX, AVG are simple ones. For
> more complex ones I can think of user-defined aggregate functions
> (UDAF). We do not have them in Ignite so far, but can introduce them.
>
> Second, naive approaches to such a ComputeScan can lead to incorrect
> results, as partitions might not be properly pinned and duplicate
> entries might appear.
>
> On Fri, 8 Nov 2019 at 15:27, Dmitriy Pavlov <[email protected]> wrote:
> >
> > Hi Ivan, Igniters, imagine you need to scan all entities in the cluster.
> >
> > Ideally, you don't want to deserialize all of the entries, so you can use
> > withKeepBinary(). E.g. you need a couple of fields and want to compute
> > some cumulative metric on this data. You can send a compute task to all
> > cluster nodes and run SQL scan queries there with local mode on. In that
> > manner you can implement Map-Reduce.
> >
> > There may be another way of doing that, so I encourage you to share it.
> > I could update the workshops/training I am preparing in the background.
> >
> > Sincerely,
> > Dmitriy Pavlov
> >
> > On Fri, 8 Nov 2019 at 08:57, Ivan Pavlukhin <[email protected]> wrote:
> >
> > > Denis,
> > >
> > > To make things really clear, we need to provide a concrete example
> > > of Compute + LocalSQL and reason about it to figure out whether a
> > > "smart" SQL engine can deliver the same (or better) results or not.
> > >
> > > On Fri, 8 Nov 2019 at 01:48, Denis Magda <[email protected]> wrote:
> > > >
> > > > Folks,
> > > >
> > > > See our compute tasks as an advanced version of stored procedures that
> > > > let the users code the logic of various complexity with Java, .NET or
> > > > C++ (and not with PL/SQL). The logic can use a combination of APIs
> > > > (key-value, SQL, etc.) to access data both locally and remotely while
> > > > being executed on server nodes. The logic can make N key-value
> > > > requests or run M SQL queries.
> > > >
> > > > We kept supporting local SQL queries exactly for such scenarios (for
> > > > our version of stored procedures) to ensure the distributed
> > > > map-reduce phase is canceled if all the data is local. And
> > > > affinityCalls were improved one day to pin the partitions.
> > > >
> > > > If the new engine is smart enough to understand that all the
> > > > partitions are available locally during the affinityRun execution
> > > > then it's totally fine to remove the 'local' flag. Otherwise, we need
> > > > to instruct the engine manually that a distributed phase is redundant
> > > > via 'local' flag or by other means.
> > > >
> > > > Does it make things clearer?
> > > >
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Thu, Nov 7, 2019 at 3:53 AM Ivan Pavlukhin <[email protected]> wrote:
> > > >
> > > > > Stephen,
> > > > >
> > > > > In my understanding we need to do a better job to realize use-cases
> > > > > of Compute + LocalSQL ourselves.
> > > > >
> > > > > Ideally, a smart optimizer should do the best job of query deployment.
> > > > >
> > > > > On Thu, 7 Nov 2019 at 13:04, Stephen Darlington <[email protected]> wrote:
> > > > > >
> > > > > > I made a (bad) assumption that this would also affect queries
> > > > > > against partitions. If “setLocal()” goes away but “setPartitions()”
> > > > > > remains I’m happy.
> > > > > >
> > > > > > What I would say is that the “broadcast / local” method is one I
> > > > > > see fairly often. Do we need to do a better job educating people
> > > > > > of the “correct” way?
> > > > > >
> > > > > > Regards,
> > > > > > Stephen
> > > > > >
> > > > > > > On 7 Nov 2019, at 08:30, Alexey Goncharuk <[email protected]> wrote:
> > > > > > >
> > > > > > > Denis, Stephen,
> > > > > > >
> > > > > > > > Running a local query in a broadcast closure won't work on
> > > > > > > > changing topology. We specifically added an affinityCall method
> > > > > > > > to the compute API in order to pin a partition to prevent its
> > > > > > > > moving and eviction throughout the task execution. Therefore,
> > > > > > > > the query inside an affinityCall is always executed against
> > > > > > > > some partitions (otherwise the query may give incorrect results
> > > > > > > > when topology is changed).
> > > > > > >
> > > > > > > > I support Igor's question and think that the 'local' flag for
> > > > > > > > the query should be deprecated and eventually removed. A 'local'
> > > > > > > > query can always be expressed as a query against a set of
> > > > > > > > partitions. If those partitions are located on the same node,
> > > > > > > > good: we get fast and correct results. If not, we may either
> > > > > > > > raise an exception and ask the user to remap the query, or fall
> > > > > > > > back to a distributed query execution.
> > > > > > >
> > > > > > > > Given that the Calcite prototype is in its early stages, it's
> > > > > > > > likely its first version will be available in 3.x, and it's a
> > > > > > > > good chance to get rid of wrong API pieces.
> > > > > > >
> > > > > > > --AG
> > > > > > >
> > > > > > > > On Mon, 4 Nov 2019 at 14:02, Stephen Darlington <[email protected]> wrote:
> > > > > > >
> > > > > > > >> A common use case is where you want to work on many rows of
> > > > > > > >> data across the grid. You’d broadcast a closure, running the
> > > > > > > >> same code on every node with just the local data. SQL doesn’t
> > > > > > > >> work in isolation; it’s often used as a filter for future
> > > > > > > >> computations.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Stephen
> > > > > > >>
> > > > > > > >>> On 1 Nov 2019, at 17:53, Ivan Pavlukhin <[email protected]> wrote:
> > > > > > >>>
> > > > > > >>> Denis,
> > > > > > >>>
> > > > > > > >>> I am mostly concerned about gathering use cases. It would be
> > > > > > > >>> great to critically assess such cases to identify why they
> > > > > > > >>> cannot be solved by using distributed SQL. Also it sounds
> > > > > > > >>> similar to some kind of "hints", but very limited and with all
> > > > > > > >>> the drawbacks of hints (impossibility to use the full strength
> > > > > > > >>> of CBO). We can provide better "hints" support with the new
> > > > > > > >>> engine as well.
> > > > > > >>>
> > > > > > > >>> On Fri, 1 Nov 2019 at 20:14, Denis Magda <[email protected]> wrote:
> > > > > > >>>>
> > > > > > >>>> Ivan,
> > > > > > >>>>
> > > > > > > >>>> I was involved in a couple of such use cases personally, so,
> > > > > > > >>>> that's not my imagination ;) Even more, as far as I remember,
> > > > > > > >>>> the primary reason why we improved our affinityRuns ensuring
> > > > > > > >>>> no partition is purged from a node until a task is completed
> > > > > > > >>>> is because many users were running local SQL from compute
> > > > > > > >>>> tasks and needed a guarantee that SQL will always return a
> > > > > > > >>>> correct result set.
> > > > > > >>>>
> > > > > > >>>> -
> > > > > > >>>> Denis
> > > > > > >>>>
> > > > > > >>>>
> > > > > > > >>>> On Fri, Nov 1, 2019 at 10:01 AM Ivan Pavlukhin <[email protected]> wrote:
> > > > > > >>>>
> > > > > > >>>>> Denis,
> > > > > > >>>>>
> > > > > > > >>>>> Would be nice to see real use-cases of affinity call + local
> > > > > > > >>>>> SQL combination. Generally, new engine will be able to infer
> > > > > > > >>>>> collocation resulting in the same collocated execution
> > > > > > > >>>>> automatically.
> > > > > > >>>>>
> > > > > > > >>>>> On Fri, 1 Nov 2019 at 19:11, Denis Magda <[email protected]> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>> Hi Igor,
> > > > > > >>>>>>
> > > > > > > >>>>>> Local queries feature is broadly used together with
> > > > > > > >>>>>> affinity-based compute tasks:
> > > > > > > >>>>>>
> > > > > > > >>>>>> https://apacheignite.readme.io/docs/collocate-compute-and-data#section-affinity-call-and-run-methods
> > > > > > >>>>>>
> > > > > > > >>>>>> The use case is as follows. The user knows that all required
> > > > > > > >>>>>> data needed for computation is collocated, and SQL is used
> > > > > > > >>>>>> as an advanced API for data retrieval from the computation
> > > > > > > >>>>>> code. The affinity task ensures that partitions won't be
> > > > > > > >>>>>> discarded from the node(s) if the topology changes during
> > > > > > > >>>>>> the task execution and, thus, it's safe to run SQL locally
> > > > > > > >>>>>> skipping distributed phases.
> > > > > > >>>>>>
> > > > > > > >>>>>> The combination of affinity compute tasks with local SQL is
> > > > > > > >>>>>> a real and valuable use case, and this is what we need to
> > > > > > > >>>>>> support with Calcite. Do you see any challenges?
> > > > > > >>>>>>
> > > > > > >>>>>> -
> > > > > > >>>>>> Denis
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > > >>>>>> On Fri, Nov 1, 2019 at 8:46 AM Roman Kondakov <[email protected]> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> Hi Igor!
> > > > > > >>>>>>>
> > > > > > > >>>>>>> IMO we need to maintain the backward compatibility between
> > > > > > > >>>>>>> old and new query engines as much as possible. And
> > > > > > > >>>>>>> therefore we shouldn't change the behavior of local
> > > > > > > >>>>>>> queries.
> > > > > > >>>>>>>
> > > > > > > >>>>>>> So, for local queries Calcite's planner shouldn't consider
> > > > > > > >>>>>>> the distribution trait at all.
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> --
> > > > > > >>>>>>> Kind Regards
> > > > > > >>>>>>> Roman Kondakov
> > > > > > >>>>>>>
> > > > > > >>>>>>> On 01.11.2019 17:07, Seliverstov Igor wrote:
> > > > > > >>>>>>>> Hi Igniters,
> > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Working on the new generation of Ignite SQL, I faced a
> > > > > > > >>>>>>>> question: «Do we need local queries at all and, if so,
> > > > > > > >>>>>>>> what semantics should they have?».
> > > > > > >>>>>>>>
> > > > > > > >>>>>>>> The current planning flow consists of the following steps:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> 1) Parsing SQL to AST
> > > > > > > >>>>>>>> 2) Validating AST (against Schema)
> > > > > > > >>>>>>>> 3) Optimizing (Building execution graph)
> > > > > > > >>>>>>>> 4) Splitting (into query fragments which execute on
> > > > > > > >>>>>>>>    target nodes)
> > > > > > > >>>>>>>> 5) Mapping (query fragments to nodes/partitions)
> > > > > > >>>>>>>>
> > > > > > > >>>>>>>> At the last step we check that all Fragment sources (a
> > > > > > > >>>>>>>> table or result) have the same distribution (in other
> > > > > > > >>>>>>>> words, all sources have to be co-located).
> > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Planner and Splitter guarantee that all caches in a
> > > > > > > >>>>>>>> Fragment are co-located; otherwise an Exchange is
> > > > > > > >>>>>>>> produced. But if we force local execution we cannot
> > > > > > > >>>>>>>> produce Exchanges, which means we may face two
> > > > > > > >>>>>>>> non-co-located caches inside a single query fragment
> > > > > > > >>>>>>>> (the result of local query planning is a single query
> > > > > > > >>>>>>>> fragment). So, we cannot pass the check.
> > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Should we throw an exception, omit the check for local
> > > > > > > >>>>>>>> query planning, or prohibit local queries altogether?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Your thoughts?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Regards,
> > > > > > >>>>>>>> Igor
> > > > > > >>>>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> --
> > > > > > >>>>> Best regards,
> > > > > > >>>>> Ivan Pavlukhin
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Best regards,
> > > > > > >>> Ivan Pavlukhin
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> > >
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>
