Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Alexey Zinoviev Thu, 16 Aug 2018 04:23:11 -0700

Ok, many thanks, will look at geospatial index

2018-08-16 14:10 GMT+06:00 Alexey Goncharuk <[email protected]>:


> Alexey,
>
> I am not sure if it will be a proper fir for you, but I think it worth a
> try.
>
> Apache Ignite has an option to index geospatial data using third-party
> libraries (note that it is not included in the default Ignite build, the
> module is activated via the lgpl profile). The index is located in
> Ignite-geospatial module and uses an r-tree implementation underneath. One
> downside of this module is that the geospatial index is not supported for
> the Ignite native persistence yet.
>
> Hope this helps!
> --AG
>
> чт, 16 авг. 2018 г. в 6:21, Alexey Zinoviev <[email protected]>:
>
> > Sorry, for the delay, dear Pavel and Denis.
> >
> > Yes, I need a kind of indexing to improve KNN algorithms during training
> > the model.
> >
> > In my draft solution I've implemented building of
> > https://en.wikipedia.org/wiki/K-d_tree
> > <https://en.wikipedia.org/wiki/K-d_tree> on the training data set.
> > It collects the information about data distributed in our specific ML
> > Datasets (distributed by nodes over Ignite cache)
> >
> > Pavel, you ask me any questions and I've prepared answers.
> >
> > 1) Should be this index in-memory only or you want to persist it?
> > *Maybe it should be persisted (to reuse that for next predictions)*
> >
> > 2) In general index can be implemented in two ways per-partition and
> > per-node.
> > *Thank you for your explanation. Of course the per-node is better.*
> >
> > 3) I think K-d tree is preferable
> > *You are absolutely right, it should be K-d tree*
> >
> > 4) Could you please share use cases you're trying to solve with QuadTree?
> > With
> > close to real data and examples of queries?
> >
> > I need to solve *k-nearest neighbors search task *on a set of vectors
> with
> > unique keys presented in Ignite Cache (training set),
> > during the training phase I'm going to build a temporary index as a
> KD-Tree
> > (based on distance between vectors).
> >
> > The distance metric is a parameter too.
> >
> > After that, in prediction phase the *k-nearest neighbors search task *is
> > solved by brute-force iteration over all vectors to find the *k-nearest
> > neighbors.*
> > I'd like to improve the search part with queries to index to extract
> > closest vectors.
> >
> > Of course, it's kind of experiment, and maybe it's bad idea to patch SQL
> > internals to solve this private task, but as you mentioned it can be a
> good
> > point of collaboration between ML and SQL components.
> >
> > Can I get one of the implemented indices as a primary example and
> implement
> > it by myself?
> > Could you recommend something as an start point for me?
> >
> > Thanks for your help.
> >
> >
> >
> >
> > 2018-08-04 11:18 GMT+06:00 Denis Magda <[email protected]>:
> >
> > > Alexey, are you working on some new ML/DL APIs/algorithms? Please
> > elaborate
> > > what you'd like to add to Ignite.
> > >
> > > --
> > > Denis
> > >
> > > On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <[email protected]>
> > wrote:
> > >
> > > > Hello Alexey,
> > > >
> > > > It's not so difficult to implement new type of indexing of data, but
> if
> > > you
> > > > want to reach performance in distributed environment you need to have
> > > > strong knowledge of a data you're indexing and what kind of queries
> you
> > > > want to execute.
> > > > Should be this index in-memory only or you want to persist it?
> > > > In case of persistence your index should fit our page memory model
> > > > requirements.
> > > > In both cases your index should be ready to work in concurrent
> > > environment.
> > > >
> > > > In general index can be implemented in two ways per-partition and
> > > per-node.
> > > > Per-partition may be efficient if you have a lot of points (x,y)
> > > > representing a big one, e.g. image. In this case it's required that
> all
> > > of
> > > > these points will be in one partition that query e.g. makes images
> > > > intersection will execute in one node. But if you have multiple
> images,
> > > > your index will contain also another points from other object and
> will
> > > > overload it.
> > > > Per-node may be efficient if you have a lot of points (x,y) that are
> > > > independent of each other, that you will use it as spatial e.g.. But
> in
> > > > this case, I think K-d tree is preferable as it can be used in more
> > wide
> > > > way.
> > > >
> > > > Could you please share use cases you're trying to solve with
> QuadTree?
> > > With
> > > > close to real data and examples of queries? Because now the question
> is
> > > too
> > > > abstract and it's hard to understand how it should be implemented to
> > > reach
> > > > good results.
> > > >
> > > >
> > > > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <[email protected]>:
> > > >
> > > > > Hi, Igniters.
> > > > >
> > > > > Currently I'm working on different math stuff over the Apache
> Ignite
> > > and
> > > > in
> > > > > a few tasks I need to implement in memory something like this
> > > > >
> > > > > https://en.wikipedia.org/wiki/Quadtree
> > > > >
> > > > > I didn't find such index in Apache Ignite, but maybe it's under
> > > > development
> > > > > by somebody?
> > > > >
> > > > > Is it a difficult to add a new index type to our distributed SQL
> > (from
> > > > > point of view of different infrastructure issues and so on P.S I
> > don't
> > > > > worry the math stuff here because I've implemented it many times in
> > > > > non-distributed version)?
> > > > >
> > > > > It will be great to hear any kind of your thoughts and maybe
> somebody
> > > > could
> > > > > help with implementation
> > > > >
> > > > > zaleslaw
> > > > >
> > > >
> > >
> >
>

Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Reply via email to