Ok, many thanks, will look at geospatial index 2018-08-16 14:10 GMT+06:00 Alexey Goncharuk <alexey.goncha...@gmail.com>:
> Alexey, > > I am not sure if it will be a proper fir for you, but I think it worth a > try. > > Apache Ignite has an option to index geospatial data using third-party > libraries (note that it is not included in the default Ignite build, the > module is activated via the lgpl profile). The index is located in > Ignite-geospatial module and uses an r-tree implementation underneath. One > downside of this module is that the geospatial index is not supported for > the Ignite native persistence yet. > > Hope this helps! > --AG > > чт, 16 авг. 2018 г. в 6:21, Alexey Zinoviev <zaleslaw....@gmail.com>: > > > Sorry, for the delay, dear Pavel and Denis. > > > > Yes, I need a kind of indexing to improve KNN algorithms during training > > the model. > > > > In my draft solution I've implemented building of > > https://en.wikipedia.org/wiki/K-d_tree > > <https://en.wikipedia.org/wiki/K-d_tree> on the training data set. > > It collects the information about data distributed in our specific ML > > Datasets (distributed by nodes over Ignite cache) > > > > Pavel, you ask me any questions and I've prepared answers. > > > > 1) Should be this index in-memory only or you want to persist it? > > *Maybe it should be persisted (to reuse that for next predictions)* > > > > 2) In general index can be implemented in two ways per-partition and > > per-node. > > *Thank you for your explanation. Of course the per-node is better.* > > > > 3) I think K-d tree is preferable > > *You are absolutely right, it should be K-d tree* > > > > 4) Could you please share use cases you're trying to solve with QuadTree? > > With > > close to real data and examples of queries? > > > > I need to solve *k-nearest neighbors search task *on a set of vectors > with > > unique keys presented in Ignite Cache (training set), > > during the training phase I'm going to build a temporary index as a > KD-Tree > > (based on distance between vectors). > > > > The distance metric is a parameter too. > > > > After that, in prediction phase the *k-nearest neighbors search task *is > > solved by brute-force iteration over all vectors to find the *k-nearest > > neighbors.* > > I'd like to improve the search part with queries to index to extract > > closest vectors. > > > > Of course, it's kind of experiment, and maybe it's bad idea to patch SQL > > internals to solve this private task, but as you mentioned it can be a > good > > point of collaboration between ML and SQL components. > > > > Can I get one of the implemented indices as a primary example and > implement > > it by myself? > > Could you recommend something as an start point for me? > > > > Thanks for your help. > > > > > > > > > > 2018-08-04 11:18 GMT+06:00 Denis Magda <dma...@apache.org>: > > > > > Alexey, are you working on some new ML/DL APIs/algorithms? Please > > elaborate > > > what you'd like to add to Ignite. > > > > > > -- > > > Denis > > > > > > On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <jokse...@gmail.com> > > wrote: > > > > > > > Hello Alexey, > > > > > > > > It's not so difficult to implement new type of indexing of data, but > if > > > you > > > > want to reach performance in distributed environment you need to have > > > > strong knowledge of a data you're indexing and what kind of queries > you > > > > want to execute. > > > > Should be this index in-memory only or you want to persist it? > > > > In case of persistence your index should fit our page memory model > > > > requirements. > > > > In both cases your index should be ready to work in concurrent > > > environment. > > > > > > > > In general index can be implemented in two ways per-partition and > > > per-node. > > > > Per-partition may be efficient if you have a lot of points (x,y) > > > > representing a big one, e.g. image. In this case it's required that > all > > > of > > > > these points will be in one partition that query e.g. makes images > > > > intersection will execute in one node. But if you have multiple > images, > > > > your index will contain also another points from other object and > will > > > > overload it. > > > > Per-node may be efficient if you have a lot of points (x,y) that are > > > > independent of each other, that you will use it as spatial e.g.. But > in > > > > this case, I think K-d tree is preferable as it can be used in more > > wide > > > > way. > > > > > > > > Could you please share use cases you're trying to solve with > QuadTree? > > > With > > > > close to real data and examples of queries? Because now the question > is > > > too > > > > abstract and it's hard to understand how it should be implemented to > > > reach > > > > good results. > > > > > > > > > > > > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <zaleslaw....@gmail.com>: > > > > > > > > > Hi, Igniters. > > > > > > > > > > Currently I'm working on different math stuff over the Apache > Ignite > > > and > > > > in > > > > > a few tasks I need to implement in memory something like this > > > > > > > > > > https://en.wikipedia.org/wiki/Quadtree > > > > > > > > > > I didn't find such index in Apache Ignite, but maybe it's under > > > > development > > > > > by somebody? > > > > > > > > > > Is it a difficult to add a new index type to our distributed SQL > > (from > > > > > point of view of different infrastructure issues and so on P.S I > > don't > > > > > worry the math stuff here because I've implemented it many times in > > > > > non-distributed version)? > > > > > > > > > > It will be great to hear any kind of your thoughts and maybe > somebody > > > > could > > > > > help with implementation > > > > > > > > > > zaleslaw > > > > > > > > > > > > > > >