Andrey,

Per your request, I created ticket https://issues.apache.org/jira/browse/IGNITE-12291 linked to https://issues.apache.org/jira/projects/IGNITE/issues/IGNITE-12189
Could you please proceed with the PR merge?

BR,
Yuriy Shuliha

On Wed, Oct 9, 2019 at 12:52, Andrey Mashenkov <[email protected]> wrote:

> Hi Yuri,
>
> To get access to TC Bot you should register as a TeamCity user [1], if you haven't done so already.
> Then you will be able to authorize on the Ignite TC Bot page with the same credentials.
>
> [1] https://ci.ignite.apache.org/registerUser.html
>
> On Fri, Oct 4, 2019 at 3:10 PM Yuriy Shuliga <[email protected]> wrote:
>
>> Andrew,
>>
>> I have corrected the PR according to your notes. Please review.
>> What will be the next steps in order to merge it in?
>>
>> Y.
>>
>> On Thu, Oct 3, 2019 at 17:47, Andrey Mashenkov <[email protected]> wrote:
>>
>>> Yuri,
>>>
>>> I'm done with the review.
>>> No crime found, but a trivial compatibility bug.
>>>
>>> On Thu, Oct 3, 2019 at 3:54 PM Yuriy Shuliga <[email protected]> wrote:
>>>
>>>> Denis,
>>>>
>>>> Thank you for your attention to this.
>>>> As of now, the https://issues.apache.org/jira/browse/IGNITE-12189 ticket is still pending review.
>>>> Do we have a chance to move it forward somehow?
>>>>
>>>> BR,
>>>> Yuriy Shuliha
>>>>
>>>> On Mon, Sep 30, 2019 at 23:35, Denis Magda <[email protected]> wrote:
>>>>
>>>>> Yuriy,
>>>>>
>>>>> I've seen you opening a pull-request with the first changes:
>>>>> https://issues.apache.org/jira/browse/IGNITE-12189
>>>>>
>>>>> Alex Scherbakov and Ivan, are you the right guys to do the review?
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>> On Fri, Sep 27, 2019 at 8:48 AM Павлухин Иван <[email protected]> wrote:
>>>>>
>>>>>> Yuriy,
>>>>>>
>>>>>> Thank you for providing details! Quite interesting.
>>>>>>
>>>>>> Yes, we already have support for distributed limit and merging of sorted subresults for SQL queries. E.g. ReduceIndexSorted and MergeStreamIterator are used for merging sorted streams.
>>>>>>
>>>>>> Could you please also clarify about score/relevance? Is it provided by the Lucene engine for each query result? I am thinking about how to do the sorted merge properly in this case.
>>>>>>
>>>>>> On Wed, Sep 25, 2019 at 18:56, Yuriy Shuliga <[email protected]> wrote:
>>>>>>
>>>>>>> Ivan,
>>>>>>>
>>>>>>> Thank you for the interesting question!
>>>>>>>
>>>>>>> Text searches (or full-text searches) are mostly human-oriented, and the point of the user's interest is the topmost part of the response.
>>>>>>> The user can then read it, evaluate it, and use the given records for further purposes.
>>>>>>>
>>>>>>> Particularly in our case, we use Ignite for operations with financial data, and there is a lot of text there, such as asset names, financial instruments, companies, etc.
>>>>>>> To operate with this quickly and reliably, users are used to working with text search, type-ahead completions, and suggestions.
>>>>>>>
>>>>>>> For this purpose we index particular string data in separate caches.
>>>>>>>
>>>>>>> Sorting capabilities and response size limits are very important there, as our API has to provide the most relevant information within a limited size.
>>>>>>>
>>>>>>> Now let me comment on the Ignite/Lucene perspective.
>>>>>>> Actually, Ignite queries Lucene, and Lucene returns *TopDocs.scoreDocs* already sorted by *score* (relevance). So the most relevant documents are on top.
>>>>>>> Currently, distributed query responses from different nodes are merged into the final query cursor queue in an arbitrary way.
>>>>>>> So in fact the score order is already ruined here.
>>>>>>> Also, Ignite requests all possible documents from Lucene, which is redundant and bad for performance.
>>>>>>>
>>>>>>> I'm implementing the *limit* parameter as part of *TextQuery*, and I have to note that we still need to add sorting to text query processing in order to get usable results.
>>>>>>>
>>>>>>> The *limit* parameter itself should resolve part of the issues above, but sorting by document score, at least, should definitely be implemented along with the limit.
>>>>>>>
>>>>>>> This is a pretty short commentary. If you still have any questions, please do not hesitate to ask :)
>>>>>>>
>>>>>>> BR,
>>>>>>> Yuriy Shuliha
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 11:38, Павлухин Иван <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yuriy,
>>>>>>>>
>>>>>>>> Greatly appreciate your interest.
>>>>>>>>
>>>>>>>> Could you please elaborate a little bit about sorting? What tasks does it help to solve and how? It would be great if you could provide an example.
>>>>>>>>
>>>>>>>> On Wed, Sep 18, 2019 at 09:39, Alexei Scherbakov <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Denis,
>>>>>>>>>
>>>>>>>>> I like the idea of throwing an exception when text queries are enabled on persistent caches.
>>>>>>>>>
>>>>>>>>> Also, I'm fine with the proposed limit for unsorted searches.
>>>>>>>>>
>>>>>>>>> Yury, please proceed with ticket creation.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 17, 2019 at 22:06, Denis Magda <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Igniters,
>>>>>>>>>>
>>>>>>>>>> I see nothing wrong with Yury's proposal regarding the evolution of the full-text search API, as long as Yury is ready to push it forward.
>>>>>>>>>>
>>>>>>>>>> As for the in-memory-only mode, it makes total sense for in-memory data grid deployments where Ignite caches data of an underlying DB like Postgres.
>>>>>>>>>> As part of the changes, I would simply throw an exception (by default) if someone attempts to use text indices with native persistence enabled.
>>>>>>>>>> If the person is ready to live with that limitation, then an explicit configuration change is needed to get around the exception.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello to all again,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for the important comments and notes given below!
>>>>>>>>>>>
>>>>>>>>>>> Let me answer and continue the discussion.
>>>>>>>>>>>
>>>>>>>>>>> (I) Overall needs in Lucene indexing
>>>>>>>>>>>
>>>>>>>>>>> Alexei has referred to https://issues.apache.org/jira/browse/IGNITE-5371, where the absence of index persistence was declared an obstacle to further development.
>>>>>>>>>>>
>>>>>>>>>>> a) This ticket is already closed as not valid.
>>>>>>>>>>> b) There is a definite need (in our project as well) for purely in-memory indexing of selected data.
>>>>>>>>>>> We intend to use search capabilities for fetching a limited number of records to be used in type-ahead search / suggestions.
>>>>>>>>>>> Not all of the data will be indexed, and there is no need for the Lucene index to be persistent. I hope this is a common pattern of text-search usage.
>>>>>>>>>>>
>>>>>>>>>>> (II) Necessary fixes in the current implementation.
>>>>>>>>>>>
>>>>>>>>>>> a) Implementation of a correct *limit* (*offset* does not seem to be required in text-search tasks for now)
>>>>>>>>>>> I have investigated the data flow for distributed text queries. It was a simple test prefix query, like 'name'='ene*'.
>>>>>>>>>>> For now, each server node returns all response records to the client node, and the response may contain thousands or hundreds of thousands of records,
>>>>>>>>>>> even if we need only the first 10-100. Again, all the results are added to the queue in GridCacheQueryFutureAdapter in arbitrary order, by pages.
>>>>>>>>>>> I did not find any means here to deliver a deterministic result.
>>>>>>>>>>> So implementing the limit as part of the query (and GridCacheQueryRequest) will not change the nature of the response, but it will limit the load on nodes and networking.
>>>>>>>>>>>
>>>>>>>>>>> Can we consider opening a ticket for this?
>>>>>>>>>>>
>>>>>>>>>>> (III) Further exposure of the Lucene API to Ignite
>>>>>>>>>>>
>>>>>>>>>>> a) Sorting
>>>>>>>>>>> The solution for this could be:
>>>>>>>>>>> - Make entities comparable
>>>>>>>>>>> - Add a custom comparator to the entity
>>>>>>>>>>> - Add annotations to mark sorted fields for Lucene indexing
>>>>>>>>>>> - Use comparators when merging responses or reducing to the desired limit on the client node.
>>>>>>>>>>> This will require the full result set to be loaded into memory, though it can be used for relatively small limits.
>>>>>>>>>>>
>>>>>>>>>>> BR,
>>>>>>>>>>> Yuriy Shuliha
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 30, 2019 at 10:37, Alexei Scherbakov <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yuriy,
>>>>>>>>>>>>
>>>>>>>>>>>> Note that one of the major blockers for text queries is [1], which makes Lucene indexes unusable with persistence and is the main reason for the discontinuation.
>>>>>>>>>>>> It should probably be addressed first to make text queries a valid product feature.
>>>>>>>>>>>>
>>>>>>>>>>>> Distributed sorting and advanced querying is indeed not a trivial task.
>>>>>>>>>>>> Some kind of merging must be implemented on the query-originating node.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-5371
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 29, 2019 at 23:38, Denis Magda <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yuriy,
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you are ready to take over the full-text search indexes, then please go ahead. The primary reasons why the community wants to discontinue them first (and, probably, resurrect them later) are the limitations listed by Andrey and minimal support from the community end.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Yuriy,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately, there is a plan to discontinue TextQueries in Ignite [1].
>>>>>>>>>>>>>> The motivation is that text indexes are not persistent, not transactional, and can't be used together with SQL or inside SQL,
>>>>>>>>>>>>>> and there is a lack of interest from the community side.
>>>>>>>>>>>>>> You are welcome to take on these issues and make TextQueries great.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. PageSize can't be used to limit the result set.
>>>>>>>>>>>>>> Query results are returned from the data node to the client-side cursor page by page, and this parameter is designed to control the page size. The query is supposed to execute lazily on the server side, and the full result set is not expected to be loaded into memory on the server side at once, but by pages.
>>>>>>>>>>>>>> Do you mean you found that Lucene loads the entire result set into memory before the first page is sent to the client?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd think a new parameter should be added to limit the result. The best solution is to use query language commands for this, e.g. "LIMIT/OFFSET" in SQL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This task doesn't look trivial. A query is a distributed operation: the same user query will be executed on the data nodes, and then the results from all nodes should be correctly merged before being returned via the client cursor.
>>>>>>>>>>>>>> So, LIMIT should be applied on every node and then again in the merge phase.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, this may be non-obvious, but limiting results makes no sense without sorting, as there is no guarantee that every subsequent query run will return the same data, because of page reordering.
>>>>>>>>>>>>>> Basically, the merge phase receives results from data nodes asynchronously, and messages from different nodes can't be ordered.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>> a. The "tokenize" param name (for @QueryTextField) looks more verbose, doesn't it?
>>>>>>>>>>>>>> b, c. What about distributed queries? How will partial results from the nodes be merged?
>>>>>>>>>>>>>> Does Lucene allow configuring a comparator for data sorting?
>>>>>>>>>>>>>> What comparator should Ignite choose to sort the result in the merge phase?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. For now, the Lucene engine is not configurable at all. E.g. it is impossible to configure the Tokenizer.
>>>>>>>>>>>>>> I'd first think about possible ways to configure the engine, and only then go further to discuss/implement complex features that may depend on the engine config.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dear community,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> By starting this thread I'd like to open a discussion that would lead to contributions in the subject area.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ignite has indexing capabilities, backed by different mechanisms, including Lucene.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently, Lucene 7.5.0 is used (last year's release).
>>>>>>>>>>>>>>> This is a widespread and mature technology that covers the text search area and beyond (e.g. spatial data indexing).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My goal is to *expose more Lucene functionality to Ignite indexing and query mechanisms for text data*.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's quite a simple request at the current stage. It comes from our project's needs, but I believe it will be useful for a lot more people.
>>>>>>>>>>>>>>> Let's walk through the items below and vote on or discuss Jira tickets for them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. [trivial] Use dataQuery.getPageSize() to limit search response items inside GridLuceneIndex.query(). Currently it calls IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all scored matches will be returned, which we do not need in most cases.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. [simple] Add sorting. Then a more capable search call can be executed: *IndexSearcher.search(query, count, sort)*
>>>>>>>>>>>>>>> Implementation steps:
>>>>>>>>>>>>>>> a) Introduce a boolean *sortField* parameter in the *@QueryTextField* annotation. If *true*, the field will be indexed but not tokenized. Number types are preferred here.
>>>>>>>>>>>>>>> b) Add a *sort* collection to the *TextQuery* constructor. It should define the desired sort fields used for querying.
>>>>>>>>>>>>>>> c) Implement Lucene sort usage in GridLuceneIndex.query().
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3. [moderate] Build complex queries with *TextQuery*, including terms/queries boosting.
>>>>>>>>>>>>>>> *This section is for voting only, as it requires more detailed work.
>>>>>>>>>>>>>>> Should be extended if the community is interested in it.*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to your comments!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BR,
>>>>>>>>>>>>>>> Yuriy Shuliha
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Andrey V. Mashenkov
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Alexei Scherbakov
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Ivan Pavlukhin
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Ivan Pavlukhin
>>>
>>> --
>>> Best regards,
>>> Andrey V. Mashenkov
>
> --
> Best regards,
> Andrey V. Mashenkov
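
For reference, the "limit on every node, then again on merge" idea discussed in this thread can be sketched in plain Java. This is only an illustration under stated assumptions, not Ignite's or Lucene's actual code: `ScoredEntry`, `Cursor`, and `mergeTopK` are hypothetical names, and each per-node list is assumed to arrive already sorted by descending score and truncated to the limit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class ScoreMergeSketch {
    /** Hypothetical holder for one search hit: a cache key plus its relevance score. */
    static final class ScoredEntry {
        final String key;
        final float score;

        ScoredEntry(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Read position inside one node's response (assumed sorted by descending score). */
    private static final class Cursor {
        final List<ScoredEntry> hits;
        int pos;

        Cursor(List<ScoredEntry> hits) {
            this.hits = hits;
        }

        ScoredEntry head() {
            return hits.get(pos);
        }
    }

    /** K-way merge of per-node responses that stops as soon as 'limit' hits are collected. */
    static List<ScoredEntry> mergeTopK(List<List<ScoredEntry>> perNode, int limit) {
        // Max-heap keyed by the score of each cursor's current head.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head().score).reversed());

        for (List<ScoredEntry> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<ScoredEntry> out = new ArrayList<>();

        while (out.size() < limit && !heap.isEmpty()) {
            Cursor c = heap.poll();   // cursor whose head has the best remaining score
            out.add(c.head());
            c.pos++;                  // advance only after removal from the heap

            if (c.pos < c.hits.size())
                heap.add(c);          // re-insert so it is ordered by its new head
        }

        return out;
    }
}
```

Because each node already applied the limit locally, the merge never needs more than `limit` entries per node, and the result no longer depends on the order in which pages arrive.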
