Yuriy,

Note that one of the major blockers for text queries is [1], which makes Lucene indexes unusable with persistence and is the main reason for the discontinuation. It should probably be addressed first to make text queries a valid product feature.
Distributed sorting and advanced querying are indeed not trivial tasks. Some kind of merging must be implemented on the query-originating node.

[1] https://issues.apache.org/jira/browse/IGNITE-5371

Thu, Aug 29, 2019 at 23:38, Denis Magda <[email protected]>:

> Yuriy,
>
> If you are ready to take over the full-text search indexes then please go
> ahead. The primary reasons why the community wants to discontinue them first
> (and, probably, resurrect later) are the limitations listed by Andrey and
> minimal support from the community end.
>
> -
> Denis
>
>
> On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <[email protected]>
> wrote:
>
> > Hi Yuriy,
> >
> > Unfortunately, there is a plan to discontinue TextQueries in Ignite [1].
> > The motivation is that text indexes are not persistent, not transactional,
> > and can't be used together with SQL or inside SQL,
> > and there is a lack of interest from the community side.
> > You are welcome to take on these issues and make TextQueries great.
> >
> > 1. PageSize can't be used to limit the result set.
> > Query results are returned from the data node to the client-side cursor
> > page by page, and this parameter is designed to control the page size.
> > The query is supposed to execute lazily on the server side, and the full
> > result set is not expected to be loaded into memory on the server side at
> > once, but by pages.
> > Do you mean you found that Lucene loads the entire result set into memory
> > before the first page is sent to the client?
> >
> > I'd think a new parameter should be added to limit the result. The best
> > solution is to use query language commands for this, e.g. "LIMIT/OFFSET"
> > in SQL.
> >
> > This task doesn't look trivial. A query is a distributed operation: the
> > same user query will be executed on the data nodes, and then the results
> > from all nodes should be correctly merged before being returned via the
> > client cursor.
> > So, LIMIT should be applied on every node and then again in the merge phase.
> >
> > Also, this may be non-obvious, but limiting results makes no sense without
> > sorting, as there is no guarantee that every next query run will return the
> > same data, because of page reordering.
> > Basically, the merge phase receives results from the data nodes
> > asynchronously, and messages from different nodes can't be ordered.
> >
> > 2.
> > a. "tokenize" as the param name (for @QueryTextField) looks more verbose,
> > doesn't it?
> > b,c. What about a distributed query? How will partial results from the
> > nodes be merged?
> > Does Lucene allow configuring a comparator for data sorting?
> > What comparator should Ignite choose to sort the result in the merge phase?
> >
> > 3. For now the Lucene engine is not configurable at all. E.g. it is
> > impossible to configure the Tokenizer.
> > I'd think about possible ways to configure the engine first, and only then
> > go further to discuss\implement complex features
> > that may depend on the engine config.
> >
> >
> >
> > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <[email protected]> wrote:
> >
> > > Dear community,
> > >
> > > By starting this chain I'd like to open a discussion that would lead to
> > > contributions in the subj. area.
> > >
> > > Ignite has indexing capabilities, backed by different mechanisms,
> > > including Lucene.
> > >
> > > Currently, Lucene 7.5.0 is used (last year's release).
> > > This is a widespread and mature technology that covers the text search
> > > area and beyond (e.g. spatial data indexing).
> > >
> > > My goal is to *expose more Lucene functionality to Ignite indexing and
> > > query mechanisms for text data*.
> > >
> > > It's quite a simple request at the current stage. It is coming from our
> > > project's needs, but I believe it will be useful for a lot more people.
> > > Let's walk through the items and vote on or discuss Jira tickets for them.
> > >
> > > 1.[trivial] Use dataQuery.getPageSize() to limit search response items
> > > inside GridLuceneIndex.query(). Currently it is calling
> > > IndexSearcher.search(query, *Integer.MAX_VALUE*) - so basically all
> > > scored matches will be returned, which we do not need in most cases.
> > >
> > > 2.[simple] Add sorting. Then a more capable search call can be
> > > executed: *IndexSearcher.search(query, count, sort)*
> > > Implementation steps:
> > > a) Introduce a boolean *sortField* parameter in the *@QueryTextField*
> > > annotation. If *true*, the field will be indexed but not tokenized.
> > > Number types are preferred here.
> > > b) Add a *sort* collection to the *TextQuery* constructor. It should
> > > define the desired sort fields used for querying.
> > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > >
> > > 3.[moderate] Build complex queries with *TextQuery*, including
> > > terms/queries boosting.
> > > *This section is for voting only, as it requires more detailed work. It
> > > should be extended if the community is interested in it.*
> > >
> > > Looking forward to your comments!
> > >
> > > BR,
> > > Yuriy Shuliha
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >
>

--
Best regards,
Alexei Scherbakov
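
For illustration only, here is a minimal sketch of the merge step discussed in this thread: each data node applies the limit locally, and the query-originating node merges the sorted partial results and applies the limit again. It is plain Java and does not use any Ignite or Lucene API. The names TopKMerge, ScoredDoc, localTopK and merge are hypothetical, and the ordering (score descending, ties broken by key) is an assumption chosen so that repeated runs return the same rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration; not part of Ignite or Lucene. */
final class TopKMerge {
    /** A scored hit as a data node might report it (hypothetical type). */
    static final class ScoredDoc {
        final String key;
        final float score;

        ScoredDoc(String key, float score) {
            this.key = key;
            this.score = score;
        }

        @Override public String toString() {
            return key + ":" + score;
        }
    }

    /** Assumed ordering: higher score first, ties broken by key for determinism. */
    static final Comparator<ScoredDoc> ORDER = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : a.key.compareTo(b.key);
    };

    /** What each data node would do: sort its local hits and keep at most 'limit'. */
    static List<ScoredDoc> localTopK(List<ScoredDoc> localHits, int limit) {
        List<ScoredDoc> sorted = new ArrayList<>(localHits);
        sorted.sort(ORDER);
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    /** What the originating node would do: k-way merge of sorted lists, cut at 'limit'. */
    static List<ScoredDoc> merge(List<List<ScoredDoc>> perNode, int limit) {
        // Each queue entry is {node index, position within that node's list}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (x, y) -> ORDER.compare(perNode.get(x[0]).get(x[1]),
                                    perNode.get(y[0]).get(y[1])));
        for (int n = 0; n < perNode.size(); n++) {
            if (!perNode.get(n).isEmpty())
                heads.add(new int[] {n, 0});
        }

        List<ScoredDoc> out = new ArrayList<>();
        while (out.size() < limit && !heads.isEmpty()) {
            int[] head = heads.poll();
            List<ScoredDoc> src = perNode.get(head[0]);
            out.add(src.get(head[1]));
            // Advance the cursor of the node we just consumed from.
            if (head[1] + 1 < src.size())
                heads.add(new int[] {head[0], head[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        int limit = 3;
        List<ScoredDoc> nodeA = localTopK(Arrays.asList(
            new ScoredDoc("a1", 0.9f), new ScoredDoc("a2", 0.4f),
            new ScoredDoc("a3", 0.7f), new ScoredDoc("a4", 0.1f)), limit);
        List<ScoredDoc> nodeB = localTopK(Arrays.asList(
            new ScoredDoc("b1", 0.8f), new ScoredDoc("b2", 0.7f),
            new ScoredDoc("b3", 0.2f)), limit);
        System.out.println(merge(Arrays.asList(nodeA, nodeB), limit));
    }
}
```

With the sample data in main, the merged keys come out as a1, b1, a3. Because every node already returns at most 'limit' sorted hits, the originating node never needs more than 'limit' rows from any single node to produce a correct global top 'limit'.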
