Re: Text queries/indexes (GridLuceneIndex, @QueryTextFiled)

Ivan Pavlukhin Thu, 19 Sep 2019 01:39:22 -0700

Yuriy,

Greatly appreciate your interest.


Could you please elaborate a little on sorting? What tasks does
it help to solve, and how? An example would be great.

Wed, 18 Sep 2019 at 09:39, Alexei Scherbakov <[email protected]>:
>
> Denis,
>
> I like the idea of throwing an exception for enabled text queries on
> persistent caches.
>
> Also, I'm fine with the proposed limit for unsorted searches.
>
> Yury, please proceed with ticket creation.
>
> Tue, 17 Sep 2019, 22:06 Denis Magda <[email protected]>:
>
> > Igniters,
> >
> > I see nothing wrong with Yury's proposal regarding full-text search API
> > evolution as long as Yury is ready to push it forward.
> >
> > As for the in-memory mode only, it makes total sense for in-memory data
> > grid deployments where Ignite caches data of an underlying DB like Postgres.
> > As part of the changes, I would simply throw an exception (by default) if
> > one attempts to use text indices with native persistence enabled.
> > If the person is ready to live with that limitation, then an explicit
> > configuration change is needed to get around the exception.
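
A minimal sketch of the default-deny check with an explicit opt-in, in plain Java. The class, method, and flag names (TextIndexGuard, validate, allowTextIndexWithPersistence) are hypothetical and are not part of the Ignite API; only the shape of the check is taken from the proposal.

```java
/** Hypothetical guard; none of these names exist in Ignite. */
final class TextIndexGuard {
    private TextIndexGuard() {}

    /**
     * Rejects text indexes on a persistent cache unless the user opted in.
     *
     * @param hasTextIndex whether the cache declares any text-indexed field
     * @param persistenceEnabled whether native persistence is on for the cache
     * @param allowTextIndexWithPersistence assumed explicit opt-in flag
     */
    static void validate(boolean hasTextIndex, boolean persistenceEnabled,
                         boolean allowTextIndexWithPersistence) {
        if (hasTextIndex && persistenceEnabled && !allowTextIndexWithPersistence) {
            throw new IllegalStateException(
                "Text indexes are in-memory only and are not persisted; "
                    + "set the opt-in flag to use them with native persistence.");
        }
    }
}
```

With such a check, in-memory caches are unaffected, and a persistent cache fails fast at configuration time unless the limitation was accepted explicitly.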
> >
> > Thoughts?
> >
> >
> > -
> > Denis
> >
> >
> > On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <[email protected]> wrote:
> >
> > > Hello to all again,
> > >
> > > Thank you for the important comments and notes given below!
> > >
> > > Let me answer and continue the discussion.
> > >
> > > (I) Overall needs in Lucene indexing
> > >
> > > Alexei has referred to
> > > https://issues.apache.org/jira/browse/IGNITE-5371, where the
> > > absence of index persistence was declared an obstacle to further
> > > development.
> > >
> > > a) This ticket is already closed as not valid.
> > > b) There are definite needs (in our project as well) for purely
> > > in-memory indexing of selected data.
> > > We intend to use search capabilities for fetching a limited number of
> > > records that should be used in type-ahead search / suggestions.
> > > Not all of the data will be indexed, and there is no need for the Lucene
> > > index to be persistent. I hope this is a common pattern of text-search usage.
> > >
> > > (II) Necessary fixes in current implementation.
> > >
> > > a) Implementation of a correct *limit* (*offset* does not seem to be
> > > required in text-search tasks for now).
> > > I have investigated the data flow for distributed text queries using a
> > > simple test prefix query, like 'name'='ene*'.
> > > For now, each server node returns all matching records to the client node,
> > > which may amount to thousands or hundreds of thousands of records,
> > > even if we need only the first 10-100. Again, all the results are added to
> > > a queue in GridCacheQueryFutureAdapter in arbitrary order, page by page.
> > > I did not find any means here to deliver a deterministic result.
> > > So implementing the limit as part of the query (and of
> > > GridCacheQueryRequest) will not change the nature of the response, but it
> > > will limit the load on nodes and networking.
> > >
> > > Can we consider opening a ticket for this?
> > >
> > > (III) Further extension of Lucene API exposure to Ignite
> > >
> > > a) Sorting
> > > The solution for this could be:
> > > - Make entities comparable
> > > - Add custom comparator to entity
> > > - Add annotations to mark sorted fields for Lucene indexing
> > > - Use comparators when merging responses or reducing to the desired
> > > limit on the client node.
> > > This will require the full result set to be loaded into memory, though it
> > > can be used for relatively small limits.
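
As an illustration of the last bullet, here is a minimal plain-Java sketch of reducing pages that arrive in arbitrary order to the best `limit` entries with a custom comparator. TopNReducer and its methods are hypothetical names, not Ignite or Lucene API. The sketch uses a bounded heap, which is one possible way to keep client memory proportional to the limit rather than to the full result set.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical client-side reducer; not part of the Ignite API. */
final class TopNReducer<T> {
    private final int limit;
    private final Comparator<T> order;
    /** Heap ordered by the reversed comparator, so its head is the worst kept entry. */
    private final PriorityQueue<T> heap;

    TopNReducer(int limit, Comparator<T> order) {
        if (limit <= 0) throw new IllegalArgumentException("limit must be positive");
        this.limit = limit;
        this.order = order;
        this.heap = new PriorityQueue<>(order.reversed());
    }

    /** Accepts one page of results; pages may arrive in any order. */
    void addPage(Collection<? extends T> page) {
        for (T item : page) {
            if (heap.size() < limit) {
                heap.add(item);
            } else if (order.compare(item, heap.peek()) < 0) {
                heap.poll(); // drop the current worst entry
                heap.add(item);
            }
        }
    }

    /** Returns the kept entries sorted by the comparator. */
    List<T> result() {
        List<T> out = new ArrayList<>(heap);
        out.sort(order);
        return out;
    }
}
```

For the result to be deterministic regardless of page arrival order, the comparator should define a total order, for example by breaking ties on the cache key.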
> > > BR,
> > > Yuriy Shuliha
> > >
> > > On Fri, 30 Aug 2019 at 10:37, Alexei Scherbakov <[email protected]> wrote:
> > >
> > > > Yuriy,
> > > >
> > > > Note that one of the major blockers for text queries is [1], which makes
> > > > Lucene indexes unusable with persistence and is the main reason for
> > > > discontinuation.
> > > > It should probably be addressed first to make text queries a valid
> > > > product feature.
> > > >
> > > > Distributed sorting and advanced querying is indeed not a trivial task.
> > > > Some kind of merging must be implemented on the query-originating node.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371
> > > >
> > > > Thu, 29 Aug 2019 at 23:38, Denis Magda <[email protected]>:
> > > >
> > > > > Yuriy,
> > > > >
> > > > > If you are ready to take over the full-text search indexes, then
> > > > > please go ahead. The primary reasons why the community wants to
> > > > > discontinue them first (and, probably, resurrect them later) are the
> > > > > limitations listed by Andrey and minimal support from the community end.
> > > > >
> > > > > -
> > > > > Denis
> > > > >
> > > > >
> > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <[email protected]> wrote:
> > > > >
> > > > > > Hi Yuriy,
> > > > > >
> > > > > > > Unfortunately, there is a plan to discontinue TextQueries in Ignite [1].
> > > > > > > The motivation is that text indexes are not persistent, not
> > > > > > > transactional, and can't be used together with SQL or inside SQL,
> > > > > > > and there is a lack of interest from the community side.
> > > > > > > You are welcome to take on these issues and make TextQueries great.
> > > > > >
> > > > > > > 1. PageSize can't be used to limit the result set.
> > > > > > > Query results are returned from a data node to the client-side
> > > > > > > cursor page by page, and this parameter is designed to control the
> > > > > > > page size. The query is supposed to execute lazily on the server
> > > > > > > side, and the full result set is not expected to be loaded into
> > > > > > > memory on the server side at once, but page by page.
> > > > > > > Do you mean you found that Lucene loads the entire result set into
> > > > > > > memory before the first page is sent to the client?
> > > > > >
> > > > > > > I'd think a new parameter should be added to limit the result. The
> > > > > > > best solution is to use query language commands for this, e.g.
> > > > > > > "LIMIT/OFFSET" in SQL.
> > > > > >
> > > > > > > This task doesn't look trivial. A query is a distributed operation:
> > > > > > > the same user query will be executed on data nodes,
> > > > > > > and then the results from all nodes should be correctly merged before
> > > > > > > being returned via the client cursor.
> > > > > > > So, LIMIT should be applied on every node and then in the merge phase.
> > > > > >
> > > > > > > Also, this may be non-obvious: limiting results makes no sense
> > > > > > > without sorting, as there is no guarantee that every next query run
> > > > > > > will return the same data, because of page reordering.
> > > > > > > Basically, the merge phase receives results from data nodes
> > > > > > > asynchronously, and messages from different nodes can't be ordered.
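
The two-step LIMIT described in item 1 (on every node, then in the merge phase) can be sketched in plain Java as follows. LimitedMerge, localTop, and merge are hypothetical names, not Ignite API, and the sketch assumes each node can sort its local matches with the same comparator that the merging node uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical merge step; not part of the Ignite API. */
final class LimitedMerge {
    private LimitedMerge() {}

    /** Data-node side: sort local matches and keep only the first {@code limit}. */
    static <T> List<T> localTop(List<T> matches, Comparator<T> order, int limit) {
        List<T> sorted = new ArrayList<>(matches);
        sorted.sort(order);
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    /** Originating-node side: k-way merge of sorted per-node lists, cut at {@code limit}. */
    static <T> List<T> merge(List<List<T>> perNode, Comparator<T> order, int limit) {
        // Each heap entry is {node index, position in that node's list}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> order.compare(perNode.get(a[0]).get(a[1]), perNode.get(b[0]).get(b[1])));
        for (int n = 0; n < perNode.size(); n++) {
            if (!perNode.get(n).isEmpty()) heads.add(new int[] {n, 0});
        }
        List<T> out = new ArrayList<>();
        while (out.size() < limit && !heads.isEmpty()) {
            int[] head = heads.poll();
            List<T> src = perNode.get(head[0]);
            out.add(src.get(head[1]));
            // Advance within the list the smallest element came from.
            if (head[1] + 1 < src.size()) heads.add(new int[] {head[0], head[1] + 1});
        }
        return out;
    }
}
```

Because every entry of the global top `limit` is necessarily within its own node's top `limit`, cutting on each node first loses nothing, while each node sends at most `limit` entries over the network.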
> > > > > >
> > > > > > 2.
> > > > > > > a. "tokenize" param name (for @QueryTextField) looks more verbose,
> > > > > > > doesn't it?
> > > > > > > b,c. What about distributed queries? How will partial results from
> > > > > > > nodes be merged?
> > > > > > > Does Lucene allow configuring a comparator for data sorting?
> > > > > > > Which comparator should Ignite choose to sort results in the merge phase?
> > > > > >
> > > > > > > 3. For now the Lucene engine is not configurable at all. E.g. it is
> > > > > > > impossible to configure the Tokenizer.
> > > > > > > I'd think about possible ways to configure the engine first and only
> > > > > > > then go further to discuss/implement complex features
> > > > > > > that may depend on the engine config.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <[email protected]> wrote:
> > > > > >
> > > > > > > Dear community,
> > > > > > >
> > > > > > > > By starting this thread I'd like to open a discussion that would
> > > > > > > > lead to contributions in the subject area.
> > > > > > >
> > > > > > > > Ignite has indexing capabilities, backed by different mechanisms,
> > > > > > > > including Lucene.
> > > > > > >
> > > > > > > > Currently, Lucene 7.5.0 is used (last year's release).
> > > > > > > > This is a widespread and mature technology that covers the text
> > > > > > > > search area and beyond (e.g. spatial data indexing).
> > > > > > >
> > > > > > > > My goal is to *expose more Lucene functionality to Ignite indexing
> > > > > > > > and query mechanisms for text data*.
> > > > > > >
> > > > > > > > It's quite a simple request at the current stage. It comes from our
> > > > > > > > project's needs but, I believe, will be useful for a lot more people.
> > > > > > > > Let's walk through the items and vote on or discuss Jira tickets for
> > > > > > > > them.
> > > > > > >
> > > > > > > > 1. [trivial] Use dataQuery.getPageSize() to limit search response
> > > > > > > > items inside GridLuceneIndex.query(). Currently it calls
> > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all
> > > > > > > > scored matches will be returned, which we do not need in most cases.
> > > > > > >
> > > > > > > > 2. [simple] Add sorting. Then a more capable search call can be
> > > > > > > > executed: *IndexSearcher.search(query, count, sort)*
> > > > > > > > Implementation steps:
> > > > > > > > a) Introduce a boolean *sortField* parameter in the
> > > > > > > > *@QueryTextField* annotation. If *true*, the field will be indexed
> > > > > > > > but not tokenized. Number types are preferred here.
> > > > > > > > b) Add a *sort* collection to the *TextQuery* constructor. It should
> > > > > > > > define the desired sort fields used for querying.
> > > > > > > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > > > > > >
> > > > > > > > 3. [moderate] Build complex queries with *TextQuery*, including
> > > > > > > > terms/queries boosting.
> > > > > > > > *This section is for voting only, as it requires more detailed work.
> > > > > > > > It should be extended if the community is interested in it.*
> > > > > > >
> > > > > > > Looking forward to your comments!
> > > > > > >
> > > > > > > BR,
> > > > > > > Yuriy Shuliha
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Andrey V. Mashenkov
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best regards,
> > > > Alexei Scherbakov
> > > >
> > >
> >



-- 
Best regards,
Ivan Pavlukhin
