Re: Text queries/indexes (GridLuceneIndex, @QueryTextField)

Павлухин Иван Fri, 27 Sep 2019 08:49:17 -0700

Yuriy,

Thank you for providing details! Quite interesting.


Yes, we already support distributed limit and merging of sorted
subresults for SQL queries. E.g. ReduceIndexSorted and
MergeStreamIterator are used for merging sorted streams.

Could you please also clarify how score/relevance works? Is it provided by
the Lucene engine for each query result? I am thinking about how to do a
sorted merge properly in this case.
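
To make the question concrete, here is a minimal sketch of a score-ordered merge with a limit. It assumes each node returns its hits already sorted by descending score and that scores from different nodes are directly comparable, which may not hold if every node scores against its own local index statistics. ScoredHit, Cursor and mergeTopK are hypothetical names used only for illustration; they are not Ignite or Lucene APIs, and this is not how ReduceIndexSorted or MergeStreamIterator are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a k-way merge by score; not part of Ignite or Lucene. */
final class ScoreMergeSketch {
    /** Hypothetical result holder: a cache key plus the score reported by a node. */
    static final class ScoredHit {
        final String key;
        final float score;

        ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }

        @Override public String toString() {
            return key + "@" + score;
        }
    }

    /** Cursor over one node's hits, which must already be sorted by descending score. */
    private static final class Cursor {
        final Iterator<ScoredHit> it;
        ScoredHit head;

        Cursor(Iterator<ScoredHit> it) {
            this.it = it;
            this.head = it.next(); // Callers only create cursors for non-empty lists.
        }

        /** Moves to the next hit; returns false when this node's stream is exhausted. */
        boolean advance() {
            if (!it.hasNext())
                return false;
            head = it.next();
            return true;
        }
    }

    /** Merges per-node sorted lists and returns at most {@code limit} best-scored hits. */
    static List<ScoredHit> mergeTopK(List<List<ScoredHit>> perNode, int limit) {
        // The heap keeps the cursor with the highest current score on top.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.head.score, a.head.score));

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits.iterator()));
        }

        List<ScoredHit> out = new ArrayList<>();

        while (out.size() < limit && !heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);

            // Re-insert the cursor only if its node still has hits left.
            if (c.advance())
                heap.add(c);
        }

        return out;
    }

    public static void main(String[] args) {
        List<List<ScoredHit>> perNode = Arrays.asList(
            Arrays.asList(new ScoredHit("a", 0.9f), new ScoredHit("b", 0.4f)),
            Arrays.asList(new ScoredHit("c", 0.7f), new ScoredHit("d", 0.1f)),
            Collections.<ScoredHit>emptyList());

        System.out.println(mergeTopK(perNode, 3));
    }
}
```

With a merge like this, each node would need to send at most `limit` hits, and the merging side stops as soon as `limit` results have been collected.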

On Wed, 25 Sep 2019 at 18:56, Yuriy Shuliga <shul...@gmail.com> wrote:
>
> Ivan,
>
> Thank you for the interesting question!
>
> Text searches (or full-text searches) are mostly human-oriented, and the
> user is mainly interested in the topmost part of the response.
> The user can then read it, evaluate it, and use the given records for
> further purposes.
>
> In our case in particular, we use Ignite for operations with financial data,
> and there is a lot of text content such as asset names, financial
> instruments, companies, etc.
> In order to operate with this quickly and reliably, users are used to
> working with text search, type-ahead completions, and suggestions.
>
> For these purposes we index particular string data in separate caches.
>
> Sorting capabilities and response size limits are very important
> there, as our API has to provide the most relevant information within a
> limited response size.
>
> Now let me comment from the Ignite/Lucene perspective.
> Actually, Ignite queries Lucene, and Lucene returns *TopDocs.scoreDocs* already
> sorted by *score* (relevance), so the most relevant documents are at the top.
> Currently, however, distributed query responses from different nodes are
> merged into the final query cursor queue in an arbitrary way,
> so in fact the score order is already ruined here. Also, Ignite
> requests all possible documents from Lucene, which is redundant and not good
> for performance.
>
> I'm implementing a *limit* parameter as part of *TextQuery* and have to
> note that we still need to add sorting to text query processing in
> order to get usable results.
>
> The *limit* parameter itself should mitigate some of the issues above, but
> sorting, at least by document score, should definitely be implemented along
> with limit.
>
> This is a pretty short commentary; if you still have any questions, please
> do not hesitate to ask.
>
> BR,
> Yuriy Shuliha
>
> On Thu, 19 Sep 2019 at 11:38, Павлухин Иван <vololo...@gmail.com> wrote:
>
> > Yuriy,
> >
> > Greatly appreciate your interest.
> >
> > Could you please elaborate a little bit about sorting? What tasks does
> > it help to solve and how? It would be great to provide an example.
> >
> > On Wed, 18 Sep 2019 at 09:39, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > >
> > > Denis,
> > >
> > > I like the idea of throwing an exception when text queries are enabled on
> > > persistent caches.
> > >
> > > Also, I'm fine with the proposed limit for unsorted searches.
> > >
> > > Yury, please proceed with ticket creation.
> > >
> > > On Tue, 17 Sep 2019 at 22:06, Denis Magda <dma...@apache.org> wrote:
> > >
> > > > Igniters,
> > > >
> > > > I see nothing wrong with Yury's proposal regarding the evolution of the
> > > > full-text search API, as long as Yury is ready to push it forward.
> > > >
> > > > As for the in-memory-only mode, it makes total sense for in-memory data
> > > > grid deployments where Ignite caches data of an underlying DB like
> > > > Postgres.
> > > > As part of the changes, I would simply throw an exception (by default)
> > > > if someone attempts to use text indexes with native persistence enabled.
> > > > If the person is ready to live with that limitation, then an explicit
> > > > configuration change is needed to get around the exception.
> > > >
> > > > Thoughts?
> > > >
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <shul...@gmail.com> wrote:
> > > >
> > > > > Hello to all again,
> > > > >
> > > > > Thank you for the important comments and notes given below!
> > > > >
> > > > > Let me answer and continue the discussion.
> > > > >
> > > > > (I) Overall need for Lucene indexing
> > > > >
> > > > > Alexei has referred to
> > > > > https://issues.apache.org/jira/browse/IGNITE-5371 where the
> > > > > absence of index persistence was declared an obstacle to further
> > > > > development.
> > > > >
> > > > > a) This ticket is already closed as not valid.
> > > > > b) There is a definite need (in our project as well) for purely
> > > > > in-memory indexing of selected data.
> > > > > We intend to use search capabilities for fetching a limited number of
> > > > > records to be used in type-ahead search / suggestions.
> > > > > Not all of the data will be indexed, and there is no need for the
> > > > > Lucene index to be persistent. I hope this is a common pattern of
> > > > > text-search usage.
> > > > >
> > > > > (II) Necessary fixes in current implementation.
> > > > >
> > > > > a) Implementation of a correct *limit* (*offset* does not seem to be
> > > > > required in text-search tasks for now).
> > > > > I have investigated the data flow for distributed text queries. It was
> > > > > a simple test prefix query, like 'name'='ene*'.
> > > > > For now each server node returns all response records to the client
> > > > > node, and the response may contain thousands or hundreds of thousands
> > > > > of records, even if we need only the first 10-100. Again, all the
> > > > > results are added to the queue in GridCacheQueryFutureAdapter in
> > > > > arbitrary order, page by page.
> > > > > I did not find any means here to deliver a deterministic result.
> > > > > So implementing limit as part of the query (and GridCacheQueryRequest)
> > > > > will not change the nature of the response, but it will limit the load
> > > > > on nodes and networking.
> > > > >
> > > > > Can we consider opening a ticket for this?
> > > > >
> > > > > (III) Further exposure of the Lucene API to Ignite
> > > > >
> > > > > a) Sorting
> > > > > The solution for this could be:
> > > > > - Make entities comparable
> > > > > - Add a custom comparator to the entity
> > > > > - Add annotations to mark sorted fields for Lucene indexing
> > > > > - Use comparators when merging responses or reducing to the desired
> > > > >   limit on the client node.
> > > > > This will require the full result set to be loaded into memory, though
> > > > > it can be used for relatively small limits.
> > > > > BR,
> > > > > Yuriy Shuliha
> > > > >
> > > > > On Fri, 30 Aug 2019 at 10:37, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > > > >
> > > > > > Yuriy,
> > > > > >
> > > > > > Note that one of the major blockers for text queries is [1], which
> > > > > > makes Lucene indexes unusable with persistence and is the main reason
> > > > > > for the discontinuation.
> > > > > > It should probably be addressed first to make text queries a valid
> > > > > > product feature.
> > > > > >
> > > > > > Distributed sorting and advanced querying is indeed not a trivial task.
> > > > > > Some kind of merging must be implemented on the query-originating node.
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371
> > > > > >
> > > > > > On Thu, 29 Aug 2019 at 23:38, Denis Magda <dma...@apache.org> wrote:
> > > > > >
> > > > > > > Yuriy,
> > > > > > >
> > > > > > > If you are ready to take over the full-text search indexes, then
> > > > > > > please go ahead. The primary reasons why the community wants to
> > > > > > > discontinue them first (and, probably, resurrect them later) are the
> > > > > > > limitations listed by Andrey and minimal support from the community.
> > > > > > >
> > > > > > > -
> > > > > > > Denis
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Yuriy,
> > > > > > > >
> > > > > > > > Unfortunately, there is a plan to discontinue TextQueries in
> > > > > > > > Ignite [1].
> > > > > > > > The motivation is that text indexes are not persistent, not
> > > > > > > > transactional, and can't be used together with SQL or inside SQL,
> > > > > > > > and there is a lack of interest from the community side.
> > > > > > > > You are welcome to take on these issues and make TextQueries great.
> > > > > > > >
> > > > > > > > 1. PageSize can't be used to limit the result set.
> > > > > > > > Query results are returned from a data node to the client-side
> > > > > > > > cursor page by page, and this parameter is designed to control the
> > > > > > > > page size. The query is supposed to execute lazily on the server
> > > > > > > > side, and the full result set is not expected to be loaded into
> > > > > > > > memory on the server side at once, but page by page.
> > > > > > > > Do you mean you found that Lucene loads the entire result set into
> > > > > > > > memory before the first page is sent to the client?
> > > > > > > >
> > > > > > > > I think a new parameter should be added to limit the result. The
> > > > > > > > best solution is to use query language commands for this, e.g.
> > > > > > > > "LIMIT/OFFSET" in SQL.
> > > > > > > >
> > > > > > > > This task doesn't look trivial. A query is a distributed operation:
> > > > > > > > the same user query will be executed on the data nodes, and then the
> > > > > > > > results from all nodes should be correctly merged before being
> > > > > > > > returned via the client cursor.
> > > > > > > > So, LIMIT should be applied on every node and then again in the
> > > > > > > > merge phase.
> > > > > > > >
> > > > > > > > Also, this may be non-obvious: limiting results makes no sense
> > > > > > > > without sorting, as there is no guarantee that every subsequent
> > > > > > > > query run will return the same data, because of page reordering.
> > > > > > > > Basically, the merge phase receives results from data nodes
> > > > > > > > asynchronously, and messages from different nodes can't be ordered.
> > > > > > > >
> > > > > > > > 2.
> > > > > > > > a. The "tokenize" param name (for @QueryTextField) looks more
> > > > > > > > verbose, doesn't it?
> > > > > > > > b, c. What about distributed queries? How will partial results from
> > > > > > > > nodes be merged?
> > > > > > > > Does Lucene allow configuring a comparator for data sorting?
> > > > > > > > What comparator should Ignite choose to sort the result in the
> > > > > > > > merge phase?
> > > > > > > >
> > > > > > > > 3. For now the Lucene engine is not configurable at all. E.g. it is
> > > > > > > > impossible to configure the Tokenizer.
> > > > > > > > I'd think about possible ways to configure the engine first, and
> > > > > > > > only then go further to discuss/implement complex features
> > > > > > > > that may depend on the engine config.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <shul...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Dear community,
> > > > > > > > >
> > > > > > > > > By starting this thread I'd like to open a discussion that would
> > > > > > > > > lead to contributions in the subject area.
> > > > > > > > >
> > > > > > > > > Ignite has indexing capabilities, backed by different mechanisms,
> > > > > > > > > including Lucene.
> > > > > > > > >
> > > > > > > > > Currently, Lucene 7.5.0 is used (last year's release).
> > > > > > > > > This is a widespread and mature technology that covers the text
> > > > > > > > > search area and beyond (e.g. spatial data indexing).
> > > > > > > > >
> > > > > > > > > My goal is to *expose more Lucene functionality to Ignite indexing
> > > > > > > > > and query mechanisms for text data*.
> > > > > > > > >
> > > > > > > > > It's quite a simple request at the current stage. It comes from our
> > > > > > > > > project's needs, but I believe it will be useful for a lot more
> > > > > > > > > people.
> > > > > > > > > Let's walk through the items and vote on or discuss Jira tickets for
> > > > > > > > > them.
> > > > > > > > >
> > > > > > > > > 1. [trivial] Use dataQuery.getPageSize() to limit search response
> > > > > > > > > items inside GridLuceneIndex.query(). Currently it calls
> > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all
> > > > > > > > > scored matches will be returned, which we do not need in most cases.
> > > > > > > > >
> > > > > > > > > 2. [simple] Add sorting. Then a more capable search call can be
> > > > > > > > > executed: *IndexSearcher.search(query, count, sort)*
> > > > > > > > > Implementation steps:
> > > > > > > > > a) Introduce a boolean *sortField* parameter in the
> > > > > > > > > *@QueryTextField* annotation. If *true*, the field will be indexed
> > > > > > > > > but not tokenized. Numeric types are preferred here.
> > > > > > > > > b) Add a *sort* collection to the *TextQuery* constructor. It should
> > > > > > > > > define the desired sort fields used for querying.
> > > > > > > > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > > > > > > > >
> > > > > > > > > 3. [moderate] Build complex queries with *TextQuery*, including
> > > > > > > > > term/query boosting.
> > > > > > > > > *This section is for voting only, as it requires more detailed work.
> > > > > > > > > It should be extended if the community is interested in it.*
> > > > > > > > >
> > > > > > > > > Looking forward to your comments!
> > > > > > > > >
> > > > > > > > > BR,
> > > > > > > > > Yuriy Shuliha
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Andrey V. Mashenkov
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Best regards,
> > > > > > Alexei Scherbakov
> > > > > >
> > > > >
> > > >
> >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
> >



-- 
Best regards,
Ivan Pavlukhin
