Re: Text queries/indexes (GridLuceneIndex, @QueryTextField)

Yuriy Shuliga Thu, 17 Oct 2019 06:43:07 -0700

  Andrey,

Per your request, I created ticket
https://issues.apache.org/jira/browse/IGNITE-12291, linked to
https://issues.apache.org/jira/projects/IGNITE/issues/IGNITE-12189


Could you please proceed with the PR merge?

BR,
Yuriy Shuliha

On Wed, Oct 9, 2019 at 12:52 Andrey Mashenkov <[email protected]> wrote:

> Hi Yuri,
>
> To get access to TC Bot you should register as a TeamCity user [1], if you
> haven't done so already.
> Then you will be able to log in on the Ignite TC Bot page with the same
> credentials.
>
> [1] https://ci.ignite.apache.org/registerUser.html
>
> On Fri, Oct 4, 2019 at 3:10 PM Yuriy Shuliga <[email protected]> wrote:
>
>> Andrew,
>>
>> I have corrected the PR according to your notes. Please review.
>> What are the next steps to get it merged?
>>
>> Y.
>>
>> On Thu, Oct 3, 2019 at 17:47 Andrey Mashenkov <[email protected]> wrote:
>>
>> > Yuri,
>> >
>> > I'm done with the review.
>> > Nothing criminal found, just a trivial compatibility bug.
>> >
>> > On Thu, Oct 3, 2019 at 3:54 PM Yuriy Shuliga <[email protected]> wrote:
>> >
>> > > Denis,
>> > >
>> > > Thank you for your attention to this.
>> > > As of now, the https://issues.apache.org/jira/browse/IGNITE-12189
>> > > ticket is still pending review.
>> > > Is there a chance to move it forward somehow?
>> > >
>> > > BR,
>> > > Yuriy Shuliha
>> > >
>> > > On Mon, Sep 30, 2019 at 23:35 Denis Magda <[email protected]> wrote:
>> > >
>> > > > Yuriy,
>> > > >
>> > > > I've seen you opening a pull-request with the first changes:
>> > > > https://issues.apache.org/jira/browse/IGNITE-12189
>> > > >
>> > > > Alex Scherbakov and Ivan, are you the right guys to do the review?
>> > > >
>> > > > -
>> > > > Denis
>> > > >
>> > > >
>> > > > On Fri, Sep 27, 2019 at 8:48 AM Павлухин Иван <[email protected]> wrote:
>> > > >
>> > > > > Yuriy,
>> > > > >
>> > > > > Thank you for providing details! Quite interesting.
>> > > > >
>> > > > > Yes, we already have support for distributed limit and merging of
>> > > > > sorted subresults for SQL queries. E.g. ReduceIndexSorted and
>> > > > > MergeStreamIterator are used for merging sorted streams.
>> > > > >
>> > > > > Could you please also clarify about score/relevance? Is it provided
>> > > > > by the Lucene engine for each query result? I am thinking about how
>> > > > > to do a sorted merge properly in this case.
>> > > > >
>> > > > > On Wed, Sep 25, 2019 at 18:56, Yuriy Shuliga <[email protected]> wrote:
>> > > > > >
>> > > > > > Ivan,
>> > > > > >
>> > > > > > Thank you for the interesting question!
>> > > > > >
>> > > > > > Text searches (or full-text searches) are mostly human-oriented,
>> > > > > > and the user's point of interest is the topmost part of the
>> > > > > > response. The user can then read it, evaluate it and use the
>> > > > > > given records for further purposes.
>> > > > > >
>> > > > > > In our case in particular, we use Ignite for operations with
>> > > > > > financial data, and there is a lot of text there, such as asset
>> > > > > > names, financial instruments, companies, etc.
>> > > > > > To operate on it quickly and reliably, users are used to working
>> > > > > > with text search, type-ahead completions and suggestions.
>> > > > > >
>> > > > > > For these purposes we index particular string data in separate
>> > > > > > caches.
>> > > > > >
>> > > > > > Sorting capabilities and response size limits are very important
>> > > > > > there, as our API has to provide the most relevant information
>> > > > > > within a limited response size.
>> > > > > >
>> > > > > > Now let me comment on this from the Ignite/Lucene perspective.
>> > > > > > Ignite queries Lucene, which returns *TopDocs.scoreDocs* already
>> > > > > > sorted by *score* (relevance), so the most relevant documents are
>> > > > > > at the top.
>> > > > > > Currently, the responses to a distributed query from different
>> > > > > > nodes are merged into the final query cursor queue in arbitrary
>> > > > > > order, so the score order is already lost at this point. Also,
>> > > > > > Ignite requests all matching documents from Lucene, which is
>> > > > > > redundant and bad for performance.
>> > > > > >
>> > > > > > I'm implementing a *limit* parameter as part of *TextQuery*, and I
>> > > > > > have to note that we still need to add sorting to text query
>> > > > > > processing in order to get usable results.
>> > > > > >
>> > > > > > The *limit* parameter by itself should mitigate part of the issues
>> > > > > > above, but sorting by document score, at least, should definitely
>> > > > > > be implemented along with the limit.
>> > > > > >
>> > > > > > This is a pretty short commentary; if you still have any
>> > > > > > questions, please do not hesitate to ask.
>> > > > > >
>> > > > > > BR,
>> > > > > > Yuriy Shuliha
>> > > > > >
>> > > > > > On Thu, Sep 19, 2019 at 11:38 Павлухин Иван <[email protected]> wrote:
>> > > > > >
>> > > > > > > Yuriy,
>> > > > > > >
>> > > > > > > Greatly appreciate your interest.
>> > > > > > >
>> > > > > > > Could you please elaborate a little bit about sorting? What
>> > > > > > > tasks does it help to solve, and how? It would be great if you
>> > > > > > > could provide an example.
>> > > > > > >
>> > > > > > > On Wed, Sep 18, 2019 at 09:39, Alexei Scherbakov <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > Denis,
>> > > > > > > >
>> > > > > > > > I like the idea of throwing an exception when text queries
>> > > > > > > > are enabled on persistent caches.
>> > > > > > > >
>> > > > > > > > I'm also fine with the proposed limit for unsorted searches.
>> > > > > > > >
>> > > > > > > > Yury, please proceed with ticket creation.
>> > > > > > > >
>> > > > > > > > On Tue, Sep 17, 2019, 22:06 Denis Magda <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > Igniters,
>> > > > > > > > >
>> > > > > > > > > I see nothing wrong with Yury's proposal regarding the
>> > > > > > > > > full-text search API evolution as long as Yury is ready to
>> > > > > > > > > push it forward.
>> > > > > > > > >
>> > > > > > > > > As for the in-memory mode only, it makes total sense for
>> > > > > > > > > in-memory data grid deployments where Ignite caches data of
>> > > > > > > > > an underlying DB like Postgres.
>> > > > > > > > > As part of the changes, I would simply throw an exception
>> > > > > > > > > (by default) if one attempts to use text indices with native
>> > > > > > > > > persistence enabled.
>> > > > > > > > > If the person is ready to live with that limitation, then an
>> > > > > > > > > explicit configuration change is needed to get around the
>> > > > > > > > > exception.
>> > > > > > > > >
>> > > > > > > > > Thoughts?
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > -
>> > > > > > > > > Denis
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <[email protected]> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hello to all again,
>> > > > > > > > > >
>> > > > > > > > > > Thank you for the important comments and notes below!
>> > > > > > > > > >
>> > > > > > > > > > Let me answer and continue the discussion.
>> > > > > > > > > >
>> > > > > > > > > > (I) Overall need for Lucene indexing
>> > > > > > > > > >
>> > > > > > > > > > Alexei referred to
>> > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5371, where
>> > > > > > > > > > the absence of index persistence was declared an obstacle
>> > > > > > > > > > to further development.
>> > > > > > > > > >
>> > > > > > > > > > a) This ticket is already closed as not valid.
>> > > > > > > > > > b) There is a definite need (in our project as well) for
>> > > > > > > > > > purely in-memory indexing of selected data.
>> > > > > > > > > > We intend to use search capabilities for fetching a limited
>> > > > > > > > > > number of records to be used in type-ahead search /
>> > > > > > > > > > suggestions.
>> > > > > > > > > > Not all of the data will be indexed, and there is no need
>> > > > > > > > > > for the Lucene index to be persistent. I hope this is a
>> > > > > > > > > > common pattern of text-search usage.
>> > > > > > > > > >
>> > > > > > > > > > (II) Necessary fixes in the current implementation
>> > > > > > > > > >
>> > > > > > > > > > a) Implementation of a correct *limit* (*offset* does not
>> > > > > > > > > > seem to be required in text-search tasks for now)
>> > > > > > > > > > I have investigated the data flow for distributed text
>> > > > > > > > > > queries. It was a simple test prefix query, like
>> > > > > > > > > > 'name'*='ene*'*
>> > > > > > > > > > For now each server node returns all response records to
>> > > > > > > > > > the client node, and the response may contain thousands or
>> > > > > > > > > > hundreds of thousands of records, even if we need only the
>> > > > > > > > > > first 10-100. Again, all the results are added to the queue
>> > > > > > > > > > in GridCacheQueryFutureAdapter in arbitrary order, page by
>> > > > > > > > > > page.
>> > > > > > > > > > I did not find any means here to deliver a deterministic
>> > > > > > > > > > result.
>> > > > > > > > > > So implementing limit as part of the query (and of
>> > > > > > > > > > GridCacheQueryRequest) will not change the nature of the
>> > > > > > > > > > response, but will limit the load on nodes and networking.
>> > > > > > > > > >
>> > > > > > > > > > Can we consider opening a ticket for this?
>> > > > > > > > > >
>> > > > > > > > > > (III) Further exposure of the Lucene API to Ignite
>> > > > > > > > > >
>> > > > > > > > > > a) Sorting
>> > > > > > > > > > The solution for this could be:
>> > > > > > > > > > - Make entities comparable
>> > > > > > > > > > - Add a custom comparator to the entity
>> > > > > > > > > > - Add annotations to mark sorted fields for Lucene indexing
>> > > > > > > > > > - Use comparators when merging responses or reducing to the
>> > > > > > > > > >   desired limit on the client node.
>> > > > > > > > > > This will require the full result set to be loaded into
>> > > > > > > > > > memory, though it can be used for relatively small limits.
>> > > > > > > > > > BR,
>> > > > > > > > > > Yuriy Shuliha
>> > > > > > > > > >
>> > > > > > > > > > On Fri, Aug 30, 2019 at 10:37 Alexei Scherbakov <[email protected]> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Yuriy,
>> > > > > > > > > > >
>> > > > > > > > > > > Note that one of the major blockers for text queries is
>> > > > > > > > > > > [1], which makes Lucene indexes unusable with persistence
>> > > > > > > > > > > and is the main reason for discontinuation.
>> > > > > > > > > > > It should probably be addressed first to make text queries
>> > > > > > > > > > > a valid product feature.
>> > > > > > > > > > >
>> > > > > > > > > > > Distributed sorting and advanced querying is indeed not a
>> > > > > > > > > > > trivial task. Some kind of merging must be implemented on
>> > > > > > > > > > > the query originating node.
>> > > > > > > > > > >
>> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371
>> > > > > > > > > > >
>> > > > > > > > > > > On Thu, Aug 29, 2019 at 23:38, Denis Magda <[email protected]> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Yuriy,
>> > > > > > > > > > > >
>> > > > > > > > > > > > If you are ready to take over the full-text search
>> > > > > > > > > > > > indexes then please go ahead. The primary reasons why the
>> > > > > > > > > > > > community wants to discontinue them first (and, probably,
>> > > > > > > > > > > > resurrect later) are the limitations listed by Andrey and
>> > > > > > > > > > > > minimal support from the community end.
>> > > > > > > > > > > >
>> > > > > > > > > > > > -
>> > > > > > > > > > > > Denis
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <[email protected]> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi Yuriy,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Unfortunately, there is a plan to discontinue
>> > > > > > > > > > > > > TextQueries in Ignite [1].
>> > > > > > > > > > > > > The motivation is that text indexes are not persistent,
>> > > > > > > > > > > > > not transactional and can't be used together with SQL or
>> > > > > > > > > > > > > inside SQL, and there is a lack of interest from the
>> > > > > > > > > > > > > community side.
>> > > > > > > > > > > > > You are welcome to take on these issues and make
>> > > > > > > > > > > > > TextQueries great.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 1. PageSize can't be used to limit the result set.
>> > > > > > > > > > > > > Query results are returned from a data node to the
>> > > > > > > > > > > > > client-side cursor in a page-by-page manner, and this
>> > > > > > > > > > > > > parameter is designed to control the page size. It is
>> > > > > > > > > > > > > supposed that the query executes lazily on the server
>> > > > > > > > > > > > > side, and the full result set is not expected to be
>> > > > > > > > > > > > > loaded into memory on the server side at once, but page
>> > > > > > > > > > > > > by page.
>> > > > > > > > > > > > > Do you mean you found that Lucene loads the entire result
>> > > > > > > > > > > > > set into memory before the first page is sent to the
>> > > > > > > > > > > > > client?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I'd think a new parameter should be added to limit the
>> > > > > > > > > > > > > result. The best solution is to use query language
>> > > > > > > > > > > > > commands for this, e.g. "LIMIT/OFFSET" in SQL.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > This task doesn't look trivial. A query is a distributed
>> > > > > > > > > > > > > operation: the same user query will be executed on the
>> > > > > > > > > > > > > data nodes, and then the results from all nodes should be
>> > > > > > > > > > > > > correctly merged before being returned via the client
>> > > > > > > > > > > > > cursor.
>> > > > > > > > > > > > > So, LIMIT should be applied on every node and then again
>> > > > > > > > > > > > > in the merge phase.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Also, this may be non-obvious: limiting results makes no
>> > > > > > > > > > > > > sense without sorting, as there is no guarantee that
>> > > > > > > > > > > > > every subsequent query run will return the same data,
>> > > > > > > > > > > > > because of page reordering.
>> > > > > > > > > > > > > Basically, the merge phase receives results from data
>> > > > > > > > > > > > > nodes asynchronously, and messages from different nodes
>> > > > > > > > > > > > > can't be ordered.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 2.
>> > > > > > > > > > > > > a. The "tokenize" param name (for @QueryTextField) looks
>> > > > > > > > > > > > > more verbose, doesn't it?
>> > > > > > > > > > > > > b, c. What about distributed queries? How will partial
>> > > > > > > > > > > > > results from nodes be merged?
>> > > > > > > > > > > > > Does Lucene allow configuring a comparator for data
>> > > > > > > > > > > > > sorting?
>> > > > > > > > > > > > > What comparator should Ignite choose to sort results in
>> > > > > > > > > > > > > the merge phase?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 3. For now the Lucene engine is not configurable at all.
>> > > > > > > > > > > > > E.g. it is impossible to configure the Tokenizer.
>> > > > > > > > > > > > > I'd think about possible ways to configure the engine
>> > > > > > > > > > > > > first and only then go further to discuss/implement
>> > > > > > > > > > > > > complex features that may depend on the engine config.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <[email protected]> wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Dear community,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > With this thread I'd like to open a discussion that
>> > > > > > > > > > > > > > would lead to contributions in the subject area.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Ignite has indexing capabilities backed by different
>> > > > > > > > > > > > > > mechanisms, including Lucene.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (last year's release).
>> > > > > > > > > > > > > > This is a widespread and mature technology that covers
>> > > > > > > > > > > > > > the text search area and beyond (e.g. spatial data
>> > > > > > > > > > > > > > indexing).
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > My goal is to *expose more Lucene functionality to
>> > > > > > > > > > > > > > Ignite indexing and query mechanisms for text data*.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > It's quite a simple request at the current stage. It
>> > > > > > > > > > > > > > comes from our project's needs, but I believe it will
>> > > > > > > > > > > > > > be useful for a lot more people.
>> > > > > > > > > > > > > > Let's walk through the items and vote on or discuss
>> > > > > > > > > > > > > > Jira tickets for them.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 1. [trivial] Use dataQuery.getPageSize() to limit
>> > > > > > > > > > > > > > search response items inside GridLuceneIndex.query().
>> > > > > > > > > > > > > > Currently it calls
>> > > > > > > > > > > > > > IndexSearcher.search(query, *Integer.MAX_VALUE*), so
>> > > > > > > > > > > > > > basically all scored matches will be returned, which we
>> > > > > > > > > > > > > > do not need in most cases.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 2. [simple] Add sorting. Then a more capable search
>> > > > > > > > > > > > > > call can be executed:
>> > > > > > > > > > > > > > *IndexSearcher.search(query, count, sort)*
>> > > > > > > > > > > > > > Implementation steps:
>> > > > > > > > > > > > > > a) Introduce a boolean *sortField* parameter in the
>> > > > > > > > > > > > > > *@QueryTextField* annotation. If *true*, the field will
>> > > > > > > > > > > > > > be indexed but not tokenized. Numeric types are
>> > > > > > > > > > > > > > preferred here.
>> > > > > > > > > > > > > > b) Add a *sort* collection to the *TextQuery*
>> > > > > > > > > > > > > > constructor. It should define the desired sort fields
>> > > > > > > > > > > > > > used for querying.
>> > > > > > > > > > > > > > c) Implement Lucene sort usage in
>> > > > > > > > > > > > > > GridLuceneIndex.query().
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 3. [moderate] Build complex queries with *TextQuery*,
>> > > > > > > > > > > > > > including term/query boosting.
>> > > > > > > > > > > > > > *This section is for voting only, as it requires more
>> > > > > > > > > > > > > > detailed work. It should be extended if the community
>> > > > > > > > > > > > > > is interested in it.*
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Looking forward to your comments!
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > BR,
>> > > > > > > > > > > > > > Yuriy Shuliha
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > --
>> > > > > > > > > > > > > Best regards,
>> > > > > > > > > > > > > Andrey V. Mashenkov
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > --
>> > > > > > > > > > >
>> > > > > > > > > > > Best regards,
>> > > > > > > > > > > Alexei Scherbakov
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Best regards,
>> > > > > > > Ivan Pavlukhin
>> > > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Best regards,
>> > > > > Ivan Pavlukhin
>> > > > >
>> > > >
>> > >
>> >
>> >
>> > --
>> > Best regards,
>> > Andrey V. Mashenkov
>> >
>>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>
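
For readers following the limit/sorting discussion in this thread: below is a minimal sketch of the merge step on the query-originating node, assuming every data node has already applied the limit locally and returned its hits sorted by descending Lucene score. It is not Ignite's actual code (it does not use ReduceIndexSorted or MergeStreamIterator); the names ScoreMergeSketch, Hit, Cursor and mergeTopK are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: none of these types exist in Ignite or Lucene.
final class ScoreMergeSketch {
    /** One search hit: a cache key plus the relevance score reported by a node. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Read position inside one node's response (assumed sorted by descending score). */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /** K-way merge: returns at most {@code limit} hits with the highest scores. */
    static List<Hit> mergeTopK(List<List<Hit>> perNode, int limit) {
        // Max-heap of cursors ordered by the score of each cursor's current head.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            (a, b) -> Float.compare(b.head().score, a.head().score));

        for (List<Hit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<Hit> out = new ArrayList<>();

        while (out.size() < limit && !heap.isEmpty()) {
            Cursor c = heap.poll();     // cursor whose head has the best score
            out.add(c.head());
            c.pos++;                    // advance only while the cursor is outside the heap

            if (c.pos < c.hits.size())
                heap.add(c);            // re-insert with its new head
        }

        return out;
    }

    public static void main(String[] args) {
        List<List<Hit>> perNode = Arrays.asList(
            Arrays.asList(new Hit("IBM", 0.9f), new Hit("Intel", 0.4f)),
            Arrays.asList(new Hit("Apple", 0.7f), new Hit("AMD", 0.1f)));

        for (Hit h : mergeTopK(perNode, 3))
            System.out.println(h.key + " " + h.score);
    }
}
```

Because each node's list is already sorted and truncated to the limit, the global top `limit` hits are always contained in the union of the per-node lists, so the merge only needs to look at the head of each list. One caveat, stated as an assumption rather than a fact about Ignite: scores computed against separate per-node Lucene indexes may not be strictly comparable, since term statistics differ between indexes.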
