Re: Vamana greedy search variant

2023-08-05 Thread jim ferenczi
Hi Jonathan,

Could you provide further clarification on your goal? The current
description is unclear.

Why construct an HNSW graph only to 'optimize' it into a Vamana graph? Why
not directly build a Vamana graph? This paper provides guidance for streaming
graph construction.

An intriguing feature of the Vamana graph is its potential for more
efficient merging of multiple graphs compared to rebuilding the entire HNSW
graph for each merge. While I'm unsure of the process, it seems simpler to
work within a single level.
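
For a sense of why the single level helps, here is a rough sketch (hypothetical
types, not actual Lucene code) of a naive merge of two flat graphs: re-base the
node ids of one graph, concatenate the adjacency lists, then reconnect:

    // Sketch only: a flat graph represented as int[][] adjacency lists.
    static int[][] mergeSingleLevel(int[][] g1, int[][] g2) {
      int offset = g1.length;
      int[][] merged = new int[g1.length + g2.length][];
      for (int node = 0; node < g1.length; node++) {
        merged[node] = g1[node].clone();            // ids unchanged
      }
      for (int node = 0; node < g2.length; node++) {
        int[] neighbors = g2[node].clone();
        for (int i = 0; i < neighbors.length; i++) {
          neighbors[i] += offset;                   // re-base ids
        }
        merged[offset + node] = neighbors;
      }
      // The real work remains: reconnect the two halves, e.g. by re-running a
      // Vamana-style prune per node with candidates drawn from both sides.
      return merged;
    }

With HNSW the same operation would have to keep every level of the hierarchy
consistent, which is why rebuilding is the usual answer.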

I'm also concerned about using HNSW code beyond the index writer. Flushing
to disk is integral to the index codec. It might be advisable to employ a
standard index writer rather than extracting code for external use,
potentially disrupting the indexing chain.

Best regards,
Jim

On Sat, 5 Aug 2023 at 22:47, Jonathan Ellis  wrote:

> Hi all,
>
> I put FINGER on pause to try out different graph construction methods.
> Vamana (the in-memory component of DiskANN) looks interesting, especially
> because it's a two-pass algorithm and the first pass is very similar to
> "build L0 of hnsw."  So there is potentially a good fit there for using
> HNSW in Cassandra to index and query vectors "online" but when we flush to
> disk, optimize it with Vamana for about half as much work as if we started
> from zero.
>
> Here's what I'm seeing on the nytimes-256 dataset, at two different beam
> widths:
>
> HNSW: top 100 recall 0.8166, build 42.98s, query 14.89s. 42148760 nodes
> visited
>
> Vamana@2.0: top 100 recall 0.9050, build 15.71s, query 13.03s. 39525880
> nodes visited
>
> Not bad, right?  But here's the thing: that's not using exactly the same
> search algorithm.  When I point the HnswGraphSearcher.searchLevel at the
> Vamana graph, it's much worse:
>
> Vamana@2.0: top 100 recall 0.7323, build 16.17s, query 9.45s. 44599250
> nodes visited
>
> It is faster, mostly (?) because I've barely put any time into optimizing
> the Vamana search method.  But recall is much worse.
>
> I've attached clips of the algorithm descriptions.  Setting aside a bit of
> noise, the main difference is that hnsw says, "stop expanding nodes when
> none of the candidates are closer to the query than the worst of the topk
> found so far."  While Vamana keeps the topk results and candidates in the
> *same* set [1] and says, "stop expanding nodes when none of this top L set
> are unvisited" (where L >= k).  In the results above I manually set L to
> 4*k, since that makes the visited counts close to even.
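
For illustration, the two termination rules look roughly like this (hypothetical
names, not the actual HnswGraphSearcher or Vamana code):

    // HNSW-style: separate candidate heap and top-k result heap; stop when
    // the best unexpanded candidate cannot improve the current top k.
    while (!candidates.isEmpty()) {
      Candidate c = candidates.peekNearest();
      if (results.size() >= topK && c.distance > results.worstDistance()) {
        break;                          // nothing left can enter the top k
      }
      expand(candidates.popNearest());  // pushes neighbors onto both heaps
    }

    // Vamana-style: one fixed-size list of the best L >= k nodes seen so far;
    // stop only when every entry in that list has been expanded.
    while (topL.hasUnvisited()) {
      Candidate c = topL.nearestUnvisited();
      c.visited = true;
      expand(c);                        // inserts neighbors, evicts the farthest
    }

Because results and candidates share one set, a promising-but-distant node stays
eligible for expansion as long as it survives in the top L, which effectively
widens the search.
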
> Does anyone have any intuition for why, when pointed at the same graph,
> this does better with the same (slightly fewer) nodes visited?
>
> If you'd rather play with the code, rough cut is here:
> https://github.com/jbellis/lucene/tree/concurrent-vamana, search code is
> VamanaGraphBuilder.greedySearch.  Test harness branch at
> https://github.com/jbellis/hnswdemo/tree/vamana.
>
> [1] I implemented this as FixedNeighborArray.
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org


Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-09 Thread jim ferenczi
Lucene is a library. I don’t see how it would be exposed in this plugin
which is about services.


On Tue, 9 May 2023 at 18:00, Jun Luo  wrote:

> The PR mentioned an Elasticsearch PR that increased the
> max dimension to 2048 in Elasticsearch.
>
> Curious how you use Lucene's KNN search. Lucene's KNN supports one vector
> per document, but a document's content usually needs multiple vectors. We
> will have to split the document content into chunks and create one Lucene
> document per chunk.
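
For reference, the chunk-per-document pattern looks roughly like this (a sketch
assuming Lucene 9.4+; embed() and the field names are placeholders):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;

    // One Lucene document per chunk, tied back to its parent by doc_id.
    void indexChunks(IndexWriter writer, String parentId, List<String> chunks)
        throws IOException {
      for (String chunk : chunks) {
        Document doc = new Document();
        doc.add(new StringField("doc_id", parentId, Field.Store.YES));
        doc.add(new StoredField("chunk_text", chunk));
        // embed(): your embedding model call (placeholder), returns float[]
        doc.add(new KnnFloatVectorField("embedding", embed(chunk),
            VectorSimilarityFunction.COSINE));
        writer.addDocument(doc);
      }
    }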
>
> The ChatGPT plugin directly stores the chunk text in the underlying vector DB.
> If there are lots of documents, will it be a concern to store the full
> document content in Lucene? In the traditional inverted index use case, is
> it common to store the full document content in Lucene?
>
> Another question: if you use Lucene as a vector DB, do you still need the
> inverted index? Wondering what the use case would be for combining an
> inverted index with a vector index. If we don't need the inverted index,
> would it be better to use another vector DB? For example, PostgreSQL also
> added vector support recently.
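
On combining the two: the KNN query accepts a filter query, so inverted-index
constraints and vector similarity compose in a single search (again a sketch
assuming Lucene 9.4+; the field names, queryVector, and searcher are
placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    Query filter = new TermQuery(new Term("lang", "en"));  // inverted index
    Query knn = new KnnFloatVectorQuery("embedding", queryVector, 10, filter);
    TopDocs hits = searcher.search(knn, 10);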
>
> Thanks,
> Jun
>
> On Sat, May 6, 2023 at 1:44 PM Michael Wechner 
> wrote:
>
>> There is already a pull request for Elasticsearch which also
>> mentions the max size of 1024
>>
>> https://github.com/openai/chatgpt-retrieval-plugin/pull/83
>>
>>
>>
>> Am 06.05.23 um 19:00 schrieb Michael Wechner:
>> > Hi Together
>> >
>> > I recently set up the ChatGPT retrieval plugin locally
>> >
>> > https://github.com/openai/chatgpt-retrieval-plugin
>> >
>> > I think it would be nice to consider submitting a Lucene implementation
>> > for this plugin
>> >
>> > https://github.com/openai/chatgpt-retrieval-plugin#future-directions
>> >
>> > By default, the plugin uses OpenAI's model "text-embedding-ada-002"
>> > with 1536 dimensions
>> >
>> > https://openai.com/blog/new-and-improved-embedding-model
>> >
>> > which means one won't be able to use it out-of-the-box with Lucene.
>> >
>> > Similar request here
>> >
>> >
>> https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions
>> >
>> >
>> > I understand we just recently had a lengthy discussion about
>> > increasing the max dimension, and whatever one thinks of OpenAI, the
>> > fact is that it has a huge impact, and I think it would be nice for
>> > Lucene to be part of this "revolution". All we have to do is increase
>> > the limit from 1024 to 1536 or even 2048, for example.
>> >
>> > Since the performance seems to be linear in the vector dimension,
>> > several members have done performance tests successfully, and 1024
>> > seems to have been chosen as the max dimension quite arbitrarily in
>> > the first place, I think it should not be a problem to increase the
>> > max dimension by a factor of 1.5 or 2.
>> >
>> > WDYT?
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
> Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.

Yep, I agree. Personally I don't see how we can solve this without prior
knowledge of the vectors. Faiss has a nice implementation that fits
naturally with Lucene called IVF (
https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html),
but if we want to avoid running kmeans on every merge we'd need to provide
the clusters for the entire index before indexing the first vector.
It's a complex issue…
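
To make the IVF idea concrete, a toy sketch (illustrating the concept, not the
Faiss API): each vector is bucketed under its nearest centroid at index time,
and a query scans only the few closest buckets.

    // Assumes the centroids were learned up front (e.g. by kmeans), which is
    // exactly the prior knowledge mentioned above.
    static int assign(float[] v, float[][] centroids) {
      int best = 0;
      float bestDist = Float.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        float d = squaredDistance(v, centroids[c]);
        if (d < bestDist) { bestDist = d; best = c; }
      }
      return best;  // the vector goes into posting list `best`
    }

    static float squaredDistance(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        float diff = a[i] - b[i];
        sum += diff * diff;
      }
      return sum;
    }

At query time the same assignment selects the nprobe nearest centroids and only
those posting lists are scanned, which is why the clusters must exist before
the first vector is indexed.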

On Fri, 7 Apr 2023 at 22:58, Robert Muir  wrote:

> Personally i'd have to re-read the paper, but in general the merging
> issue has to be addressed somehow to fix the overall indexing time
> problem. It seems it gets "dodged" with huge rambuffers in the emails
> here.
> Keep in mind, there may be other ways to do it. In general if merging
> something is going to be "heavyweight", we should think about it to
> prevent things from going really bad overall.
>
> As an example, I'm most familiar with adding DEFLATE compression to
> stored fields. Previously, we'd basically decompress and recompress
> the stored fields on merge, and LZ4 is so fast that it wasn't
> obviously a problem. But with DEFLATE it got slower/heavier (more
> intense compression algorithm), something had to be done or indexing
> would be unacceptably slow. Hence if you look at storedfields writer,
> there is "dirtiness" logic etc so that recompression is amortized over
> time and doesn't happen on every merge.
>
> On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi 
> wrote:
> >
> > I am also not sure that diskann would solve the merging issue. The idea
> described in the paper is to run kmeans first to create multiple graphs, one
> per cluster. In our case the vectors in each segment could belong to
> different clusters, so I don’t see how we could merge them efficiently.
> >
> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi 
> wrote:
> >>
> >> The inference time (and cost) to generate these big vectors must be
> quite large too ;).
> >> Regarding the ram buffer, we could drastically reduce the size by
> writing the vectors on disk instead of keeping them in the heap. With 1k
> dimensions the ram buffer is filled with these vectors quite rapidly.
> >>
> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:
> >>>
> >>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov 
> wrote:
> >>> >
> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer
> size=1994)
> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
> size=1994)
> >>> >
> >>> > Robert, since you're the only on-the-record veto here, does this
> >>> > change your thinking at all, or if not could you share some test
> >>> > results that didn't go the way you expected? Maybe we can find some
> >>> > mitigation if we focus on a specific issue.
> >>> >
> >>>
> >>> My scale concerns are both space and time. What does the execution
> >>> time look like if you don't set insanely large IW rambuffer? The
> >>> default is 16MB. Just concerned we're shoving some problems under the
> >>> rug :)
> >>>
> >>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> >>> to index 4M documents with these 2k vectors. Whereas you'd measure
> >>> this in seconds with typical lucene indexing; it's nothing.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
> It is designed to build an in-memory datastructure and "merge" means
"rebuild".

The main idea imo in the diskann paper is to build the graph with the full
dimensions to preserve the quality of the neighbors. At query time it uses
the reduced dimensions (via product quantization) to compute the
similarity, thus reducing the RAM required by a large factor. This is
something we could do with the current implementation. I think that Michael
tested something similar with quantization, but when applied at build
time too it reduces the quality of the graph and the overall recall.
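
A sketch of the query-time trick (generic product quantization, not DiskANN's
exact code): with m subspaces and 256 codebook entries each, a vector is stored
as m bytes and a distance becomes m table lookups.

    // distTable[s][c] = distance between the query's s-th sub-vector and
    // codebook entry c of subspace s, precomputed once per query.
    static float pqDistance(byte[] code, float[][] distTable) {
      float sum = 0f;
      for (int s = 0; s < code.length; s++) {
        sum += distTable[s][code[s] & 0xFF];
      }
      return sum;
    }

The graph is built once at full precision, while searches touch only the m-byte
codes, which is where the large RAM saving comes from.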

On Fri, 7 Apr 2023 at 22:36, jim ferenczi  wrote:

> I am also not sure that diskann would solve the merging issue. The idea
> described in the paper is to run kmeans first to create multiple graphs, one
> per cluster. In our case the vectors in each segment could belong to
> different clusters, so I don’t see how we could merge them efficiently.
>
> On Fri, 7 Apr 2023 at 22:28, jim ferenczi  wrote:
>
>> The inference time (and cost) to generate these big vectors must be quite
>> large too ;).
>> Regarding the ram buffer, we could drastically reduce the size by writing
>> the vectors on disk instead of keeping them in the heap. With 1k dimensions
>> the ram buffer is filled with these vectors quite rapidly.
>>
>> On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:
>>
>>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov 
>>> wrote:
>>> >
>>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>>> size=1994)
>>> >
>>> > Robert, since you're the only on-the-record veto here, does this
>>> > change your thinking at all, or if not could you share some test
>>> > results that didn't go the way you expected? Maybe we can find some
>>> > mitigation if we focus on a specific issue.
>>> >
>>>
>>> My scale concerns are both space and time. What does the execution
>>> time look like if you don't set insanely large IW rambuffer? The
>>> default is 16MB. Just concerned we're shoving some problems under the
>>> rug :)
>>>
>>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>>> to index 4M documents with these 2k vectors. Whereas you'd measure
>>> this in seconds with typical lucene indexing; it's nothing.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
I am also not sure that diskann would solve the merging issue. The idea
described in the paper is to run kmeans first to create multiple graphs, one
per cluster. In our case the vectors in each segment could belong to
different clusters, so I don’t see how we could merge them efficiently.

On Fri, 7 Apr 2023 at 22:28, jim ferenczi  wrote:

> The inference time (and cost) to generate these big vectors must be quite
> large too ;).
> Regarding the ram buffer, we could drastically reduce the size by writing
> the vectors on disk instead of keeping them in the heap. With 1k dimensions
> the ram buffer is filled with these vectors quite rapidly.
>
> On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:
>
>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov 
>> wrote:
>> >
>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
>> size=1994)
>> >
>> > Robert, since you're the only on-the-record veto here, does this
>> > change your thinking at all, or if not could you share some test
>> > results that didn't go the way you expected? Maybe we can find some
>> > mitigation if we focus on a specific issue.
>> >
>>
>> My scale concerns are both space and time. What does the execution
>> time look like if you don't set insanely large IW rambuffer? The
>> default is 16MB. Just concerned we're shoving some problems under the
>> rug :)
>>
>> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
>> to index 4M documents with these 2k vectors. Whereas you'd measure
>> this in seconds with typical lucene indexing; it's nothing.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread jim ferenczi
The inference time (and cost) to generate these big vectors must be quite
large too ;).
Regarding the ram buffer, we could drastically reduce the size by writing
the vectors on disk instead of keeping them in the heap. With 1k dimensions
the ram buffer is filled with these vectors quite rapidly.

On Fri, 7 Apr 2023 at 21:59, Robert Muir  wrote:

> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov  wrote:
> >
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
>
> My scale concerns are both space and time. What does the execution
> time look like if you don't set insanely large IW rambuffer? The
> default is 16MB. Just concerned we're shoving some problems under the
> rug :)
>
> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> to index 4M documents with these 2k vectors. Whereas you'd measure
> this in seconds with typical lucene indexing; it's nothing.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Welcome Julie Tibshirani to the Lucene PMC

2021-11-30 Thread jim ferenczi
Congrats and welcome Julie!

Le mar. 30 nov. 2021 à 22:49, Adrien Grand  a écrit :

> I'm pleased to announce that Julie Tibshirani has accepted an invitation
> to join the Lucene PMC!
>
> Congratulations Julie, and welcome aboard!
>
> --
> Adrien
>


Re: Accessibility of CollectedSearchGroup's state

2021-10-14 Thread jim ferenczi
I agree, we should have a SinglePassGroupingCollector in Elasticsearch and
reduce the visibility of these expert classes in Lucene.
As it stands today, the FirstPassGroupingCollector could be a final class
imo.


Le jeu. 14 oct. 2021 à 18:42, Adrien Grand  a écrit :

> I feel sorry for increasing the scope of all these requests for changes
> that you make, but the way Elasticsearch overrides this collector feels
> wrong to me as any change in the implementation details of this collector
> would probably break Elasticsearch's collector too. In my opinion,
> CollectedSearchGroup should not even be public. My preference would be to
> copy this collector to the Elasticsearch code base and fold the changes
> from Elasticsearch's CollapsingTopDocsCollector into it. I'm not super
> familiar with this code, so I might be missing something. Maybe Jim or Alan
> have an opinion.
>
> On Thu, Oct 14, 2021 at 1:48 PM Chris Hegarty
>  wrote:
>
>> In an effort to prepare Elasticsearch for modularization, we are
>> investigating and eliminating split packages. The situation has improved
>> through recent refactoring in Lucene 9.0 [1], but a number of split
>> packages still remain. This message identifies one such so that it can
>> be discussed in isolation, with a view to a potential solution either in
>> Lucene or possibly within Elasticsearch itself.
>>
>> Elasticsearch has a collapsing search collector that groups documents
>> based on field values and collapses based on the top sorted documents,
>> `CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
>> `FirstPassGroupingCollector` [3], and extends its functionality to
>> get the top docs in just a single pass. As a subclass, the CTDC
>> leverages the sorted top N unique groups by means of the protected
>> `FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>>`),
>> when performing the collapsing. Specifically, the
>> `CollectedSearchGroup.topDoc` field is of interest in order to retrieve
>> the number of the top document. The `topDoc` field is package-private
>> and therefore not normally accessible to the CTDC (without resorting to
>> nasty tricks!).
>>
>> Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
>> as a set of `CollectedSearchGroup`, and that CSG is a public class, it
>> would appear that the lack of public access to its state is likely an
>> oversight, rather than a deliberate design decision (otherwise, from
>> outside the package CSG adds no apparent value and appears as if a
>> marker interface, which is not useful to any subclasses).
>>
>> Minimally, the elasticsearch collector requires read access to the
>> `CollectedSearchGroup.topDoc` field. This could be achieved by adding
>> an accessor method to CSG, that returns the primitive int doc value.  But
>> this whole API seems fairly low-level and powerful (you should know what
>> you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
>> makes its state available through public fields, so maybe we could just
>> make CSG's state public too?
>>
>>
>> lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java
>>
>>  public class CollectedSearchGroup<T> extends SearchGroup<T> {
>> -  int topDoc;
>> -  int comparatorSlot;
>> +
>> +  /** The number of the top document. */
>> +  public int topDoc;
>> +
>> +  /** The field comparator slot. */
>> +  public int comparatorSlot;
>>  }
>>
>> -Chris.
>>
>> [1] https://issues.apache.org/jira/browse/LUCENE-9319
>> [2]
>> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
>> [3]
>> https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> Adrien
>


Re: Welcome Peter Gromov as Lucene committer

2021-04-07 Thread jim ferenczi
Welcome Peter!

Le mer. 7 avr. 2021 à 15:14, Ignacio Vera  a écrit :

> Welcome Peter!
>
> On Wed, Apr 7, 2021 at 3:10 PM Peter Gromov
>  wrote:
>
>> Thanks, that helped!
>>
>> On Wed, Apr 7, 2021 at 11:23 AM Dawid Weiss 
>> wrote:
>>
>>> See here -
>>> https://git.apache.org/setup/
>>>
>>> On Wed, Apr 7, 2021 at 10:52 AM Peter Gromov
>>>  wrote:
>>> >
>>> > I have 2FA, but nobody added me to the ASF organization. How does one
>>> do that?
>>> >
>>> > On Wed, Apr 7, 2021 at 10:09 AM Atri Sharma  wrote:
>>> >>
>>> >> Did you set your 2FA up and get added to the Apache Software
>>> >> Foundation Github org?
>>> >>
>>> >> On Wed, Apr 7, 2021 at 12:42 PM Peter Gromov
>>> >>  wrote:
>>> >> >
>>> >> > Thanks for the honor!
>>> >> >
>>> >> > (BTW I'm still not recognized by Github as having write access, and
>>> can't merge my pull requests :))
>>> >> >
>>> >> > > Peter, the tradition is that new committers introduce themselves
>>> with a brief bio.
>>> >> >
>>> >> > Okay, time for some bragging :) I've been working at JetBrains for
>>> some 17 years, most of them on IntelliJ platform, mainly supporting various
>>> languages and their infrastructure, analyzing snapshots and improving
>>> performance. Aiming to catch more bugs before they hit production, I've
>>> introduced property-based testing to IntelliJ by creating a small library
>>> called jetCheck. Recently I've switched to the Grazie project and now I do
>>> some rule-based computational linguistics there and enhance the IDE support
>>> for English. As Grazie needs LanguageTool and Hunspell, I've also spent
>>> some time rewriting the latter in Java (here in Lucene), and optimizing
>>> them both. In my free time, I like mountain hiking (Munich/Germany is a
>>> great location for that!), and some amateur piano/harmonica playing/singing.
>>> >>
>>> >> --
>>> >> Regards,
>>> >>
>>> >> Atri
>>> >> Apache Concerted
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>


Re: [VOTE] Release Lucene/Solr 8.8.0 RC1

2021-01-20 Thread jim ferenczi
+1

SUCCESS! [1:06:13.147855]

Le mer. 20 janv. 2021 à 08:39, Atri Sharma  a écrit :

> +1 (binding)
>
> SUCCESS! [1:04:15:20393]
>
> On Wed, Jan 20, 2021 at 1:03 PM Ignacio Vera  wrote:
> >
> > +1 (binding)
> >
> > SUCCESS! [1:05:30.358141]
> >
> >
> > On Tue, Jan 19, 2021 at 8:25 PM Timothy Potter 
> wrote:
> >>
> >> +1 (binding)
> >>
> >> SUCCESS! [1:07:15.796578]
> >>
> >>
> >> Also built a *local* Docker image from the RC and tested various
> features with the Solr operator on K8s, such as the updates to the Prom
> exporter & Grafana dashboard for query performance.
> >>
> >>
> >> Looks good!
> >>
> >>
> >> On Tue, Jan 19, 2021 at 12:06 PM Houston Putman <
> houstonput...@gmail.com> wrote:
> >>>
> >>> +1
> >>>
> >>> SUCCESS! [1:01:28.552891]
> >>>
> >>> On Tue, Jan 19, 2021 at 1:53 PM Cassandra Targett <
> casstarg...@gmail.com> wrote:
> 
>  I’ve put up the DRAFT version of the Ref Guide for 8.8:
> https://lucene.apache.org/solr/guide/8_8/.
> 
>  I also created the Jenkins job for building the 8.8 guide which
> pushes to the Nightlies server in case we have edits between now and
> release (https://nightlies.apache.org/Lucene/Solr-reference-guide-8.8/).
> 
>  Note branch_8_8 does not (yet) include the new Math Expressions guide
> being worked on in SOLR-13105. Still hoping that will make it, but thought
> I’d get this out sooner rather than later just in case.
>  On Jan 19, 2021, 10:51 AM -0600, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com>, wrote:
> 
>  Please vote for release candidate 1 for Lucene/Solr 8.8.0
> 
>  The artifacts can be downloaded from:
> 
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC1-rev737cb9c49b08f6e3964c1e8a80132da3c764e027
> 
>  You can run the smoke tester directly with this command:
> 
>  python3 -u dev-tools/scripts/smokeTestRelease.py \
> 
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC1-rev737cb9c49b08f6e3964c1e8a80132da3c764e027
> 
>  The vote will be open for at least 72 hours i.e. until 2021-01-22
> 17:00 UTC.
> 
>  [ ] +1  approve
>  [ ] +0  no opinion
>  [ ] -1  disapprove (and reason why)
> 
>  Here is my +1
>  
>
> --
> Regards,
>
> Atri
> Apache Concerted
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: 8.8 Release

2021-01-19 Thread jim ferenczi
Thanks Ishan, the change is merged in the 8.8 branch.
I should have said that it's not a blocker per se, so if you've already
built RC1 and want to proceed with it, please go ahead and I'll adapt the
change.


Le lun. 18 janv. 2021 à 16:30, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> a écrit :

> Oh, I just started the RC1.. I'm barely half way through the Solr tests,
> though :-)
> Jim, Please go ahead and backport it to the release branch, and I'll pick
> it up in another attempt.
> Thanks,
> Ishan
>
> p.s. I got a window of opportunity to get some work done, so I took over
> from Noble tonight.
>
> On Mon, Jan 18, 2021 at 7:24 PM jim ferenczi 
> wrote:
>
>> Sorry, I forgot to add the link to the issue:
>> https://issues.apache.org/jira/browse/LUCENE-9675
>>
>>
>> Le lun. 18 janv. 2021 à 14:33, jim ferenczi  a
>> écrit :
>>
>>> Hi Noble,
>>> I opened an issue to expose the compression mode that is used in binary
>>> doc values. The configurable compression is a new feature in 8.8 so we'd
>>> like to expose the compression mode that was used to write the segment in
>>> the attributes of the field.
>>> I'd like to backport to the 8.8 branch when it's ready so that we don't
>>> need to add a new internal version in the doc values format. Is it
>>> acceptable ?
>>>
>>> Cheers,
>>> Jim
>>>
>>> Le sam. 16 janv. 2021 à 00:15, Namgyu Kim  a écrit :
>>>
>>>> Take care, Ishan!
>>>> Family is the most important thing.
>>>>
>>>> Hi Noble,
>>>> Thank you for your support :D
>>>> The commit for the blocker issue(LUCENE-9661) that I mentioned before
>>>> is submitted on branch_8_8.
>>>> (github :
>>>> https://github.com/apache/lucene-solr/commit/d1eca6399131950c74de156a4fb564ba70bdfc86
>>>> )
>>>> (Apache gitbox :
>>>> https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d1eca63)
>>>> Please let me know if there is a problem!
>>>>
>>>> On Fri, Jan 15, 2021 at 6:37 PM Noble Paul 
>>>> wrote:
>>>>
>>>>> Family first.
>>>>>
>>>>> We got this covered Ishan
>>>>>
>>>>> On Fri, Jan 15, 2021 at 12:43 PM David Smiley 
>>>>> wrote:
>>>>> >
>>>>> > Wow. Good call on handing off the release duties so you can focus on
>>>>> family. Take care!
>>>>> >
>>>>> > On Thu, Jan 14, 2021 at 3:46 PM Ishan Chattopadhyaya <
>>>>> ichattopadhy...@gmail.com> wrote:
>>>>> >>
>>>>> >> Hi Devs,
>>>>> >>
>>>>> >> Just admitted mom to a hospital, she has given us a sudden health
>>>>> scare. I doubt I'll be around for this release or for some time now.
>>>>> >>
>>>>> >> I've spoken to Noble and requested him to take over the release
>>>>> duties here.
>>>>> >>
>>>>> >> Thanks and regards,
>>>>> >> Ishan
>>>>> >>
>>>>> >> On Thu, 14 Jan, 2021, 2:03 pm Noble Paul, 
>>>>> wrote:
>>>>> >>>
>>>>> >>> I've merged it.
>>>>> >>> Thanks
>>>>> >>>
>>>>> >>> On Thu, Jan 14, 2021 at 5:12 PM Ishan Chattopadhyaya
>>>>> >>>  wrote:
>>>>> >>> >
>>>>> >>> > Sure Noble, if you are confident it doesn't disrupt the
>>>>> stability of the release, please go ahead. In case there are any problems
>>>>> with stability/performance discovered (due to this issue) at the time of
>>>>> the release, I'll revert this. Thanks!
>>>>> >>> >
>>>>> >>> > On Thu, Jan 14, 2021 at 11:33 AM Noble Paul <
>>>>> noble.p...@gmail.com> wrote:
>>>>> >>> >>
>>>>> >>> >> @Ishan Chattopadhyaya
>>>>> >>> >>
>>>>> >>> >> I wish to port https://issues.apache.org/jira/browse/SOLR-14155
>>>>> to
>>>>> >>> >> 8.8, if it is OK
>>>>> >>> >>
>>>>> >>> >> On Wed, Jan 13, 2021 at 9:20 AM Ishan Chattopadhyaya
>>>>> >>> >>  wrote:
>>>>>

Re: 8.8 Release

2021-01-18 Thread jim ferenczi
Sorry, I forgot to add the link to the issue:
https://issues.apache.org/jira/browse/LUCENE-9675


Le lun. 18 janv. 2021 à 14:33, jim ferenczi  a
écrit :

> Hi Noble,
> I opened an issue to expose the compression mode that is used in binary
> doc values. The configurable compression is a new feature in 8.8 so we'd
> like to expose the compression mode that was used to write the segment in
> the attributes of the field.
> I'd like to backport to the 8.8 branch when it's ready so that we don't
> need to add a new internal version in the doc values format. Is it
> acceptable ?
>
> Cheers,
> Jim
>
> Le sam. 16 janv. 2021 à 00:15, Namgyu Kim  a écrit :
>
>> Take care, Ishan!
>> Family is the most important thing.
>>
>> Hi Noble,
>> Thank you for your support :D
>> The commit for the blocker issue(LUCENE-9661) that I mentioned before is
>> submitted on branch_8_8.
>> (github :
>> https://github.com/apache/lucene-solr/commit/d1eca6399131950c74de156a4fb564ba70bdfc86
>> )
>> (Apache gitbox :
>> https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d1eca63)
>> Please let me know if there is a problem!
>>
>> On Fri, Jan 15, 2021 at 6:37 PM Noble Paul  wrote:
>>
>>> Family first.
>>>
>>> We got this covered Ishan
>>>
>>> On Fri, Jan 15, 2021 at 12:43 PM David Smiley 
>>> wrote:
>>> >
>>> > Wow. Good call on handing off the release duties so you can focus on
>>> family. Take care!
>>> >
>>> > On Thu, Jan 14, 2021 at 3:46 PM Ishan Chattopadhyaya <
>>> ichattopadhy...@gmail.com> wrote:
>>> >>
>>> >> Hi Devs,
>>> >>
>>> >> Just admitted mom to a hospital, she has given us a sudden health
>>> scare. I doubt I'll be around for this release or for some time now.
>>> >>
>>> >> I've spoken to Noble and requested him to take over the release
>>> duties here.
>>> >>
>>> >> Thanks and regards,
>>> >> Ishan
>>> >>
>>> >> On Thu, 14 Jan, 2021, 2:03 pm Noble Paul, 
>>> wrote:
>>> >>>
>>> >>> I've merged it.
>>> >>> Thanks
>>> >>>
>>> >>> On Thu, Jan 14, 2021 at 5:12 PM Ishan Chattopadhyaya
>>> >>>  wrote:
>>> >>> >
>>> >>> > Sure Noble, if you are confident it doesn't disrupt the stability
>>> of the release, please go ahead. In case there are any problems with
>>> stability/performance discovered (due to this issue) at the time of the
>>> release, I'll revert this. Thanks!
>>> >>> >
>>> >>> > On Thu, Jan 14, 2021 at 11:33 AM Noble Paul 
>>> wrote:
>>> >>> >>
>>> >>> >> @Ishan Chattopadhyaya
>>> >>> >>
>>> >>> >> I wish to port https://issues.apache.org/jira/browse/SOLR-14155
>>> to
>>> >>> >> 8.8, if it is OK
>>> >>> >>
>>> >>> >> On Wed, Jan 13, 2021 at 9:20 AM Ishan Chattopadhyaya
>>> >>> >>  wrote:
>>> >>> >> >
>>> >>> >> > Wish you a very happy new year, Namgyu!
>>> >>> >> > Please proceed, thanks!
>>> >>> >> >
>>> >>> >> > On Wed, 13 Jan, 2021, 1:01 am Namgyu Kim, 
>>> wrote:
>>> >>> >> >>
>>> >>> >> >> Hi Ishan,
>>> >>> >> >> It's late, but happy new year!
>>> >>> >> >>
>>> >>> >> >> There is a blocker issue for 8.x (including 8.8)
>>> >>> >> >> https://issues.apache.org/jira/browse/LUCENE-9661
>>> >>> >> >>
>>> >>> >> >> Would it be okay to cherry-pick the commit for LUCENE-9661 to
>>> the 8.8 branch after merging?
>>> >>> >> >>
>>> >>> >> >> On Wed, Jan 13, 2021 at 1:48 AM Ishan Chattopadhyaya <
>>> ichattopadhy...@gmail.com> wrote:
>>> >>> >> >>>
>>> >>> >> >>> Please go for it if you think it won't disrupt the stability
>>> and can be wrapped up in 2-3 days.
>>> >>> >> >>>
>>> >>> >> >>> On Tue, 12 Jan, 2021, 10:08 pm Walter Underwo

Re: 8.8 Release

2021-01-18 Thread jim ferenczi
Hi Noble,
I opened an issue to expose the compression mode that is used in binary doc
values. The configurable compression is a new feature in 8.8 so we'd like
to expose the compression mode that was used to write the segment in the
attributes of the field.
I'd like to backport to the 8.8 branch when it's ready so that we don't
need to add a new internal version in the doc values format. Is it
acceptable?

Cheers,
Jim
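
For context, a rough sketch of how the per-field compression mode is selected
(class and enum names from memory of LUCENE-9378, so treat them as
approximate):

    import org.apache.lucene.codecs.*;
    import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
    import org.apache.lucene.codecs.lucene87.Lucene87Codec;
    import org.apache.lucene.index.IndexWriterConfig;

    // Pick BEST_COMPRESSION for binary doc values on every field.
    Codec codec = new Lucene87Codec() {
      @Override
      public DocValuesFormat getDocValuesFormatForField(String field) {
        return new Lucene80DocValuesFormat(
            Lucene80DocValuesFormat.Mode.BEST_COMPRESSION);
      }
    };
    IndexWriterConfig config = new IndexWriterConfig().setCodec(codec);

LUCENE-9675 then records the chosen mode in the field's attributes so that a
reader can tell which mode produced a given segment.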

Le sam. 16 janv. 2021 à 00:15, Namgyu Kim  a écrit :

> Take care, Ishan!
> Family is the most important thing.
>
> Hi Noble,
> Thank you for your support :D
> The commit for the blocker issue(LUCENE-9661) that I mentioned before is
> submitted on branch_8_8.
> (github :
> https://github.com/apache/lucene-solr/commit/d1eca6399131950c74de156a4fb564ba70bdfc86
> )
> (Apache gitbox :
> https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d1eca63)
> Please let me know if there is a problem!
>
> On Fri, Jan 15, 2021 at 6:37 PM Noble Paul  wrote:
>
>> Family first.
>>
>> We got this covered Ishan
>>
>> On Fri, Jan 15, 2021 at 12:43 PM David Smiley  wrote:
>> >
>> > Wow. Good call on handing off the release duties so you can focus on
>> family. Take care!
>> >
>> > On Thu, Jan 14, 2021 at 3:46 PM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>> >>
>> >> Hi Devs,
>> >>
>> >> Just admitted mom to a hospital, she has given us a sudden health
>> scare. I doubt I'll be around for this release or for some time now.
>> >>
>> >> I've spoken to Noble and requested him to take over the release duties
>> here.
>> >>
>> >> Thanks and regards,
>> >> Ishan
>> >>
>> >> On Thu, 14 Jan, 2021, 2:03 pm Noble Paul, 
>> wrote:
>> >>>
>> >>> I've merged it.
>> >>> Thanks
>> >>>
>> >>> On Thu, Jan 14, 2021 at 5:12 PM Ishan Chattopadhyaya
>> >>>  wrote:
>> >>> >
>> >>> > Sure Noble, if you are confident it doesn't disrupt the stability
>> of the release, please go ahead. In case there are any problems with
>> stability/performance discovered (due to this issue) at the time of the
>> release, I'll revert this. Thanks!
>> >>> >
>> >>> > On Thu, Jan 14, 2021 at 11:33 AM Noble Paul 
>> wrote:
>> >>> >>
>> >>> >> @Ishan Chattopadhyaya
>> >>> >>
>> >>> >> I wish to port https://issues.apache.org/jira/browse/SOLR-14155 to
>> >>> >> 8.8, if it is OK
>> >>> >>
>> >>> >> On Wed, Jan 13, 2021 at 9:20 AM Ishan Chattopadhyaya
>> >>> >>  wrote:
>> >>> >> >
>> >>> >> > Wish you a very happy new year, Namgyu!
>> >>> >> > Please proceed, thanks!
>> >>> >> >
>> >>> >> > On Wed, 13 Jan, 2021, 1:01 am Namgyu Kim, 
>> wrote:
>> >>> >> >>
>> >>> >> >> Hi Ishan,
>> >>> >> >> It's late, but happy new year!
>> >>> >> >>
>> >>> >> >> There is a blocker issue for 8.x (including 8.8)
>> >>> >> >> https://issues.apache.org/jira/browse/LUCENE-9661
>> >>> >> >>
>> >>> >> >> Would it be okay to cherry-pick the commit for LUCENE-9661 to
>> the 8.8 branch after merging?
>> >>> >> >>
>> >>> >> >> On Wed, Jan 13, 2021 at 1:48 AM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>> >>> >> >>>
>> >>> >> >>> Please go for it if you think it won't disrupt the stability
>> and can be wrapped up in 2-3 days.
>> >>> >> >>>
>> >>> >> >>> On Tue, 12 Jan, 2021, 10:08 pm Walter Underwood, <
>> wun...@wunderwood.org> wrote:
>> >>> >> 
>> >>> >>  I’d love for SOLR-15056 to be in, but it is just a patch now
>> and hasn’t had anything besides
>> >>> >>  local testing, so that is a bit of a long shot.
>> >>> >> 
>> >>> >>  wunder
>> >>> >>  Walter Underwood
>> >>> >>  wun...@wunderwood.org
>> >>> >>  http://observer.wunderwood.org/  (my blog)
>> >>> >> 
>> >>> >>  On Jan 11, 2021, at 9:29 AM, Timothy Potter <
>> thelabd...@gmail.com> wrote:
>> >>> >> 
>> >>> >>  15036 will be in later today, so you can plan to cut this
>> evening US time or tomorrow.
>> >>> >> 
>> >>> >>  Cheers,
>> >>> >>  Tim
>> >>> >> 
>> >>> >>  On Mon, Jan 11, 2021 at 9:54 AM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>> >>> >> >
>> >>> >> > I think all the issues mentioned above are in the branch,
>> except SOLR-15036. Tim, I'll cut a branch once that is in, latest by
>> Wednesday AM (USA time).
>> >>> >> > Thanks,
>> >>> >> > Ishan
>> >>> >> >
>> >>> >> > On Thu, Jan 7, 2021 at 5:07 AM Timothy Potter <
>> thelabd...@gmail.com> wrote:
>> >>> >> >>
>> >>> >> >> Thanks for following up on this Ishan ... I intend to get
>> SOLR-15059 and -15036 into 8.8 as well. I should have a proper PR up for
>> SOLR-15036 by Friday sometime, which seems to align with other's timeframes
>> >>> >> >>
>> >>> >> >> Cheers,
>> >>> >> >> Tim
>> >>> >> >>
>> >>> >> >> On Wed, Jan 6, 2021 at 6:54 AM David Smiley <
>> dsmi...@apache.org> wrote:
>> >>> >> >>>
>> >>> >> >>> Happy New Year!
>> >>> >> >>> I would much prefer that ensure 8.8 includes SOLR-14923 (a
>> bad nested docs performance issue)
>> >>> >> >>>
>> >>> >> >>> ~ David Smiley
>> 

Re: RFC: N-2 compatibility for file formats

2021-01-07 Thread jim ferenczi
The proposal is only about keeping the ability to read file formats up to
N-2. Everything that is done on top of the file format is not guaranteed
and should be supported on a best-effort basis.
That's an important aspect if we don't want to block innovation. So in
practice that means queries that require some specific file format, or
analyzers that change behavior in major versions, would not be part of the
extended guarantee.


Le mer. 6 janv. 2021 à 21:53, Yonik Seeley  a écrit :

> On Wed, Jan 6, 2021 at 4:40 AM Simon Willnauer 
> wrote:
>
>>  You can open a reader on an index created by
>> version N-2, but you cannot open an IndexWriter on it
>>
>
> +1
> There should definitely be more consideration given to back compat in
> general... it's caused a ton of pain to users over time.
>
> -Yonik
>
>
>


Re: Suggested query parser syntax change fuzzy and boost operators (term^3~2)

2020-09-18 Thread jim ferenczi
+1 to be more strict about the order of operators. That's a bug fix imo.

Le jeu. 17 sept. 2020 à 08:58, Dawid Weiss  a écrit :

> Just so that it's not overlooked. I suggest a cleanup of the
> (flexible?) query parser syntax in LUCENE-9528.
>
> In short, the current javacc code is a tangled mess that is hard to
> read, modify and make sense of.
>
> https://issues.apache.org/jira/browse/LUCENE-9528
>
> For example, these are all valid queries at the moment (flex qp):
>
> 1. assertQueryEquals("term~0.7", null, "term~1");
> 2. assertQueryEquals("term^3~", null, "(term~2)^3.0");
> 3. assertEquals(re, qp.parse("/http/~0.5", df));
>
> The thing is:
>
> 1) fuzzy (and slop) are integers. They shouldn't parse and accept
> floats, it's incorrect and misleading.
> 2) operator order in this case should matter: fuzzy should apply
> first, boost to any other expression underneath (it has a wider
> application than just term queries). This arbitrary-order syntax is
> hardcoded in the parser and is wrong. This parses, for example:
> term~3^3~4 and results in this query:
> 
> 3) Operators that don't apply to certain types of clauses should cause
> parser exceptions. Can you guess what the query "/http/~0.5" parses
> to? Looks like a regexp with a fuzzy factor, right? No, it parses to:
>
> 
>
> because regexps don't allow fuzziness.
>
> LUCENE-9528 cleans most of the above. The drawback: it is not a
> backwards-compatible change (arguably this fixes parser errors, not
> behavior).
>
> Speak up if you have an opinion about not changing the above.
>
> Dawid
>
> [1] https://en.wikipedia.org/wiki/Tears_in_rain_monologue
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread jim ferenczi
OK, so the more general question is whether we need an interval query parser.

Le jeu. 10 sept. 2020 à 17:28, Dawid Weiss  a écrit :

> I am fine with the boundary token suggestion, actually. What I don't
> see at the moment is how I can marry it with an output of a general
> query parser (which returns any Query). I could give an attempt to
> process the query node tree from standard query parser (which we're
> using at the moment anyway) but if the tree becomes complex there is
> no guarantee I can extract subtrees that can be parsed into
> IntervalSources (and then in turn into IntervalQuery).
>
> Dawid
>
> On Thu, Sep 10, 2020 at 4:28 PM jim ferenczi 
> wrote:
> >
> > Right, I misunderstood Alan's answer. The boundary option is not
> "impure" in my opinion. It solves this issue nicely but maybe it needs
> something more packaged to add the boundaries and build queries easily.
> >
> > Le jeu. 10 sept. 2020 à 16:16, Dawid Weiss  a
> écrit :
> >>
> >> Yup - similar to what Alan suggested. I'd have to rewrite the (general
> >> text-to-query) query parser to only use intervals though. Still
> >> thinking about possible approaches to this.
> >>
> >> D.
> >>
> >> On Thu, Sep 10, 2020 at 3:58 PM jim ferenczi 
> wrote:
> >> >
> >> > You could set a very high position increment gap for multi-valued
> fields (Analyzer#getPositionIncrementGap) and perform something
> >> > like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?
> >> >
> >> >
> >> > Le jeu. 10 sept. 2020 à 12:32, Dawid Weiss  a
> écrit :
> >> >>
> >> >> Yeah... I was thinking about adding synthetic boundaries but this
> >> >> seems... impure. :) Another quick reflection is that I'd have to
> >> >> somehow translate the original query (which can be arbitrarily
> >> >> complex) into an interval query. Tough.
> >> >>
> >> >> D.
> >> >>
> >> >> On Thu, Sep 10, 2020 at 12:22 PM Alan Woodward 
> wrote:
> >> >> >
> >> >> > I’ve solved this sort of thing in the past by indexing boundary
> tokens, and wrapping the queries with the equivalent of
> Intervals.notContaining(query, boundary-query); you could also put a very
> large position increment gap and use a width filter, but that’s a bit more
> error prone if you could conceivably have lots of text in the individual
> field entries.
> >> >> >
> >> >> > > On 10 Sep 2020, at 10:38, Dawid Weiss 
> wrote:
> >> >> > >
> >> >> > > Hi Alan,
> >> >> > >
> >> >> > > You're the expert here so I thought I'd ask before I jump in
> deep. Do
> >> >> > > you think it's feasible to solve the following multivalued-field
> >> >> > > problem:
> >> >> > >
> >> >> > > doc: field=["foo", "bar"]
> >> >> > > query: field:(foo AND bar)
> >> >> > >
> >> >> > > I'd like the above to return zero hits (no single value contains
> both
> >> >> > > foo and bar), but since multi-valued fields are logically
> indexed as a
> >> >> > > single field, it returns doc. I recognize this as a well known
> problem
> >> >> > > but subdocuments are not fun to deal with so I'd like to avoid
> them at
> >> >> > > all costs.
> >> >> > >
> >> >> > > Would it be possible to solve the above with intervals? Say,
> something
> >> >> > > like this:
> >> >> > >
> >> >> > > Intervals.containing(valuePositionRanges(), query).
> >> >> > >
> >> >> > > I assume the containment relationship would get rid of
> false-positives
> >> >> > > crossing value boundary here. The problem is in how to construct
> those
> >> >> > > value position ranges... Store them at index-construction time
> >> >> > > somehow? Compute them on the fly for anything that has a chance
> to
> >> >> > > match query? Your thoughts would be very appreciated.
> >> >> > >
> >> >> > > Dawid
> >> >> > >
> >> >> > >
> -
> >> >> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >> > >
> >> >> >
> >> >> >
> >> >> >
> -
> >> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >> >
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread jim ferenczi
Right, I misunderstood Alan's answer. The boundary option is not "impure"
in my opinion. It solves this issue nicely but maybe it needs something
more packaged to add the boundaries and build queries easily.

Le jeu. 10 sept. 2020 à 16:16, Dawid Weiss  a écrit :

> Yup - similar to what Alan suggested. I'd have to rewrite the (general
> text-to-query) query parser to only use intervals though. Still
> thinking about possible approaches to this.
>
> D.
>
> On Thu, Sep 10, 2020 at 3:58 PM jim ferenczi 
> wrote:
> >
> > You could set a very high position increment gap for multi-valued fields
> (Analyzer#getPositionIncrementGap) and perform something
> > like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?
> >
> >
> > Le jeu. 10 sept. 2020 à 12:32, Dawid Weiss  a
> écrit :
> >>
> >> Yeah... I was thinking about adding synthetic boundaries but this
> >> seems... impure. :) Another quick reflection is that I'd have to
> >> somehow translate the original query (which can be arbitrarily
> >> complex) into an interval query. Tough.
> >>
> >> D.
> >>
> >> On Thu, Sep 10, 2020 at 12:22 PM Alan Woodward 
> wrote:
> >> >
> >> > I’ve solved this sort of thing in the past by indexing boundary
> tokens, and wrapping the queries with the equivalent of
> Intervals.notContaining(query, boundary-query); you could also put a very
> large position increment gap and use a width filter, but that’s a bit more
> error prone if you could conceivably have lots of text in the individual
> field entries.
> >> >
> >> > > On 10 Sep 2020, at 10:38, Dawid Weiss 
> wrote:
> >> > >
> >> > > Hi Alan,
> >> > >
> >> > > You're the expert here so I thought I'd ask before I jump in deep.
> Do
> >> > > you think it's feasible to solve the following multivalued-field
> >> > > problem:
> >> > >
> >> > > doc: field=["foo", "bar"]
> >> > > query: field:(foo AND bar)
> >> > >
> >> > > I'd like the above to return zero hits (no single value contains
> both
> >> > > foo and bar), but since multi-valued fields are logically indexed
> as a
> >> > > single field, it returns doc. I recognize this as a well known
> problem
> >> > > but subdocuments are not fun to deal with so I'd like to avoid them
> at
> >> > > all costs.
> >> > >
> >> > > Would it be possible to solve the above with intervals? Say,
> something
> >> > > like this:
> >> > >
> >> > > Intervals.containing(valuePositionRanges(), query).
> >> > >
> >> > > I assume the containment relationship would get rid of
> false-positives
> >> > > crossing value boundary here. The problem is in how to construct
> those
> >> > > value position ranges... Store them at index-construction time
> >> > > somehow? Compute them on the fly for anything that has a chance to
> >> > > match query? Your thoughts would be very appreciated.
> >> > >
> >> > > Dawid
> >> > >
> >> > >
> -
> >> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> >> > >
> >> >
> >> >
> >> > -
> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Avoiding false-positives in multivalued field search with intervals?

2020-09-10 Thread jim ferenczi
You could set a very high position increment gap for multi-valued fields
(Analyzer#getPositionIncrementGap) and perform something
like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?
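
A rough sketch of the combination (Lucene intervals API; the exact maxwidth
signature is from memory, so double-check it):

    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queries.intervals.*;
    import org.apache.lucene.search.Query;

    final int GAP = 1000;  // larger than any single value's token count

    // Wrap the indexing analyzer so consecutive values sit GAP positions apart.
    Analyzer base = new StandardAnalyzer();
    Analyzer analyzer = new AnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {
      @Override protected Analyzer getWrappedAnalyzer(String fieldName) {
        return base;
      }
      @Override public int getPositionIncrementGap(String fieldName) {
        return GAP;
      }
    };

    // Matches only when foo and bar fall inside the same value.
    Query q = new IntervalQuery("field",
        Intervals.maxwidth(GAP - 1,
            Intervals.unordered(Intervals.term("foo"), Intervals.term("bar"))));

Alan's boundary-token variant in the quoted thread below achieves the same with
Intervals.notContaining(query, boundaryQuery), and is less sensitive to very
long individual values.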


Le jeu. 10 sept. 2020 à 12:32, Dawid Weiss  a écrit :

> Yeah... I was thinking about adding synthetic boundaries but this
> seems... impure. :) Another quick reflection is that I'd have to
> somehow translate the original query (which can be arbitrarily
> complex) into an interval query. Tough.
>
> D.
>
> On Thu, Sep 10, 2020 at 12:22 PM Alan Woodward 
> wrote:
> >
> > I’ve solved this sort of thing in the past by indexing boundary tokens,
> and wrapping the queries with the equivalent of
> Intervals.notContaining(query, boundary-query); you could also put a very
> large position increment gap and use a width filter, but that’s a bit more
> error prone if you could conceivably have lots of text in the individual
> field entries.
> >
> > > On 10 Sep 2020, at 10:38, Dawid Weiss  wrote:
> > >
> > > Hi Alan,
> > >
> > > You're the expert here so I thought I'd ask before I jump in deep. Do
> > > you think it's feasible to solve the following multivalued-field
> > > problem:
> > >
> > > doc: field=["foo", "bar"]
> > > query: field:(foo AND bar)
> > >
> > > I'd like the above to return zero hits (no single value contains both
> > > foo and bar), but since multi-valued fields are logically indexed as a
> > > single field, it returns doc. I recognize this as a well known problem
> > > but subdocuments are not fun to deal with so I'd like to avoid them at
> > > all costs.
> > >
> > > Would it be possible to solve the above with intervals? Say, something
> > > like this:
> > >
> > > Intervals.containing(valuePositionRanges(), query).
> > >
> > > I assume the containment relationship would get rid of false-positives
> > > crossing value boundary here. The problem is in how to construct those
> > > value position ranges... Store them at index-construction time
> > > somehow? Compute them on the fly for anything that has a chance to
> > > match query? Your thoughts would be very appreciated.
> > >
> > > Dawid
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-03 Thread jim ferenczi
A1 (binding)

Le jeu. 3 sept. 2020 à 07:09, Noble Paul  a écrit :

> A1, A2, D binding
>
> On Thu, Sep 3, 2020 at 7:22 AM Jason Gerlowski 
> wrote:
> >
> > A1, A2, D (binding)
> >
> > On Wed, Sep 2, 2020 at 10:47 AM Michael McCandless
> >  wrote:
> > >
> > > A2, A1, C5, D (binding)
> > >
> > > Thank you to everyone for working so hard to make such cool looking
> possible future Lucene logos!  And to Ryan for the challenging job of
> calling this VOTE :)
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Tue, Sep 1, 2020 at 4:21 PM Ryan Ernst  wrote:
> > >>
> > >> Dear Lucene and Solr developers!
> > >>
> > >> Sorry for the multiple threads. This should be the last one.
> > >>
> > >> In February a contest was started to design a new logo for Lucene
> [jira-issue]. The initial attempt [first-vote] to call a vote resulted in
> some confusion on the rules, as well the request for one additional
> submission. The second attempt [second-vote] yesterday had incorrect links
> for one of the submissions. I would like to call a new vote, now with more
> explicit instructions on how to vote, and corrected links.
> > >>
> > >> Please read the following rules carefully before submitting your vote.
> > >>
> > >> Who can vote?
> > >>
> > >> Anyone is welcome to cast a vote in support of their favorite
> submission(s). Note that only PMC member's votes are binding. If you are a
> PMC member, please indicate with your vote that the vote is binding, to
> ease collection of votes. In tallying the votes, I will attempt to verify
> only those marked as binding.
> > >>
> > >> How do I vote?
> > >>
> > >> Votes can be cast simply by replying to this email. It is a
> ranked-choice vote [rank-choice-voting]. Multiple selections may be made,
> where the order of preference must be specified. If an entry gets more than
> half the votes, it is the winner. Otherwise, the entry with the lowest
> number of votes is removed, and the votes are retallied, taking into
> account the next preferred entry for those whose first entry was removed.
> This process repeats until there is a winner.
> > >>
> > >> The entries are broken up by variants, since some entries have
> multiple color or style variations. The entry identifiers are first a
> capital letter, followed by a variation id (described with each entry
> below), if applicable. As an example, if you prefer variant 1 of entry A,
> followed by variant 2 of entry A, variant 3 of entry C, entry D, and lastly
> variant 4e of entry B, the following should be in your reply:
> > >>
> > >> (binding)
> > >> vote: A1, A2, C3, D, B4e
> > >>
> > >> Entries
> > >>
> > >> The entries are as follows:
> > >>
> > >> A. Submitted by Dustin Haver. This entry has two variants, A1 and A2.
> > >>
> > >> [A1]
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> > >> [A2]
> https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png
> > >>
> > >> B. Submitted by Stamatis Zampetakis. This has several variants.
> Within the linked entry there are 7 patterns and 7 color palettes. Any vote
> for B should contain the pattern number followed by the lowercase letter of
> the color palette. For example, B3e or B1a.
> > >>
> > >> [B]
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
> > >>
> > >> C. Submitted by Baris Kazar. This entry has 8 variants.
> > >>
> > >> [C1]
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
> > >> [C2]
> https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf
> > >> [C3]
> https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf
> > >> [C4]
> https://issues.apache.org/jira/secure/attachment/13006395/lucene_logo4_full.pdf
> > >> [C5]
> https://issues.apache.org/jira/secure/attachment/13006396/lucene_logo5_full.pdf
> > >> [C6]
> https://issues.apache.org/jira/secure/attachment/13006397/lucene_logo6_full.pdf
> > >> [C7]
> https://issues.apache.org/jira/secure/attachment/13006398/lucene_logo7_full.pdf
> > >> [C8]
> https://issues.apache.org/jira/secure/attachment/13006399/lucene_logo8_full.pdf
> > >>
> > >> D. The current Lucene logo.
> > >>
> > >> [D]
> https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
> > >>
> > >> Please vote for one of the above choices. This vote will close about
> one week from today, Mon, Sept 7, 2020 at 11:59PM.
> > >>
> > >> Thanks!
> > >>
> > >> [jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221
> > >> [first-vote]
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e
> > >> [second-vote]
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202009.mbox/%3cCA+DiXd7eBrQu5+aJQ3jKaUtUTJUqaG2U6o+kUZfNe-m=smn...@mail.gmail.com%3e
> > >> [rank-choice-voting]
> https://en.wikipedia.org/wiki/Instant-runoff_voting
> >
> > 

Re: Welcome Atri Sharma to the PMC

2020-08-20 Thread jim ferenczi
Welcome Atri!

Le jeu. 20 août 2020 à 22:00, Jan Høydahl  a écrit :

> Welcome Atri!
>
> Jan
>
> 20. aug. 2020 kl. 20:16 skrev Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com>:
>
> 
> I am pleased to announce that Atri Sharma has accepted the PMC's
> invitation to join.
>
> Congratulations and welcome, Atri!
>
>
>


Re: Welcome Namgyu Kim to the PMC

2020-08-03 Thread jim ferenczi
Congratulations Namgyu!

Le lun. 3 août 2020 à 18:27, Steve Rowe  a écrit :

> Congrats and welcome, Namgyu!
>
> --
> Steve
>
> > On Aug 2, 2020, at 7:18 PM, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
> >
> > I am pleased to announce that Namgyu Kim has accepted the PMC's
> invitation to join.
> >
> > Congratulations and welcome, Namgyu!
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Welcome Mayya Sharipova as Lucene/Solr committer

2020-06-08 Thread jim ferenczi
Hi all,

Please join me in welcoming Mayya Sharipova as the latest Lucene/Solr
committer.
Mayya, it's tradition for you to introduce yourself with a brief bio.

Congratulations and Welcome!

Jim


Re: [VOTE] Solr to become a top-level Apache project (TLP)

2020-05-12 Thread jim ferenczi
+1 (binding)

On Tue, 12 May 2020 at 14:00, Simon Willnauer  wrote:

> +1 binding
>
> Sent from a mobile device
>
> > On 12. May 2020, at 13:33, Jason Gerlowski 
> wrote:
> >
> > -1 (binding)
> >
> >> On Tue, May 12, 2020 at 7:31 AM Alan Woodward 
> wrote:
> >>
> >> +1 (binding)
> >>
> >> Alan Woodward
> >>
>  On 12 May 2020, at 12:06, Jan Høydahl  wrote:
> >>>
> >>> +1 (binding)
> >>>
> >>> Jan Høydahl
> >>>
>  On 12 May 2020, at 09:36, Dawid Weiss  wrote:
> 
>  Dear Lucene and Solr developers!
> 
>  According to an earlier [DISCUSS] thread on the dev list [2], I am
>  calling for a vote on the proposal to make Solr a top-level Apache
>  project (TLP) and separate Lucene and Solr development into two
>  independent entities.
> 
>  To quickly recap the reasons and consequences of such a move: it seems
>  like the reasons for the initial merge of Lucene and Solr, around 10
>  years ago, have been achieved. Both projects are in good shape and
>  exhibit signs of independence already (mailing lists, committers,
>  patch flow). There are many technical considerations that would make
>  development much easier if we move Solr out into its own TLP.
> 
>  We discussed this issue [2] and both PMC members and committers had a
>  chance to review all the pros and cons and express their views. The
>  discussion showed that there are clearly different opinions on the
>  matter - some people are in favor, some are neutral, others are
>  against or not seeing the point of additional labor. Realistically, I
>  don't think reaching 100% level consensus is going to be possible --
>  we are a diverse bunch with different opinions and personalities. I
>  firmly believe this is the right direction hence the decision to put
>  it under the voting process. Should something take a wrong turn in the
>  future (as some folks worry it may), all blame is on me.
> 
>  Therefore, the proposal is to separate Solr from under Lucene TLP, and
>  make it a TLP on its own. The initial structure of the new PMC,
>  committer base, git repositories and other managerial aspects can be
>  worked out during the process if the decision passes.
> 
>  Please indicate one of the following (see [1] for guidelines):
> 
>  [ ] +1 - yes, I vote for the proposal
>  [ ] -1 - no, I vote against the proposal
> 
>  Please note that anyone in the Lucene+Solr community is invited to
>  express their opinion, though only Lucene+Solr committers cast binding
>  votes (indicate non-binding votes in your reply, please).
> 
>  The vote will be active for a week to give everyone a chance to read
>  and cast a vote.
> 
>  Dawid
> 
>  [1] https://www.apache.org/foundation/voting.html
>  [2]
> https://lists.apache.org/thread.html/rfae2440264f6f874e91545b2030c98e7b7e3854ddf090f7747d338df%40%3Cdev.lucene.apache.org%3E
> 
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>  For additional commands, e-mail: dev-h...@lucene.apache.org
> 
> >>>
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: 7.7.3 bugfix release

2020-04-16 Thread jim ferenczi
Hi,

I merged LUCENE-9300 <https://issues.apache.org/jira/browse/LUCENE-9300> in
the 7.7 branch.

> I shall cut the branch in a day or two

I guess you meant create the first RC since the branch is already created
(branch_7_7) ;)

Thanks,
Jim


On Wed, 15 Apr 2020 at 09:21, Noble Paul  wrote:

> Hi, Please merge all the required changes to the branch branch_7_7
>
> I shall cut the branch in a day or two
>
> On Mon, Apr 6, 2020 at 6:14 PM jim ferenczi 
> wrote:
> >
> > Hi Paul,
> > Ignacio has started the release process for a bug fix release of 8.5.1
> last week.
> > We cannot have two releases at the same time so would you agree to start
> 7.7.3 after 8.5.1 is out?
> > I'd also like to backport LUCENE-9300 in 7.7 (the reason why we started
> the 8.5.1 release) so don't hesitate if you need help or to delegate the
> release if you don't have the time at the moment.
> >
> > - Jim
> >
> >
> > On Tue, 18 Feb 2020 at 18:35, Houston Putman  wrote:
> >>
> >> I've backported SOLR-13669. After you add in SOLR-14013 Noble, we
> should be good to go with 7.7.3 I think.
> >>
> >> - Houston
> >>
> >> On Fri, Feb 14, 2020 at 1:17 PM Jan Høydahl 
> wrote:
> >>>
> >>> False alarm, I needed to update my branch :)
> >>>
> >>> Jan Høydahl
> >>>
> >>> On 14 Feb 2020, at 19:11, Jan Høydahl  wrote:
> >>>
> >>> What commit hash is the backport of SOLR-13971? I cannot find it and
> there is no CHANGES entry…?
> >>>
> >>> On 14 Feb 2020, at 17:52, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
> >>>
> >>> +1, Houston. That's my understanding as well. Please go ahead with the
> backport.
> >>>
> >>> On Fri, 14 Feb, 2020, 9:02 PM Houston Putman, 
> wrote:
> >>>>
> >>>> It looks like CVE-2019-17558 / SOLR-13971 has already been taken care
> of:
> https://issues.apache.org/jira/browse/SOLR-13971?focusedCommentId=17014356=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17014356
> >>>>
> >>>> So now CVE-2019-0193 / SOLR-13669 should be the only blocker. By the
> description in the JIRA, it looks like backporting
> https://github.com/apache/lucene-solr/commit/025f8763549151397284af28091cfd360307baa2
> should be enough. Is this correct, or am I missing something?
> >>>>
> >>>> - HOuston
> >>>>
> >>>> On Thu, Feb 13, 2020 at 12:59 PM Jan Høydahl 
> wrote:
> >>>>
> >>>> I’m afraid I don’t have the bandwidth the next couple of weeks.
> >>>>
> >>>> Jan Høydahl
> >>>>
> >>>> > On 13 Feb 2020, at 16:27, Noble Paul  wrote:
> >>>> >
> >>>> > Do you wish to backport them?
> >>>> >
> >>>> >> On Thu, Feb 13, 2020 at 7:55 PM Jan Høydahl 
> wrote:
> >>>> >>
> >>>> >> According to NVD, there are at least two published CVEs that
> affects 7.7.2 (CVE-2019-17558 / SOLR-13971 and CVE-2019-0193 / SOLR-13669).
> We cannot release 7.7.3 with these still present.
> >>>> >>
> >>>> >> Jan
> >>>> >>
> >>>> >> On 13 Feb 2020, at 06:42, Noble Paul  wrote:
> >>>> >>
> >>>> >> I'm planning to back port  SOLR-14013 and do a bug fix release
> soon.
> >>>> >> Please let me know if there is anything that you wish to be included
> >>>> >>
> >>>> >> --
> >>>> >> -
> >>>> >> Noble Paul
> >>>> >>
> >>>> >>
> -
> >>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>> >>
> >>>> >>
> >>>> >
> >>>> >
> >>>> > --
> >>>> > -
> >>>> > Noble Paul
> >>>> >
> >>>> >
> -
> >>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>> >
> >>>>
> >>>> -
> >>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>>
> >>>
>
>
> --
> -
> Noble Paul
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [VOTE] Release Lucene/Solr 8.5.1 RC1

2020-04-09 Thread jim ferenczi
+1

SUCCESS! [2:10:08.094546]

On Thu, 9 Apr 2020 at 10:19, Alan Woodward  wrote:

> +1
>
> SUCCESS! [1:18:54.574272]
>
> On 8 Apr 2020, at 21:21, Nhat Nguyen 
> wrote:
>
> +1
>
> SUCCESS! [0:52:20.920081]
>
>
> On Wed, Apr 8, 2020 at 6:31 AM Ignacio Vera  wrote:
>
>>
>> Please vote for release candidate 1 for Lucene/Solr 8.5.1
>>
>> The artifacts can be downloaded from:
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.5.1-RC1-revedb9fc409398f2c3446883f9f80595c884d245d0
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.5.1-RC1-revedb9fc409398f2c3446883f9f80595c884d245d0
>>
>> The vote will be open for at least 72 hours i.e. until 2020-04-15 11:00
>> UTC.
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1
>>
>> SUCCESS! [1:02:16.691004]
>>
>>
>


Re: 7.7.3 bugfix release

2020-04-06 Thread jim ferenczi
Hi Paul,
Ignacio has started the release process for a bug fix release of 8.5.1
last week.
We cannot have two releases at the same time so would you agree to start
7.7.3 after 8.5.1 is out?
I'd also like to backport LUCENE-9300 in 7.7 (the reason why we started
the 8.5.1 release) so don't hesitate if you need help or to delegate the
release if you don't have the time at the moment.

- Jim


On Tue, 18 Feb 2020 at 18:35, Houston Putman  wrote:

> I've backported SOLR-13669. After you add in
> SOLR-14013 Noble, we should be good to go with 7.7.3 I think.
>
> - Houston
>
> On Fri, Feb 14, 2020 at 1:17 PM Jan Høydahl  wrote:
>
>> False alarm, I needed to update my branch :)
>>
>> Jan Høydahl
>>
>> On 14 Feb 2020, at 19:11, Jan Høydahl  wrote:
>>
>> What commit hash is the backport of SOLR-13971? I cannot find it and
>> there is no CHANGES entry…?
>>
>> On 14 Feb 2020, at 17:52, Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>>
>> +1, Houston. That's my understanding as well. Please go ahead with the
>> backport.
>>
>> On Fri, 14 Feb, 2020, 9:02 PM Houston Putman, 
>> wrote:
>>
>>> It looks like CVE-2019-17558 / SOLR-13971 has already been taken care
>>> of:
>>> https://issues.apache.org/jira/browse/SOLR-13971?focusedCommentId=17014356=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17014356
>>>
>>> So now CVE-2019-0193 / SOLR-13669 should be the only blocker. By the
>>> description in the JIRA, it looks like backporting
>>> https://github.com/apache/lucene-solr/commit/025f8763549151397284af28091cfd360307baa2
>>> 
>>>  should
>>> be enough. Is this correct, or am I missing something?
>>>
>>> - HOuston
>>>
>>> On Thu, Feb 13, 2020 at 12:59 PM Jan Høydahl 
>>> wrote:
>>>
>>> I’m afraid I don’t have the bandwidth the next couple of weeks.
>>>
>>> Jan Høydahl
>>>
>>> > On 13 Feb 2020, at 16:27, Noble Paul  wrote:
>>> >
>>> > Do you wish to backport them?
>>> >
>>> >> On Thu, Feb 13, 2020 at 7:55 PM Jan Høydahl 
>>> wrote:
>>> >>
>>> >> According to NVD, there are at least two published CVEs that affects
>>> 7.7.2 (CVE-2019-17558 / SOLR-13971 and CVE-2019-0193 / SOLR-13669). We
>>> cannot release 7.7.3 with these still present.
>>> >>
>>> >> Jan
>>> >>
>>> >> On 13 Feb 2020, at 06:42, Noble Paul  wrote:
>>> >>
>>> >> I'm planning to back port  SOLR-14013 and do a bug fix release soon.
>>> >> Please let me know if there is anything that you wish to be included
>>> >>
>>> >> --
>>> >> -
>>> >> Noble Paul
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >>
>>> >>
>>> >
>>> >
>>> > --
>>> > -
>>> > Noble Paul
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>>


Re: Lucene/Solr 8.5.1 bugfix release

2020-04-03 Thread jim ferenczi
+1, thanks Ignacio.
I merged the fix for LUCENE-9300 and backported it to the
8.5 branch.

On Thu, 2 Apr 2020 at 21:48, Adrien Grand  wrote:

> My general take on this is that it's ok to upgrade a dependency in a patch
> release if the dependency upgrade itself is a new patch release of the same
> minor version. The changelog of Tika 1.24 seems to include not only bug
> fixes but also some enhancements[1], so I'd rather do an 8.6 release in the
> near future than backport this dependency upgrade to 8.5.
>
> [1] https://tika.apache.org/1.24/index.html
>
> On Thu, Apr 2, 2020 at 9:33 PM Cassandra Targett 
> wrote:
>
>> Should we consider backporting SOLR-14367 (the most recent Tika upgrade)?
>> It addresses a CVE in Tika, and while I think we usually avoid changing 3rd
>> party component versions in patch releases, maybe we should in this
>> case? The upgrade also looks like it was pretty straightforward (drop-in
>> replacement).
>>
>> Cassandra
>> On Apr 2, 2020, 12:47 PM -0500, Ignacio Vera , wrote:
>>
>> Hi,
>>
>> I propose a quick 8.5.1 bugfix release and I volunteer as RM. The main
>> motivation for this release is LUCENE-9300 where Jim addressed a serious
>> bug that can lead to data corruption when merging indices via IW#addIndices.
>>
>> If there are no objections I am planning to create a RC early next week.
>>
>> Best regards,
>>
>> Ignacio
>>
>>
>>
>>
>
> --
> Adrien
>


Re: [VOTE] Release Lucene/Solr 8.5.0 RC1

2020-03-17 Thread jim ferenczi
+1

SUCCESS! [1:18:55.683704]

On Tue, 17 Mar 2020 at 01:35, Mike Drob  wrote:

> +1 (non-binding)
>
> All testing was with Java 11.0.5
>
> Smoke tester didn't work (expected)
>
> Manually ran lucene and solr tests, had a few solr failures but they
> passed when rerunning individually.
> Went through the tutorial, have a few minor updates to make that I'll take
> care of in the next few days but nothing critical.
>
> On Mon, Mar 16, 2020 at 3:12 PM Tomás Fernández Löbbe <
> tomasflo...@gmail.com> wrote:
>
>> +1
>>
>> SUCCESS! [1:20:34.327672]
>>
>> On Mon, Mar 16, 2020 at 12:59 PM Kevin Risden  wrote:
>>
>>> +1
>>>
>>> SUCCESS! [1:24:43.574849]
>>>
>>> Kevin Risden
>>>
>>> On Mon, Mar 16, 2020 at 3:40 PM Nhat Nguyen
>>>  wrote:
>>> >
>>> > +1
>>> >
>>> > SUCCESS! [0:52:39.991003]
>>> >
>>> > On Mon, Mar 16, 2020 at 11:14 AM Cassandra Targett <
>>> casstarg...@gmail.com> wrote:
>>> >>
>>> >> I pushed the Solr Ref Guide DRAFT up this morning (thought I did it
>>> on Friday, sorry): https://lucene.apache.org/solr/guide/8_5/.
>>> >>
>>> >> Cassandra
>>> >> On Mar 15, 2020, 6:06 PM -0500, Uwe Schindler ,
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I instructed Policeman Jenkins to automatically test the release for
>>> me, the result (Java 8 / Java 9 combined Smoketesting):
>>> >>
>>> >> SUCCESS! [1:24:47.422173]
>>> >> (see
>>> https://jenkins.thetaphi.de/job/Lucene-Solr-Release-Tester/30/console)
>>> >>
>>> >> I also downloaded the artifacts and tested manually:
>>> >> - Solr starts and stops perfectly on Windows with whitespace in path
>>> name: Java 8, Java 11 and Java 14 (coming out soon)
>>> >> - Javadocs of Lucene look fine
>>> >> - JAR files look good
>>> >> - All links to repos in pom.xml and ant use HTTPS
>>> >>
>>> >> So I am fine with releasing this.
>>> >> +1 to RELEASE!
>>> >>
>>> >> Uwe
>>> >>
>>> >> -
>>> >> Uwe Schindler
>>> >> Achterdiek 19, D-28357 Bremen
>>> >> https://www.thetaphi.de
>>> >> eMail: u...@thetaphi.de
>>> >>
>>> >> -Original Message-
>>> >> From: Alan Woodward 
>>> >> Sent: Friday, March 13, 2020 3:27 PM
>>> >> To: dev@lucene.apache.org
>>> >> Subject: [VOTE] Release Lucene/Solr 8.5.0 RC1
>>> >>
>>> >> Please vote for release candidate 1 for Lucene/Solr 8.5.0
>>> >>
>>> >> The artifacts can be downloaded from:
>>> >> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.5.0-RC1-rev7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42
>>> >>
>>> >> You can run the smoke tester directly with this command:
>>> >>
>>> >> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>> >> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.5.0-RC1-rev7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42
>>> >>
>>> >> The vote will be open for three working days i.e. until next Tuesday,
>>> >> 2020-03-18 14:00 UTC.
>>> >>
>>> >> [ ] +1 approve
>>> >> [ ] +0 no opinion
>>> >> [ ] -1 disapprove (and reason why)
>>> >>
>>> >> Here is my +1
>>> >> -
>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >>
>>> >>
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>


Re: Lucene/Solr 8.4

2019-11-22 Thread jim ferenczi
+1

On Fri, 22 Nov 2019 at 10:08, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> +1
>
> On Fri, Nov 22, 2019 at 2:16 PM Atri Sharma  wrote:
> >
> > +1
> >
> > On Fri, Nov 22, 2019 at 2:08 PM Adrien Grand  wrote:
> > >
> > > Hello all,
> > >
> > > With Thanksgiving and then Christmas coming up, this is going to be a
> > > busy time for most of us. I'd like to get a new release before the end
> > > of the year, so I'm proposing the following schedule for Lucene/Solr
> > > 8.4:
> > >  - cutting the branch on December 12th
> > >  - building the first RC on December 14th
> > > and hopefully we'll have a release in the following week.
> > >
> > > --
> > > Adrien
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> >
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: [VOTE] Release Lucene/Solr 8.3.0 RC1

2019-10-22 Thread jim ferenczi
I backported the fix for https://issues.apache.org/jira/browse/LUCENE-9022 to
the 8.3 branch.
Thanks Uwe and Ishan.


On Tue, 22 Oct 2019 at 15:09, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Okay, Jim. Let's respin and have it included as well.
>
> On Tue, 22 Oct, 2019, 6:36 PM Uwe Schindler,  wrote:
>
>> Yes, I would suggest adding this change. It's a bug, and as the other
>> query was already fixed, it's needed for consistency, otherwise you would
>> see strange bugs.
>>
>> Uwe
>>
>> On October 22, 2019 12:38:18 PM UTC, jim ferenczi <
>> jim.feren...@gmail.com> wrote:
>>>
>>> If we respin I'd like to include
>>> https://issues.apache.org/jira/browse/LUCENE-9022 that makes all join
>>> queries non-eligible for the query cache.
>>> The fix is ready and approved so I can backport any time if you are ok
>>> with it.
>>>
>>> On Tue, 22 Oct 2019 at 00:04, Ishan Chattopadhyaya <
>>> ichattopadhy...@gmail.com> wrote:
>>>
>>>> +1 to re-spin if this can be fixed quickly enough.
>>>>
>>>> On Tue, Oct 22, 2019 at 3:30 AM David Smiley 
>>>> wrote:
>>>> >
>>>> > I just discovered this:
>>>> https://issues.apache.org/jira/browse/SOLR-13855  which is in effect
>>>> since 8.1, Distributed URP in cloud mode doesn't propagate finish().  I see
>>>> that Run URP further propagates this to the UpdateLog.  Not doing this
>>>> looks concerning to me.  The bug should be easy to fix, which I plan to do
>>>> if Bar Rotstein doesn't grab it very quickly.  Maybe you/others can
>>>> ascertain the seriousness but I want to raise awareness here.
>>>> >
>>>> > ~ David Smiley
>>>> > Apache Lucene/Solr Search Developer
>>>> > http://www.linkedin.com/in/davidwsmiley
>>>> >
>>>> >
>>>> > On Mon, Oct 21, 2019 at 1:51 PM Ishan Chattopadhyaya <
>>>> ichattopadhy...@gmail.com> wrote:
>>>> >>
>>>> >> Please vote for release candidate 1 for Lucene/Solr 8.3.0
>>>> >>
>>>> >> The artifacts can be downloaded from:
>>>> >>
>>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.3.0-RC1-revd796eca84dbabe3ae9b3c27afc01ef3bee35acb1
>>>> >>
>>>> >> You can run the smoke tester directly with this command:
>>>> >>
>>>> >> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>>> >>
>>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.3.0-RC1-revd796eca84dbabe3ae9b3c27afc01ef3bee35acb1
>>>> >>
>>>> >> The vote will be open for at least 3 working days, i.e. until
>>>> >> 2019-10-24 18:00 UTC.
>>>> >>
>>>> >> [ ] +1  approve
>>>> >> [ ] +0  no opinion
>>>> >> [ ] -1  disapprove (and reason why)
>>>> >>
>>>> >> Here is my +1
>>>> >>
>>>> >> -
>>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>> >>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>
>>>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de
>>
>


Re: [VOTE] Release Lucene/Solr 8.3.0 RC1

2019-10-22 Thread jim ferenczi
If we respin I'd like to include
https://issues.apache.org/jira/browse/LUCENE-9022 that makes all join
queries non-eligible for the query cache.
The fix is ready and approved so I can backport any time if you are ok with
it.

On Tue, 22 Oct 2019 at 00:04, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> +1 to re-spin if this can be fixed quickly enough.
>
> On Tue, Oct 22, 2019 at 3:30 AM David Smiley 
> wrote:
> >
> > I just discovered this: https://issues.apache.org/jira/browse/SOLR-13855
> which is in effect since 8.1, Distributed URP in cloud mode doesn't
> propagate finish().  I see that Run URP further propagates this to the
> UpdateLog.  Not doing this looks concerning to me.  The bug should be easy
> to fix, which I plan to do if Bar Rotstein doesn't grab it very quickly.
> Maybe you/others can ascertain the seriousness but I want to raise
> awareness here.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Mon, Oct 21, 2019 at 1:51 PM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
> >>
> >> Please vote for release candidate 1 for Lucene/Solr 8.3.0
> >>
> >> The artifacts can be downloaded from:
> >>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.3.0-RC1-revd796eca84dbabe3ae9b3c27afc01ef3bee35acb1
> >>
> >> You can run the smoke tester directly with this command:
> >>
> >> python3 -u dev-tools/scripts/smokeTestRelease.py \
> >>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.3.0-RC1-revd796eca84dbabe3ae9b3c27afc01ef3bee35acb1
> >>
> >> The vote will be open for at least 3 working days, i.e. until
> >> 2019-10-24 18:00 UTC.
> >>
> >> [ ] +1  approve
> >> [ ] +0  no opinion
> >> [ ] -1  disapprove (and reason why)
> >>
> >> Here is my +1
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Welcome Atri Sharma as Lucene/Solr committer

2019-09-18 Thread jim ferenczi
Congratulations Atri!

On Wed, 18 Sep 2019 at 09:28, Ignacio Vera  wrote:

> Welcome Atri!
>
> On Wed, Sep 18, 2019 at 9:12 AM Adrien Grand  wrote:
>
>> Hi all,
>>
>> Please join me in welcoming Atri Sharma as Lucene/ Solr committer!
>>
>> If you are following activity on Lucene, this name will likely sound
>> familiar to you: Atri has been very busy trying to improve Lucene over
>> the past months. In particular, Atri recently started improving our
>> top-hits optimizations like early termination on sorted indexes and
>> WAND, when indexes are searched using multiple threads.
>>
>> Congratulations and welcome! It is a tradition to introduce yourself
>> with a brief bio.
>>
>> --
>> Adrien
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-13 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8977) Handle punctuation characters in KoreanTokenizer

2019-09-13 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929027#comment-16929027
 ] 

Jim Ferenczi commented on LUCENE-8977:
--

I wonder why you think that this is an issue. Punctuation is removed by 
default, so this is only an issue if you want to use the Korean number filter?

> Handle punctuation characters in KoreanTokenizer
> 
>
> Key: LUCENE-8977
> URL: https://issues.apache.org/jira/browse/LUCENE-8977
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Namgyu Kim
>Priority: Minor
>
> As we discussed on LUCENE-8966, KoreanTokenizer now always divides continuous 
> punctuation marks into the first one and the rest.
>  (사이즈 => [사이즈] [.] [...])
>  But KoreanTokenizer doesn't divide when first character is punctuation.
>  (...사이즈 => [...] [사이즈])
> It looks like a result of the Viterbi path, but users may find the following 
> case strange:
>  ("사이즈" means "size" in Korean)
> ||Case #1||Case #2||
> |Input : "...사이즈..."|Input : "...4..4사이즈"|
> |Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.] [4] [사이즈]|
> From what I checked, Nori has punctuation characters (like . and ,) in the 
> dictionary but Kuromoji does not.
>  ("サイズ" means "size" in Japanese)
> ||Case #1||Case #2||
> |Input : "...サイズ..."|Input : "...4..4サイズ"|
> |Result : [...] [サイズ] [...]|Result : [...] [4] [..] [4] [サイズ]|
> There are some ways to resolve it, like hard-coding punctuation handling, but 
> none of them seem good.
>  So I think we need to discuss it.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-09 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925588#comment-16925588
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

I don't think it's a bug [~danmuzi], or at least that it's related to this 
issue. In your example the first dot ('.' is in the word dictionary) is 
considered a better path than grouping all dots eagerly. We process the unknown 
words greedily so we compare the path "[4], [.], [.]" with  "[4], [.], [.], 
[]", "[4], [.], [.], [.], [...]", ... "[4], [..]". Keeping the first 
dot separated from the rest indicates that a number followed by a dot is a 
better splitting path than multiple dots in our model. We can discuss this 
behavior in a new issue if you think this should be configurable (for instance 
the JapaneseTokenizer processes unknown words greedily only in search mode)?

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923394#comment-16923394
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

{quote}

Would you consider grouping numbers and (at least some) punctuation together so 
that we can preserve decimals and fractions?

{quote}

For complex number grouping and normalization, [~danmuzi] added a 
KoreanNumberFilter in https://issues.apache.org/jira/browse/LUCENE-8812

It is identical to the JapaneseNumberFilter except that it only detects Korean 
hangul numbers. I don't think it handles fractions though but this could be 
added if needed. 

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Attachment: LUCENE-8966.patch
Status: Patch Available  (was: Patch Available)

New patch without the dead code

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923357#comment-16923357
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

Thanks for looking [~thetaphi]. These two private static functions are dead 
code that I forgot to remove. The other places use Character.isDigit() 
consistently. 

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923222#comment-16923222
 ] 

Jim Ferenczi commented on LUCENE-8966:
--

Here is a patch that breaks unknown words on digits instead of grouping them 
with other types.
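
For illustration only, here is a simplified sketch of the boundary rule (a
hypothetical helper, not the actual patch, which also has to deal with common
and inherited scripts):

    // Group characters of an unknown word while they are compatible, but
    // force a split whenever a digit meets a non-digit, so that "44사이즈"
    // becomes [44][사이즈] instead of a single unknown token.
    static boolean shouldSplit(char previous, char current) {
      // digits are "common" to every script, so a script comparison alone
      // would keep "44" glued to "사이즈"; test digits explicitly first
      if (Character.isDigit(previous) != Character.isDigit(current)) {
        return true;
      }
      // otherwise keep grouping characters of the same script
      return Character.UnicodeScript.of(previous)
          != Character.UnicodeScript.of(current);
    }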

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Status: Patch Available  (was: Open)

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Attachment: LUCENE-8966.patch

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8966:
-
Description: 
Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
groups characters of unknown words if they belong to the same script or an 
inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
rest in Latin) but this rule doesn't work well on digits since they are 
considered common with other scripts. For instance the input "44사이즈" is kept as 
is even though "사이즈" is part of the dictionary. We should restore the original 
behavior and split any unknown words if a digit is followed by another type.

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]

  was:
Since LUCENE-XXX the Korean tokenizer groups characters of unknown words if 
they belong to the same script or an inherited one. This is ok for inputs like 
Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work 
well on digits since they are considered common with other scripts. For 
instance the input "44사이즈" is kept as is even though "사이즈" is part of the 
dictionary. We should restore the original behavior and split any unknown 
words if a digit is followed by another type.

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]


> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Jim Ferenczi (Jira)
Jim Ferenczi created LUCENE-8966:


 Summary: KoreanTokenizer should split unknown words on digits
 Key: LUCENE-8966
 URL: https://issues.apache.org/jira/browse/LUCENE-8966
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Since LUCENE-XXX the Korean tokenizer groups characters of unknown words if 
they belong to the same script or an inherited one. This is ok for inputs like 
Мoscow (with a Cyrillic М and the rest in Latin) but this rule doesn't work 
well on digits since they are considered common with other scripts. For 
instance the input "44사이즈" is kept as is even though "사이즈" is part of the 
dictionary. We should restore the original behavior and split any unknown 
words if a digit is followed by another type.

This issue was first discovered in 
[https://github.com/elastic/elasticsearch/issues/46365]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Jim Ferenczi (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8959:
-
Description: Today the JapaneseNumberFilter tries to concatenate numbers 
even if they are separated by whitespaces. So for instance "10 100" is 
rewritten into "10100" -even if the tokenizer doesn't discard punctuations-. In 
practice this is not an issue but this can lead to a giant number of tokens if 
there are a lot of numbers separated by spaces. The number of concatenation 
should be configurable with a sane default limit in order to avoid creating 
giant tokens that slow down the analysis if the tokenizer is not correctly 
configured.  (was: Today the JapaneseNumberFilter tries to concatenate numbers 
even if they are separated by whitespaces. So for instance "10 100" is 
rewritten into "10100" even if the tokenizer doesn't discard punctuations. In 
practice this is not an issue but this can lead to a giant number of tokens if 
there are a lot of numbers separated by spaces. The number of concatenation 
should be configurable with a sane default limit in order to avoid creating big 
tokens that slow down the analysis.)

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --
>
> Key: LUCENE-8959
> URL: https://issues.apache.org/jira/browse/LUCENE-8959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespaces. So for instance "10 100" is rewritten into "10100" 
> -even if the tokenizer doesn't discard punctuations-. In practice this is not 
> an issue but this can lead to a giant number of tokens if there are a lot of 
> numbers separated by spaces. The number of concatenation should be 
> configurable with a sane default limit in order to avoid creating giant 
> tokens that slow down the analysis if the tokenizer is not correctly 
> configured.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918532#comment-16918532
 ] 

Jim Ferenczi commented on LUCENE-8959:
--

*Update:* Whitespaces were removed in my tests because I was using the default 
JapanesePartOfSpeechStopFilter before the JapaneseNumberFilter. The behavior is 
correct when discardPunctuations is correctly set and the 
JapanesePartOfSpeechStopFilter is the first filter in the chain. We could 
protect against the rabbit hole for users that forget to set 
discardPunctuations to false or remove the whitespaces in a preceding filter 
but the behavior is correct. Sorry for the false alarm.

> JapaneseNumberFilter does not take whitespaces into account when 
> concatenating numbers
> --
>
> Key: LUCENE-8959
> URL: https://issues.apache.org/jira/browse/LUCENE-8959
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
> separated by whitespaces. So for instance "10 100" is rewritten into "10100" 
> even if the tokenizer doesn't discard punctuations. In practice this is not 
> an issue but this can lead to a giant number of tokens if there are a lot of 
> numbers separated by spaces. The number of concatenation should be 
> configurable with a sane default limit in order to avoid creating big tokens 
> that slow down the analysis.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

2019-08-29 Thread Jim Ferenczi (Jira)
Jim Ferenczi created LUCENE-8959:


 Summary: JapaneseNumberFilter does not take whitespaces into 
account when concatenating numbers
 Key: LUCENE-8959
 URL: https://issues.apache.org/jira/browse/LUCENE-8959
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Today the JapaneseNumberFilter tries to concatenate numbers even if they are 
separated by whitespaces. So for instance "10 100" is rewritten into "10100" 
even if the tokenizer doesn't discard punctuations. In practice this is not an 
issue but this can lead to a giant number of tokens if there are a lot of numbers 
separated by spaces. The number of concatenation should be configurable with a 
sane default limit in order to avoid creating big tokens that slow down the 
analysis.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-12 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905332#comment-16905332
 ] 

Jim Ferenczi commented on LUCENE-8943:
--

{quote}

Your post made me think of the problem in another way. If we had something like 
MultiWordsSynonymQuery, we could have even more control. Similar to 
SynonymQuery we could use one IDF value for all synonyms. Synonym boost would 
work much more reliably.

{quote}

 

Yes, that's what I tried to explain in my post. It is a specific issue with 
multi-words synonyms so we should have a dedicated query. 

 

{quote}

Usually the values for pseudoStats would be computed bottom up (SpanWeight, 
PhraseWeight) from the subqueries. But we could implement a general 
MultiWordsSynonymQuery as subclass of BooleanQuery (only allowing disjunction) 
which would set (adapt) pseudoStats in all its subweights (docFreq as max 
docFreq of all synonyms just as SynonymQuery currently does).

{quote}

 

+1, that's how I'd start with this. We don't need to handle all types of queries 
though, only Term (e.g.: body:ny), conjunction of Term queries (e.g.: body:new 
AND body:york) and phrase queries (e.g.: "new york") should be accepted.
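
For illustration, the blending step could look roughly like this (a
hypothetical sketch of the pseudoStats idea; variantStats is assumed to hold
one TermStatistics per accepted variant, e.g. "wifi" and the phrase "wi fi"):

    // Blend all variants into one pseudo-term so that idfExplain sees the
    // same statistics for every variant, as SynonymQuery already does for
    // synonyms at a single position. (TermStatistics and BytesRef are from
    // org.apache.lucene.search and org.apache.lucene.util.)
    long docFreq = 0;
    long totalTermFreq = 0;
    for (TermStatistics stats : variantStats) {
      docFreq = Math.max(docFreq, stats.docFreq());
      totalTermFreq = Math.max(totalTermFreq, stats.totalTermFreq());
    }
    TermStatistics pseudoStats =
        new TermStatistics(new BytesRef("pseudo"), docFreq, totalTermFreq);
    // pseudoStats would then replace the real per-term statistics in each
    // sub-weight of the disjunction.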

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.
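
In formula form, the two cases the description distinguishes (a sketch, with
$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$):

    % phrase: independence justifies summing the logs
    P(\text{new} \wedge \text{york}) \approx P(\text{new}) \cdot P(\text{york})
    \;\Rightarrow\; \mathrm{idf}(\text{new york}) \approx
    \mathrm{idf}(\text{new}) + \mathrm{idf}(\text{york})

    % alternatives at one position: the union can only match more documents
    \mathrm{df}(t_1 \vee t_2) \ge \max(\mathrm{df}(t_1), \mathrm{df}(t_2))
    \;\Rightarrow\; \mathrm{idf}(t_1 \vee t_2) \le
    \min(\mathrm{idf}(t_1), \mathrm{idf}(t_2))

so the minimum idf (the docFreq of the most frequent alternative) is the safe
approximation for a MultiPhraseQuery position, while summing overestimates.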



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-09 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903968#comment-16903968
 ] 

Jim Ferenczi commented on LUCENE-8943:
--

I don't think we can realistically approximate the doc freq of phrases, 
especially if you consider more than 2 terms. The issue with the score 
difference of "wifi" (single term) vs "wi fi" (multiple terms) is more a 
synonym issue where the association between these terms is made at search time. 
Currently BM25 similarity sums the idf values but this was done to limit the 
difference with the classic (tfidf) similarity. The other similarities take a 
simpler approach that just sums the score of each term that appears in the query 
like a boolean query would do (see MultiSimilarity). It's difficult to pick one 
approach over the other here but the context is important. For single-term 
synonyms (terms that appear at the same position) we have the SynonymQuery that 
is used to blend the score of such terms. I tend to agree that the 
MultiPhraseQuery should take the same approach so that each position can score 
once instead of per term. However it is difficult to expand this strategy to 
variable-length multi-words synonyms. We could try with a specialized 
MultiWordsSynonymQuery that would apply some strategy (approximation of the doc 
count like you propose or anything that makes sense here ;) ) to make sure that 
all variations are scored the same. Does this make sense?

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and since IDFs are logarithmic we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8747) Allow access to submatches from Matches instances

2019-08-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901091#comment-16901091
 ] 

Jim Ferenczi commented on LUCENE-8747:
--

Can we return a list of Matches in findNamedMatches? The set of strings is 
useful for testing purposes, but shouldn't it be easy to extract any named 
Matches from a global Matches object?

> Allow access to submatches from Matches instances
> -
>
> Key: LUCENE-8747
> URL: https://issues.apache.org/jira/browse/LUCENE-8747
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8747.patch, LUCENE-8747.patch, LUCENE-8747.patch, 
> LUCENE-8747.patch
>
>
> A Matches object currently allows access to all matching terms from a query, 
> but the structure of the matching query is flattened out, so if you want to 
> find which subqueries have matched you need to iterate over all matches, 
> collecting queries as you go.  It should be easier to get this information 
> from the parent Matches object.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8941) Build wildcard matches more lazily

2019-08-01 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898316#comment-16898316
 ] 

Jim Ferenczi commented on LUCENE-8941:
--

+1 the patch looks good. Can you add an assert in the additional test that 
checks that the number of segments in the reader is 1? This is required since 
the test makes some assumptions regarding the distribution of the terms and 
this could be broken easily if we change the setUp for the entire test suite 
later.

> Build wildcard matches more lazily
> --
>
> Key: LUCENE-8941
> URL: https://issues.apache.org/jira/browse/LUCENE-8941
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8941.patch
>
>
> When retrieving a Matches object from a multi-term query, such as an 
> AutomatonQuery or TermInSetQuery, we currently find all matching term 
> iterators up-front, to return a disjunction over all of them.  This can be 
> inefficient if we're only interested in finding out if anything matched, and 
> are iterating over a different field to retrieve offsets.
> We can improve this by returning immediately when the first matching term is 
> found, and only collecting other matching terms when we start iterating.
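
A rough sketch of the lazy pattern being proposed (firstMatchingTerm and 
fullDisjunction are hypothetical helpers, not the actual patch):

{code:java}
// Return as soon as one matching term proves the field matches; only build
// the full disjunction over all matching terms when iteration actually starts.
Matches matches(String field, LeafReaderContext context) throws IOException {
  MatchesIterator first = firstMatchingTerm(context); // cheap up-front check
  if (first == null) {
    return null; // no term matched, report no match at all
  }
  // MatchesUtils.forField only invokes the supplier when someone iterates
  return MatchesUtils.forField(field, () -> fullDisjunction(context));
}
{code}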



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-29 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8935.
--
   Resolution: Fixed
Fix Version/s: 8.3
   master (9.0)

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits 
> considerably if the total hit count is not requested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893713#comment-16893713
 ] 

Jim Ferenczi commented on LUCENE-8935:
--

Sorry, I misunderstood the logic, but the number of scoring clauses is already 
computed from the pruned list of scorers so the actual patch works. It's the 
scorer supplier that can be null, but in such a case it would not appear in 
Boolean2ScorerSupplier. 

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits 
> considerably if the total hit count is not requested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893708#comment-16893708
 ] 

Jim Ferenczi commented on LUCENE-8935:
--

The logic is already at the bottom of Boolean2ScorerSupplier#get but good call 
on the SHOULD clause that can produce a null scorer.

We can check the number of scoring clauses after the build instead of checking 
the number of scorer suppliers. I'll work on a fix.

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits 
> considerably if the total hit count is not requested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893677#comment-16893677
 ] 

Jim Ferenczi commented on LUCENE-8933:
--

{quote}
Should we go further and check that the concatenation of the segments is equal 
to the surface form?
{quote}

+1 too, the user dictionary should be used for segmentation purposes only. This 
would be a breaking change though, since users seem to abuse this functionality 
to normalize input (see example). Maybe we can check the length in 8.x and the 
content in master only?

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8935:
-
Attachment: LUCENE-8935.patch
Status: Open  (was: Open)

Here is a patch that wraps the boolean scorer in a constant score scorer when 
there is no scoring clause and the score mode is TOP_SCORES.
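
For illustration, a minimal sketch of the idea (not the actual patch; 
buildBooleanScorer and numScoringClauses are stand-ins for the 
Boolean2ScorerSupplier internals):

{code:java}
Scorer scorer = buildBooleanScorer(leadCost);
if (scoreMode == ScoreMode.TOP_SCORES && numScoringClauses == 0) {
  // every hit scores 0, so expose a constant-score scorer that TOP_SCORES
  // collectors can use to skip documents once enough hits were collected
  scorer = new ConstantScoreScorer(weight, 0f, scoreMode, scorer.iterator());
}
{code}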

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However since all documents have a score of 0 it should be possible to 
> early terminate the query as soon as we collected enough top hits. Wrapping 
> the resulting boolean scorer in a constant score scorer should allow early 
> termination in this case and would speed up the retrieval of top hits 
> considerably if the total hit count is not requested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8935:


 Summary: BooleanQuery with no scoring clauses cannot skip 
documents when running TOP_SCORES mode
 Key: LUCENE-8935
 URL: https://issues.apache.org/jira/browse/LUCENE-8935
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Today a boolean query that is composed of filtering clauses only (more than 
one) cannot skip documents when the search is executed with the TOP_SCORES 
mode. However since all documents have a score of 0 it should be possible to 
early terminate the query as soon as we collected enough top hits. Wrapping the 
resulting boolean scorer in a constant score scorer should allow early 
termination in this case and would speed up the retrieval of top hits 
considerably if the total hit count is not requested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-26 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893478#comment-16893478
 ] 

Jim Ferenczi commented on LUCENE-8933:
--

{quote}
If there are no other opinions or objections, I'd like to create a patch that 
add a validation rule to the UserDictionary.
{quote}

Thanks [~tomoko]!

{quote}
For purpose of format validation, I think it would be better that we check if 
the sum of length of segments is equal to the length of its surface form.
i.e., we also should not allow such entry "aabbcc,a b c,aa bb cc,pos_tag" even 
if this does not cause any exceptions.
{quote}

+1



> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892671#comment-16892671
 ] 

Jim Ferenczi commented on LUCENE-8933:
--

The first argument of the dictionary rule is the original block to detect and 
the second argument is the segmentation for the block. So the rule "aaa,aa a,," 
will split the input "aaa" into two tokens "aa" and "a". When computing the 
offsets of the split terms in the user dictionary we assume that the 
segmentation has the same characters as the input, minus the whitespaces. We 
don't check that this is the case, so rules with broken offsets are only 
detected when they match in a token stream. You don't need emojis or surrogate 
pairs to break this, just provide a rule where the length of the segmentation 
is greater than the input minus the whitespaces:
{code:java}
// a rule whose segmentation is inconsistent with its surface form
UserDictionary dict = UserDictionary.open(new StringReader("aaa,,,"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("aaa"));
tok.reset();
tok.incrementToken(); // the broken offsets are only detected here
{code}

I think we just need to validate the input and throw an exception if the 
assumptions are not met at build time.
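
A minimal sketch of such a build-time validation (a hypothetical helper, not 
the actual fix):

{code:java}
// Reject rules whose segmentation cannot be mapped back onto the surface
// form; whitespace in the segmentation only separates the segments.
static void validateRule(String surfaceForm, String segmentation) {
  String concatenated = segmentation.replace(" ", "");
  // compare only the lengths for a laxer (8.x-style) check
  if (!concatenated.equals(surfaceForm)) {
    throw new IllegalArgumentException("Illegal user dictionary entry: segmentation ["
        + segmentation + "] does not match surface form [" + surfaceForm + "]");
  }
}
{code}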



> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8889) Remove Dead Code From PointRangeQuery

2019-06-27 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873992#comment-16873992
 ] 

Jim Ferenczi commented on LUCENE-8889:
--

Why is it an issue? We have some use cases in Elasticsearch that require 
access to these points and I guess that there are other cases outside of 
Lucene. It's a library, so we should not expect that all accessors are used 
directly. We can add a test case if you're concerned by the fact that they are 
never used internally. 

> Remove Dead Code From PointRangeQuery
> -
>
> Key: LUCENE-8889
> URL: https://issues.apache.org/jira/browse/LUCENE-8889
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
>
> PointRangeQuery has accessors for the underlying points in the query but 
> those are never accessed. We should remove them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-26 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8859.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I haven't run 
> any benchmarks yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-7714) Optimize range queries for the sorted case

2019-06-26 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-7714.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~jtibshirani]!

> Optimize range queries for the sorted case
> --
>
> Key: LUCENE-7714
> URL: https://issues.apache.org/jira/browse/LUCENE-7714
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0), 8.2
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> It feels like we could make range queries faster when the index is sorted, 
> maybe by running on doc values, figuring out the first and last matching 
> documents with a binary search and returning a doc id set iterator that 
> iterates through this range of documents?
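
For reference, a minimal sketch of that idea, assuming the index sort and the 
range field coincide; firstDocWithValueAtLeast and lastDocWithValueAtMost 
stand in for the binary searches over doc values:

{code:java}
// Binary-search the boundary documents once, then match the whole doc range.
int first = firstDocWithValueAtLeast(reader, field, lowerValue);
int last = lastDocWithValueAtMost(reader, field, upperValue);
DocIdSetIterator disi = first > last
    ? DocIdSetIterator.empty()                 // no document in range
    : DocIdSetIterator.range(first, last + 1); // iterates docs in [first, last]
{code}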



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872213#comment-16872213
 ] 

Jim Ferenczi commented on LUCENE-8806:
--

I am testing with wikimediumall

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872169#comment-16872169
 ] 

Jim Ferenczi commented on LUCENE-8806:
--

{quote}
FYI we have an issue for phrases already LUCENE-8311.
{quote}

I forgot about this one, thanks! 

{quote}
I was thinking this could only get faster than before since we would now 
leverage two-phase iterators instead of using iterators naively.
{quote}

That was my assumption too, but not checking the two-phase iterator when looking 
for candidates forces the second clause (the one with the lowest score) to 
advance even when the first clause is a false positive. So it might be related 
to the fact that checking for a match on high-frequency phrases is faster than 
advancing the other clause.

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8848) UnifiedHighlighter should highlight all Query types that implement Weight.matches

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872150#comment-16872150
 ] 

Jim Ferenczi commented on LUCENE-8848:
--

The RandomIndexWriter is created but not closed if the condition at line 1367 
matches. I'll push a fix.

> UnifiedHighlighter should highlight all Query types that implement 
> Weight.matches
> -
>
> Key: LUCENE-8848
> URL: https://issues.apache.org/jira/browse/LUCENE-8848
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.2
>
> Attachments: LUCENE-8848.patch
>
>
> The UnifiedHighlighter internally extracts terms and automata from the query. 
>  Usually this works perfectly but it's possible a Query might be of a type it 
> doesn't know -- a leaf query that is perhaps in effect similar to a 
> MultiTermQuery yet it might not even be a subclass of this or it does but the 
> UH doesn't know how to extract an automaton from it.  The UH is oblivious to 
> this and probably won't highlight this query.  If re-analysis of the text is 
> necessary, the UH will pre-filter all terms to only those it _thinks_ are 
> pertinent.  Or if offsets are in the postings then the UH could perform very 
> poorly by unleashing this query on the index for each highlighted document 
> without recognizing re-analysis is a more appropriate path.
> I think to solve this, the UnifiedHighlighter.getFieldHighlighter needs to 
> inspect the query (using a QueryVisitor) to see if it can find a leaf query 
> that is not one it knows how to pull automata from, and is otherwise not in a 
> special list (like MatchAllDocsQuery).  If we find one, we avoid choosing 
> OffsetSource.POSTINGS or OffsetSource.NONE_NEEDED since we might in effect 
> have an MTQ like query.  If a MemoryIndex is needed then we don't pre-filter 
> the terms since we can't assume we know precisely which terms are pertinent.
> We needn't bother extracting terms & automata in this case either; it's 
> wasted effort which can involve building a CharacterRunAutomaton (see 
> MultiTermHighlighting.binaryToCharRunAutomaton).  Speaking of which, it'd be 
> nice to avoid that in other cases as well, like for WEIGHT_MATCHES when we 
> aren't using MemoryIndex (thus no term pre-filtering).
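
A sketch of the detection step described above (isUnknownToHighlighter is a 
hypothetical check, not the actual patch):

{code:java}
// Walk the query tree with a QueryVisitor and flag any leaf query the
// highlighter doesn't know how to extract terms or automata from.
boolean[] hasUnknownLeaf = new boolean[1];
query.visit(new QueryVisitor() {
  @Override
  public void visitLeaf(Query leafQuery) {
    if (isUnknownToHighlighter(leafQuery)) {
      hasUnknownLeaf[0] = true; // treat it like a potential MTQ-like query
    }
  }
});
if (hasUnknownLeaf[0]) {
  // avoid OffsetSource.POSTINGS / NONE_NEEDED and skip term pre-filtering
}
{code}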



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-06-25 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872142#comment-16872142
 ] 

Jim Ferenczi commented on LUCENE-8806:
--

I ran luceneutil with some disjunctions of phrase and term queries:
{noformat}
Task                     QPS baseline  StdDev   QPS patch  StdDev      Pct diff
HighPhraseHighTerm           8.47      (1.6%)      4.78    (2.6%)   -43.6% ( -47% -  -40%)
MedPhraseHighTerm           15.54      (1.2%)      9.41    (2.5%)   -39.5% ( -42% -  -36%)
HighPhraseHighPhrase         5.99      (1.4%)      3.65    (3.0%)   -39.0% ( -42% -  -35%)
HighPhraseLowPhrase         15.57      (1.2%)     14.26    (3.6%)    -8.4% ( -13% -   -3%)
LowPhraseLowPhrase          27.25      (2.0%)     31.75    (4.5%)    16.5% (   9% -   23%)
HighPhraseLowTerm           26.31      (0.9%)     31.42    (3.4%)    19.4% (  14% -   24%)
HighPhraseMedTerm           12.95      (1.0%)     15.74    (3.8%)    21.6% (  16% -   26%)
MedPhraseMedPhrase           9.21      (2.4%)     11.50    (8.3%)    24.9% (  13% -   36%)
MedPhraseLowTerm            24.85      (1.6%)     31.52    (5.5%)    26.8% (  19% -   34%)
MedPhraseLowPhrase          11.64      (2.3%)     15.06    (7.1%)    29.3% (  19% -   39%)
HighPhraseMedPhrase          8.27      (2.0%)     10.77    (7.2%)    30.2% (  20% -   40%)
MedPhraseMedTerm            14.53      (1.7%)     19.33    (5.6%)    33.0% (  25% -   40%)
{noformat}

While the change speeds up some cases, it also shows a non-negligible regression 
with high and med frequencies.
Currently the phrase scorer doesn't check impacts to compute the max score per 
block, so I tried to hack a simple patch that merges the impacts of the terms 
that appear in the phrase query. The patch keeps the minimum frequency per norm 
value in order to compute an upper bound of the score of the phrase query. I 
ran luceneutil again with the modified patch and the results are much better:
{noformat}
Task                     QPS baseline  StdDev   QPS patch  StdDev      Pct diff
HighPhraseHighTerm           8.22      (3.3%)      8.83    (1.9%)     7.4% (   2% -   12%)
LowPhraseLowPhrase          26.57      (0.7%)     28.55    (5.5%)     7.4% (   1% -   13%)
HighPhraseMedPhrase          7.98      (0.8%)      9.01    (5.0%)    12.9% (   7% -   18%)
MedPhraseMedPhrase           8.95      (1.4%)     10.11    (6.6%)    12.9% (   4% -   21%)
MedPhraseHighTerm           15.10      (1.1%)     17.69    (4.6%)    17.2% (  11% -   23%)
MedPhraseLowPhrase          11.17      (1.1%)     13.11    (4.9%)    17.4% (  11% -   23%)
HighPhraseLowPhrase         15.09      (1.5%)     18.85    (7.4%)    24.9% (  15% -   34%)
HighPhraseHighPhrase         5.75      (2.3%)      7.26    (4.5%)    26.2% (  18% -   33%)
HighPhraseLowTerm           25.68      (0.7%)     34.46    (2.4%)    34.2% (  30% -   37%)
MedPhraseMedTerm            14.23      (0.1%)     20.71    (2.3%)    45.5% (  43% -   47%)
MedPhraseLowTerm            24.30      (0.6%)     38.47    (2.4%)    58.3% (  55% -   61%)
HighPhraseMedTerm           12.77      (0.6%)     22.21    (3.1%)    73.9% (  69% -   77%)
{noformat}

However simple phrase queries (without disjunctions) seem to be slower with the 
merging of impacts:
{noformat}
Task            QPS baseline  StdDev   QPS patch  StdDev      Pct diff
HighPhrase          10.48     (0.0%)      9.74    (0.0%)    -7.1% (  -7% -   -7%)
MedPhrase           20.92     (0.0%)     20.25    (0.0%)    -3.2% (  -3% -   -3%)
LowPhrase           24.07     (0.0%)     23.33    (0.0%)    -3.1% (  -3% -   -3%)
{noformat}

I am not sure that the merging of impacts is correct so far, so I'll add some 
tests. It's also unrelated to this change (even if it helps for performance) so 
I'll open a separate issue to discuss the merging of impacts for phrase queries 
separately.
Considering the results of this change alone (two-phase iterator for the WAND) 
I will not merge it yet since it doesn't improve queries with lots of matches, 
but we can revisit when/if the merging of impacts for phrase queries is 
implemented. WDYT?
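
For reference, a hypothetical sketch of the impact-merging idea (the real 
patch works on per-block Impacts objects and must also handle norms that are 
missing from one term's list):

{code:java}
// For each norm value keep the *minimum* frequency across the phrase's terms:
// a phrase cannot occur more often than its rarest term, so this yields an
// upper bound on the phrase frequency used for the block's max score.
Map<Long, Integer> merged = new HashMap<>();
for (Map<Long, Integer> termImpacts : perTermImpacts) { // norm -> freq, one map per term
  for (Map.Entry<Long, Integer> e : termImpacts.entrySet()) {
    merged.merge(e.getKey(), e.getValue(), Math::min);
  }
}
{code}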

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.

[jira] [Commented] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-18 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866367#comment-16866367
 ] 

Jim Ferenczi commented on LUCENE-8859:
--

Thanks for looking, Adrien. Currently users can add the file extension (.lkp) to 
the list of files to preload, but I agree that it could be simplified. Are you 
concerned by the fact that we could preload a file even when it is consumed only 
once (i.e. if the postings format loads the FST on-heap)?

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I haven't run 
> any benchmarks yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8859:
-
Attachment: LUCENE-8859.patch

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I haven't run 
> any benchmarks yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863810#comment-16863810
 ] 

Jim Ferenczi commented on LUCENE-8859:
--

Here is a patch that exposes an option to force the load on-heap, off-heap or 
auto (making the decision based on the type of directory that is used, e.g. mmap 
vs others). [^LUCENE-8859.patch] 
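
A hypothetical sketch of that decision (the attached patch is authoritative; 
ByteBufferIndexInput is used here as a proxy for "mmap-backed"):

{code:java}
enum FSTLoadMode { ON_HEAP, OFF_HEAP, AUTO }

static boolean shouldLoadOffHeap(FSTLoadMode mode, IndexInput in) {
  switch (mode) {
    case OFF_HEAP: return true;   // always leave the FST off-heap
    case ON_HEAP:  return false;  // always load it onto the heap
    default:                      // AUTO: off-heap only when mmap-backed
      return in instanceof ByteBufferIndexInput;
  }
}
{code}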

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8859.patch
>
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I haven't run 
> any benchmarks yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8859:
-
Priority: Minor  (was: Major)

> Add an option to load the completion suggester's FST off-heap
> -
>
> Key: LUCENE-8859
> URL: https://issues.apache.org/jira/browse/LUCENE-8859
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
>
> Now that FSTs can be loaded off-heap 
> (https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to 
> expose this option in the completion suggester postings format. I haven't run 
> any benchmarks yet so I can't say if this really makes sense or not, but I 
> wanted to get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8635:
-
Priority: Major  (was: Minor)

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Fix For: 8.0, 8.x, master (9.0)
>
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this will be to lazily load FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose API for providing list of fields to load terms offheap. I'm 
> planning to take following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
>  
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8859) Add an option to load the completion suggester's FST off-heap

2019-06-14 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8859:


 Summary: Add an option to load the completion suggester's FST 
off-heap
 Key: LUCENE-8859
 URL: https://issues.apache.org/jira/browse/LUCENE-8859
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Now that FSTs can be loaded off-heap 
(https://issues.apache.org/jira/browse/LUCENE-8635) it would be nice to expose 
this option in the completion suggester postings format. I haven't run any 
benchmarks yet so I can't say if this really makes sense or not, but I wanted to 
get some opinions on whether this could be a good trade-off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-06-14 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8635:
-
Priority: Minor  (was: Major)

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Minor
> Fix For: 8.0, 8.x, master (9.0)
>
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this will be to lazily load FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose API for providing list of fields to load terms offheap. I'm 
> planning to take following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
>  
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8845) Allow maxExpansions to be set on multi-term Intervals

2019-06-11 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860693#comment-16860693
 ] 

Jim Ferenczi commented on LUCENE-8845:
--

+1

> Allow maxExpansions to be set on multi-term Intervals
> -
>
> Key: LUCENE-8845
> URL: https://issues.apache.org/jira/browse/LUCENE-8845
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 8.2
>
> Attachments: LUCENE-8845.patch
>
>
> MultiTermIntervalsSource has a maxExpansions parameter which is always set to 
> 128 by the factory methods Intervals.prefix() and Intervals.wildcard().  We 
> should keep 128 as the default, but also add additional methods that take a 
> configurable maximum.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8845) Allow maxExpansions to be set on multi-term Intervals

2019-06-10 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860252#comment-16860252
 ] 

Jim Ferenczi commented on LUCENE-8845:
--

{quote}
2) I think this is covered by the javadocs and the 'expert' marking. Some users 
really do need to see all expansions, and if they're aware of the trade-offs 
involved then I don't think we need any further hard caps.
{quote}

I think we should try to prevent users from shooting themselves in the foot. IMO 
this is more important than for other queries because reaching the limit throws 
an error, so I expect that users will raise the limit until they find a number 
that works for all queries. Can we add a hard limit equal to max_boolean_clause? 
This would be consistent with the discussion in 
https://issues.apache.org/jira/browse/LUCENE-8811, which should also check the 
number of sources in an interval query?
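
For context, the API shape under discussion looks roughly like this (the exact 
signatures are sketched from the issue description):

{code:java}
// The default factory keeps the 128 cap; the expert overload takes an
// explicit maximum number of expansions.
IntervalsSource capped = Intervals.prefix(new BytesRef("lucen"));       // max 128 expansions
IntervalsSource expert = Intervals.prefix(new BytesRef("lucen"), 1024); // explicit cap
{code}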

> Allow maxExpansions to be set on multi-term Intervals
> -
>
> Key: LUCENE-8845
> URL: https://issues.apache.org/jira/browse/LUCENE-8845
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 8.2
>
> Attachments: LUCENE-8845.patch
>
>
> MultiTermIntervalsSource has a maxExpansions parameter which is always set to 
> 128 by the factory methods Intervals.prefix() and Intervals.wildcard().  We 
> should keep 128 as the default, but also add additional methods that take a 
> configurable maximum.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: No email notifications from JIRA when attaching a patch

2019-06-10 Thread jim ferenczi
> Jim's update (the first link I shared) includes both an attachment
and a comment and we got neither of them.

I use the comment box when attaching a patch so I guess this is why the
notification was not triggered.



On Mon, 10 Jun 2019 at 19:20, Adrien Grand  wrote:

> This might explain why the second issue didn't trigger a notification,
> but Jim's update (the first link I shared) includes both an attachment
> and a comment and we got neither of them.
>
> On Mon, Jun 10, 2019 at 7:09 PM Uwe Schindler  wrote:
> >
> > Hi,
> >
> > I think, if you only attach a file with no comment, it produces no
> message. This is why I generally use dragndrop to attach patches. I move
> them from my file manager into a comment field, so it produces a new
> comment with the file name ready to be clicked from the comment.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -Original Message-
> > > From: Adrien Grand 
> > > Sent: Monday, June 10, 2019 6:34 PM
> > > To: Lucene Dev 
> > > Subject: Re: No email notifications from JIRA when attaching a patch
> > >
> > > Right, I'm not sure what exactly happens that makes JIRA not sent
> > > notifications, some attachments get notifications, others don't. Here
> > > is another example that didn't trigger an email notification:
> > > https://issues.apache.org/jira/browse/LUCENE-8654. The issue has an
> > > attached patch, but there is nothing about it in the archives:
> > >
> https://lists.apache.org/list.html?dev@lucene.apache.org:gte=1d:Polygon2D
> > > %23relateTriangle%20.
> > >
> > > I opened https://issues.apache.org/jira/browse/INFRA-18587.
> > >
> > >
> > > On Mon, Jun 10, 2019 at 4:18 PM Cassandra Targett
> > >  wrote:
> > > >
> > > > Adrien,
> > > >
> > > > Do you mean you want a notification to be sent when someone attaches
> a
> > > patch file to a Jira issue? If I understand how the LUCENE project is
> set up to
> > > send notifications, one should be sent.
> > > >
> > > > Both LUCENE and SOLR projects use the same Notification Scheme
> > > (https://issues.apache.org/jira/plugins/servlet/project-
> > > config/LUCENE/notifications), which is configured to send a mail to
> the list on
> > > any issue update. AFAIUI, adding an attachment is considered an issue
> > > update.
> > > >
> > > > You can see that a notification is sent properly for a SOLR issue:
> > > https://lists.apache.org/list.html?dev@lucene.apache.org:lte=1M:SOLR-
> > > 11263. Since LUCENE and SOLR are using the same scheme, it should
> behave
> > > the same.
> > > >
> > > > I think it might be worth checking with Infra on it, since as far as
> I can tell
> > > things are set up properly and I think it will involve some system
> admin-level
> > > permissions to dig any deeper into why it’s not working for both
> projects.
> > > >
> > > > Cassandra
> > > > On Jun 10, 2019, 3:26 AM -0500, Adrien Grand ,
> > > wrote:
> > > >
> > > > Hello,
> > > >
> > > > It seems like attaching a patch doesn't trigger a notification to the
> > > > list. For instance we got no email notification for the following
> > > > update to LUCENE-8806:
> > > > https://issues.apache.org/jira/browse/LUCENE-
> > > 8806?focusedCommentId=16858420=com.atlassian.jira.plugin.system.
> > > issuetabpanels:comment-tabpanel#comment-16858420,
> > > > you can double check via the archives, which have my comment as the
> > > > last update on the issue:
> > > >
> > >
> https://lists.apache.org/list.html?dev@lucene.apache.org:lte=1M:WANDScor
> > > er
> > > >
> > > > I have been bitten a couple more times by it, I can try to find which
> > > > JIRA issues exactly if that helps.
> > > >
> > > > This made me miss a couple patches for review recently. I'm not very
> > > > familiar with JIRA, there doesn't seem to be anything wrong with the
> > > > way notifications are configured today[1], do we need to contact
> > > > infra for this?
> > > >
> > > > [1] https://issues.apache.org/jira/plugins/servlet/project-
> > > config/LUCENE/notifications
> > > >
> > > > --
> > > > Adrien
> > > >
> > > > -
> > > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > > >
> > >
> > >
> > > --
> > > Adrien
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: 

[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-10 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859778#comment-16859778
 ] 

Jim Ferenczi commented on LUCENE-8812:
--

Thanks [~danmuzi]!

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery

2019-06-07 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858600#comment-16858600
 ] 

Jim Ferenczi commented on LUCENE-8840:
--

{quote}
I am curious to understand how including doc frequencies can be better than the 
overall score. IMO, including BM25 scores gives us some additional advantages, 
> such as defending against cases where the overall non-matching token count in a 
document is significantly high. Did you see any scenarios that had relevance 
troubles due to inclusion of entire BM25 scores?
{quote}

The idea of the SynonymQuery is to score the terms as if they were indexed as a 
single term. I think this fits nicely with the fuzzy query. For instance, 
imagine a fuzzy query with the terms "bad" and "baz". With the current solution, 
if a document contains both terms it will rank significantly higher than 
documents that contain only one of them. This can change depending on the inner 
doc frequencies but this doesn't seem right IMO. On the contrary, the synonym 
query would give the same score to a document containing "baz" with a frequency 
of 4 as to another document containing "bad" and "baz" 2 times each. This feels 
more natural to me because we shouldn't favor documents that contain multiple 
variations of the same fuzzy term.
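
As an illustration, this is roughly how the rewrite could blend the fuzzy 
terms with a SynonymQuery (the field name and boost values here are made up):

{code:java}
// Score "bad" and "baz" as if they were one term; boosts in [0,1] are
// supported since LUCENE-8652, so fuzzier variants can be down-weighted.
SynonymQuery.Builder builder = new SynonymQuery.Builder("body");
builder.addTerm(new Term("body", "bad"), 1.0f);  // closer variant, full weight
builder.addTerm(new Term("body", "baz"), 0.75f); // more distant variant
Query blended = builder.build();
{code}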

{quote}
On a different note, I am also wondering if we should devise relevance tests 
which allow us to measure the relevance impact of a change. Something added to 
luceneutil should be nice. Thoughts?
{quote}

That would be great, but this doesn't look like low-hanging fruit. Maybe open 
a separate issue to discuss?

{quote}
IMO if we want to restrict the contribution of each term to the blended query's 
final score, then we could think of a blended scorer step which utilizes 
something on the lines of BM25's term frequency saturation when merging scores 
from different blended terms. WDYT?
{quote}

I am not sure I fully understand but the SynonymQuery kind of does that. It 
sums the inner doc frequencies of all matching terms to ensure that the 
contribution of each term to the final score is bounded. 

> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> -
>
> Key: LUCENE-8840
> URL: https://issues.apache.org/jira/browse/LUCENE-8840
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite 
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that 
> match the fuzzy terms. This query blends the frequencies used for scoring 
> across the terms and creates a disjunction of all the blended terms. This 
> means that each fuzzy term that matches in a document will add its BM25 score 
> contribution. We already have a query that can blend the statistics of 
> multiple terms in a single scorer that sums the doc frequencies rather than 
> the entire BM25 score: the SynonymQuery. Since 
> https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles 
> boost between 0 and 1 so it should be easy to change the default rewrite 
> method for Fuzzy queries to use it instead of the BlendedTermQuery. This 
> would bound the contribution of each term to the final score which seems a 
> better alternative in terms of relevancy than the current solution. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8840) TopTermsBlendedFreqScoringRewrite should use SynonymQuery

2019-06-07 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8840:


 Summary: TopTermsBlendedFreqScoringRewrite should use SynonymQuery
 Key: LUCENE-8840
 URL: https://issues.apache.org/jira/browse/LUCENE-8840
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite 
method for Fuzzy queries, uses the BlendedTermQuery to score documents that 
match the fuzzy terms. This query blends the frequencies used for scoring 
across the terms and creates a disjunction of all the blended terms. This means 
that each fuzzy term that matches in a document will add its BM25 score 
contribution. We already have a query that can blend the statistics of multiple 
terms in a single scorer that sums the doc frequencies rather than the entire 
BM25 score: the SynonymQuery. Since 
https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles boost 
between 0 and 1 so it should be easy to change the default rewrite method for 
Fuzzy queries to use it instead of the BlendedTermQuery. This would bound the 
contribution of each term to the final score which seems a better alternative 
in terms of relevancy than the current solution. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-06-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857911#comment-16857911
 ] 

Jim Ferenczi commented on LUCENE-8812:
--

Sorry I didn't see your reply. I agree with you that it is ambiguous to put it 
in analysis-common so +1 to add it in the nori module for now and revisit 
if/when we create a separate module for the mecab tokenizer. 

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Namgyu Kim as Lucene/Solr committer

2019-06-05 Thread jim ferenczi
Welcome Namgyu!

On Wed, 5 Jun 2019 at 13:54, Ignacio Vera  wrote:

> Welcome!
>
> On Wed, Jun 5, 2019 at 1:53 PM Michael Sokolov  wrote:
>
>> Namgyu! Welcome
>>
>> Mike
>>
>> On Mon, Jun 3, 2019 at 1:52 PM Adrien Grand  wrote:
>> >
>> > Hi all,
>> >
>> > Please join me in welcoming Namgyu Kim as Lucene/ Solr committer!
>> >
>> > Kim has been helping address technical debt and fixing bugs in the
>> > last year, including a cleanup to our DutchAnalyzer[0] and
>> > improvements to the StoredFieldsVisitor API[1]. More recently he also
>> > started improving our korean analyzer[2].
>> >
>> > [0] https://issues.apache.org/jira/browse/LUCENE-8582
>> > [1] https://issues.apache.org/jira/browse/LUCENE-8805
>> > [2] https://issues.apache.org/jira/browse/LUCENE-8784
>> >
>> > Congratulations and welcome! It is a tradition to introduce yourself
>> > with a brief bio.
>> >
>> > --
>> > Adrien
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


[jira] [Commented] (LUCENE-8812) add KoreanNumberFilter to Nori(Korean) Analyzer

2019-05-30 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851778#comment-16851778
 ] 

Jim Ferenczi commented on LUCENE-8812:
--

The patch looks good [~danmuzi]. I wonder if it would be difficult to have a 
base class for the Japanese and Korean number filters since they share a large 
amount of code. However, I think it's ok to merge this first and tackle the 
refactoring in a follow-up, wdyt?

> add KoreanNumberFilter to Nori(Korean) Analyzer
> ---
>
> Key: LUCENE-8812
> URL: https://issues.apache.org/jira/browse/LUCENE-8812
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8812.patch
>
>
> This is a follow-up issue to LUCENE-8784.
> The KoreanNumberFilter is a TokenFilter that normalizes Korean numbers to 
> regular Arabic decimal numbers in half-width characters.
> Logic is similar to JapaneseNumberFilter.
> It should be able to cover the following test cases.
> 1) Korean Word to Number
> 십만이천오백 => 102500
> 2) 1 character conversion
> 일영영영 => 1000
> 3) Decimal Point Calculation
> 3.2천 => 3200
> 4) Comma between three digits
> 4,647.0010 => 4647.001



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851574#comment-16851574
 ] 

Jim Ferenczi commented on LUCENE-8816:
--

This sounds like a great plan [~tomoko]. Decoupling the system dictionary 
should help with the merge of the Korean tokenizer, but I agree that this merge 
is out of scope for this issue.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, the default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many 
> years. While it has slowly become obsolete, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Some attempts/projects/efforts have 
> been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it's a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple the JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (it's up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.
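
To make the decoupling goal concrete, here is a purely hypothetical sketch 
(invented names, not an API from this issue or from Lucene) of what such a 
dictionary contract could look like:

{code:java}
// Hypothetical illustration only.
interface SystemDictionaryProvider {
  TokenInfoDictionary tokenInfos();    // term entries, POS data, word costs
  ConnectionCosts connectionCosts();   // leftId/rightId transition cost matrix
  UnknownDictionary unknownWords();    // character-class fallbacks
}

// A tokenizer could then take a provider instead of reading bundled resources:
// new JapaneseTokenizer(provider, userDictionary, discardPunctuation, mode);
{code}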



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849778#comment-16849778
 ] 

Jim Ferenczi commented on LUCENE-8816:
--

We discussed this when we added the Korean module and said that we could have a 
separate module to handle "mecab-like" tokenization and one module per 
dictionary (ipadic, mecab-ko-dic, ...). There are some assertions in the 
JapaneseTokenizer that check some invariants of the ipadic (leftId == rightId 
for instance) but I guess we could move them into the dictionary module. This 
could be a nice cleanup if the goal is to handle multiple mecab dictionaries 
(in different languages).

 

{quote}

While it has been slowly obsoleted, well-maintained and/or extended 
dictionaries risen up in recent years (e.g. 
[mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
[UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
attempts/projects/efforts are made in Japan.

{quote}

 

While allowing more flexibility would be nice, I wonder if there are that many 
different dictionaries. If the ipadic is obsolete we could also adapt the main 
distribution (kuromoji) to use the UniDic instead. Even if we handle multiple 
dictionaries we'll still need to provide a way for users to add custom entries. 
Mecab has an option to compute the leftId, rightId and cost automatically from 
a partial user entry, so I wonder if this could help users avoid reimplementing 
a dictionary from scratch?

 

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, the default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many 
> years. While it has slowly become obsolete, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Some attempts/projects/efforts have 
> been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it's a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple the JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (it's up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-27 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8784.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~danmuzi]! I pushed to master and branch_8x. I also removed the 
discardPunctuations option from the KoreanAnalyzer in order to be consistent 
with the JapaneseAnalyzer. It's an advanced option that should be used with a 
specific token filter in mind (KoreanNumberFilter for instance).

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option to keep the decimal point or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the 
> decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847695#comment-16847695
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

The last patch for this issue looks good to me. I'll test locally and merge if 
all tests pass. 

Thanks for opening LUCENE-8812, I'll take a look when this issue gets merged.

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch, 
> LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option to keep the decimal point or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the 
> decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847455#comment-16847455
 ] 

Jim Ferenczi commented on LUCENE-8788:
--

{quote}

I like the idea [~jim.ferenczi] proposed. I can open a Jira for that and work 
on a patch for it as well, unless Jim wants to do it himself?

{quote}

Something is needed for the search side and this issue is the right place to 
add such functionality. I wonder if we need an issue for the merge side, 
though, since it's already possible to change the order of segments in a custom 
FilterMergePolicy. I tried it in a POC and the change is trivial, so I am not 
sure that we need to do anything in core.
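
For reference, the POC amounts to something like this (a rough sketch against 
the current MergePolicy API, not the actual patch; the sort key is just an 
example):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.*;

// Wraps another policy and reorders the segments of each merge,
// here by descending maxDoc.
public class ReorderingMergePolicy extends FilterMergePolicy {
  public ReorderingMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
      MergeContext context) throws IOException {
    MergeSpecification spec = in.findMerges(trigger, infos, context);
    if (spec == null) {
      return null;
    }
    MergeSpecification reordered = new MergeSpecification();
    for (OneMerge merge : spec.merges) {
      List<SegmentCommitInfo> segments = new ArrayList<>(merge.segments);
      segments.sort(Comparator.comparingInt((SegmentCommitInfo s) -> -s.info.maxDoc()));
      reordered.add(new OneMerge(segments));
    }
    return reordered;
  }
}
{code}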

> Order LeafReaderContexts by Estimated Number Of Hits
> 
>
> Key: LUCENE-8788
> URL: https://issues.apache.org/jira/browse/LUCENE-8788
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We offer no guarantee on the order in which an IndexSearcher will look at 
> segments during a search operation. This can be improved for use cases where 
> an engine using Lucene invokes early termination and uses the partially 
> collected hits. A better model would be if we sorted segments by the 
> estimated number of hits, thus increasing the probability of the overall 
> relevance of the returned partial results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-24 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847381#comment-16847381
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

{quote}

By the way, wouldn't it be better to keep the constructors that do not take 
the discardPunctuation parameter?
(Existing Nori users would otherwise have to modify their code after upgrading.)

{quote}

Yes, we should do that, otherwise it's a breaking change and we cannot push it 
to 8.x.

{quote}

I also added Javadoc for discardPunctuation in your patch. (KoreanAnalyzer, 
KoreanTokenizerFactory)

{quote}

thanks!

{quote}

I developed KoreanNumberFilter by referring to JapaneseNumberFilter.
Please check my patch :D

{quote}

The patch looks good, but we should iterate on this in a new issue. We try to 
do one feature at a time in a single issue, so let's add discardPunctuation in 
this one and open a new issue as a follow-up to add the KoreanNumberFilter?

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch, LUCENE-8784.patch, LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option to keep the decimal point or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the 
> decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8784) Nori(Korean) tokenizer removes the decimal point.

2019-05-22 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845856#comment-16845856
 ] 

Jim Ferenczi commented on LUCENE-8784:
--

Hi [~danmuzi],
I don't think we should have one option for every punctuation type, and the 
current check in the patch based on Character.OTHER_PUNCTUATION would match 
more than just the full stop character. If we want to preserve punctuation we 
can add the same option as for Kuromoji (discardPunctuation) and output a 
token for each punctuation group. So for an input like "10.1?" we would output 
4 tokens: "10", ".", "1", "?". Then if you need to "regroup" tokens based on 
additional rules you can add another filter to do this, like the 
JapaneseNumberFilter does. The other option would be to detect numbers with 
decimal points accurately like the standard tokenizer does, but we don't want 
to reinvent the wheel either. If we want the same grouping for unknown words in 
this tokenizer we should probably implement it on top of the standard or ICU 
tokenizer directly.
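
A test-style sketch of the expected behavior (hypothetical analyzer configured 
with discardPunctuation=false):

{code:java}
// Each punctuation group becomes its own token; a later filter can regroup
// them, as the JapaneseNumberFilter does for numbers.
assertAnalyzesTo(analyzerKeepingPunctuation, "10.1?",
    new String[] {"10", ".", "1", "?"});
{code}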

>  Nori(Korean) tokenizer removes the decimal point. 
> ---
>
> Key: LUCENE-8784
> URL: https://issues.apache.org/jira/browse/LUCENE-8784
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Munkyu Im
>Priority: Major
> Attachments: LUCENE-8784.patch
>
>
> This is the same issue that I mentioned in 
> [https://github.com/elastic/elasticsearch/issues/41401#event-2293189367]
> Unlike the standard analyzer, the nori analyzer removes the decimal point.
> The nori tokenizer removes the "." character by default.
>  In this case, it is difficult to index keywords that include a decimal 
> point.
> It would be nice if there were an option to keep the decimal point or not.
> Like the Japanese tokenizer does, Nori needs an option to preserve the 
> decimal point.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-05-21 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-8770.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~jpountz]!

> BlockMaxConjunctionScorer should support two-phase scorers
> --
>
> Key: LUCENE-8770
> URL: https://issues.apache.org/jira/browse/LUCENE-8770
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-8770.patch, LUCENE-8770.patch
>
>
> The support for two-phase scorers in BlockMaxConjunctionScorer is missing. 
> This can slow down some queries that need to execute a costly second phase 
> on more documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8806) WANDScorer should support two-phase iterator

2019-05-21 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8806:


 Summary: WANDScorer should support two-phase iterator
 Key: LUCENE-8806
 URL: https://issues.apache.org/jira/browse/LUCENE-8806
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
should leverage two-phase iterators in order to be faster when used in 
conjunctions.
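
For context, the general shape of two-phase iteration looks like this 
(illustrative snippet, not from the patch):

{code:java}
// A scorer may expose a cheap approximation plus a costly matches() check.
TwoPhaseIterator twoPhase = scorer.twoPhaseIterator();
DocIdSetIterator approximation =
    twoPhase == null ? scorer.iterator() : twoPhase.approximation();
// A conjunction advances all approximations first and only runs the costly
// twoPhase.matches() check on documents where every approximation agrees.
{code}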



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-05-21 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844612#comment-16844612
 ] 

Jim Ferenczi commented on LUCENE-8770:
--

{quote}
 I wonder how useful computing the score in both the two-phase check and the 
iterator is now; can we get rid of it or would it hurt?
{quote}

I think we can. I ran the benchmark without the score check in the iterator and 
here's the result:


{noformat}
Task          QPS baseline  StdDev   QPS patch  StdDev    Pct diff
AndHighMed           58.08  (4.9%)       54.60 (11.0%)    -6.0% ( -20% -   10%)
AndHighHigh          23.08  (7.5%)       22.64 (12.1%)    -1.9% ( -19% -   19%)
AndHighLow          427.13  (4.7%)      434.58 (10.5%)     1.7% ( -12% -   17%)
{noformat}


> BlockMaxConjunctionScorer should support two-phase scorers
> --
>
> Key: LUCENE-8770
> URL: https://issues.apache.org/jira/browse/LUCENE-8770
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8770.patch
>
>
> The support for two-phase scorers in BlockMaxConjunctionScorer is missing. 
> This can slow down some queries that need to execute a costly second phase 
> on more documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-05-20 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi updated LUCENE-8770:
-
Attachment: (was: LUCENE-8770.patch)

> BlockMaxConjunctionScorer should support two-phase scorers
> --
>
> Key: LUCENE-8770
> URL: https://issues.apache.org/jira/browse/LUCENE-8770
> Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Jim Ferenczi
>Priority: Minor
>
> The support for two-phase scorers in BlockMaxConjunctionScorer is missing. 
> This can slow down some queries that need to execute a costly second phase 
> on more documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [VOTE] Release Lucene/Solr 8.1.0 RC2

2019-05-09 Thread jim ferenczi
+1
SUCCESS! [1:14:41.737009]

Le jeu. 9 mai 2019 à 18:56, Kevin Risden  a écrit :

> +1
> SUCCESS! [1:17:45.727492]
>
> Kevin Risden
>
>
> On Thu, May 9, 2019 at 11:37 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> Please vote for release candidate 2 for Lucene/Solr 8.1.0
>>
>> The artifacts can be downloaded from:
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.1.0-RC2-revdbe5ed0b2f17677ca6c904ebae919363f2d36a0a
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.1.0-RC2-revdbe5ed0b2f17677ca6c904ebae919363f2d36a0a
>>
>> Here's my +1
>> SUCCESS! [0:44:31.244021]
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


[jira] [Resolved] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-09 Thread Jim Ferenczi (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ferenczi resolved LUCENE-7840.
--
   Resolution: Fixed
Fix Version/s: 8.2
   master (9.0)

Thanks [~atris]!

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: LUCENE-7840.patch, LUCENE-7840.patch, LUCENE-7840.patch
>
>
> I haven't thought this through completely, let alone written up a patch / 
> test case, but IIUC...
> We should be able to optimize {{BooleanQuery.rewriteNoScoring()}} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-07 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834756#comment-16834756
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

Thanks [~atris], it looks good to me too. I'll commit shortly

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-7840.patch, LUCENE-7840.patch
>
>
> I haven't thought this through completely, let alone written up a patch / 
> test case, but IIUC...
> We should be able to optimize {{BooleanQuery.rewriteNoScoring()}} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-07 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834417#comment-16834417
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

Can you build the new query in a single pass? You could check the condition to 
add the SHOULD clauses before the loop with:
{code:java}
// SHOULD clauses can only be dropped when they are optional for matching:
// minShouldMatch == 0 and at least one MUST/FILTER clause is present.
boolean keepShould = getMinimumNumberShouldMatch() > 0
    || (clauseSets.get(Occur.MUST).size() + clauseSets.get(Occur.FILTER).size() == 0);
{code}
and then add the SHOULD clauses in the main loop if keepShould is true?
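
Roughly, the single-pass loop could look like this (an untested sketch):

{code:java}
BooleanQuery.Builder newQuery = new BooleanQuery.Builder();
newQuery.setMinimumNumberShouldMatch(getMinimumNumberShouldMatch());
for (BooleanClause clause : clauses()) {
  switch (clause.getOccur()) {
    case MUST:
      // scores are ignored in this rewrite, so MUST can become FILTER
      newQuery.add(clause.getQuery(), Occur.FILTER);
      break;
    case SHOULD:
      if (keepShould) {
        newQuery.add(clause);
      }
      break;
    default:
      newQuery.add(clause);
      break;
  }
}
{code}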

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-7840.patch
>
>
> I haven't thought this through completely, let alone written up a patch / 
> test case, but IIUC...
> We should be able to optimize {{BooleanQuery.rewriteNoScoring()}} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833727#comment-16833727
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

I think so yes, we don't need to build the scorer supplier for the SHOULD 
clauses so it makes sense to move the logic there.

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
>
> I haven't thought this through completely, let alone written up a patch / 
> test case, but IIUC...
> We should be able to optimize {{BooleanQuery.rewriteNoScoring()}} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch

2019-05-06 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833621#comment-16833621
 ] 

Jim Ferenczi commented on LUCENE-7840:
--

Note that the logic to remove SHOULD clauses is already implemented when we 
build the Scorer:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L391
Moving the logic to rewriteNoScoring makes sense to me but this won't optimize 
anything since the removal is already in place.


> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 
> 1 MUST/FILTER clause and 0==minShouldMatch
> ---
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Hoss Man
>Priority: Major
>
> I haven't thought this through completely, let alone written up a patch / 
> test case, but IIUC...
> We should be able to optimize {{BooleanQuery.rewriteNoScoring()}} so that 
> (after converting MUST clauses to FILTER clauses) we can check for the common 
> case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD 
> clauses as long as there is at least one FILTER clause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8772) [nori] A word that is registered in advance, but the words are not separated and recognized as 'UNKNOWN'

2019-04-19 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821810#comment-16821810
 ] 

Jim Ferenczi commented on LUCENE-8772:
--

That's expected, since the unknown word heuristic is to group characters of the 
same class together. In this case `갊수학` is considered a single word and `갊` 
is unknown, so we jump to the end of the unknown word to find new entries. You 
can add `갊` to the user dict, or a special rule `갊수학 갊 수학` that will 
decompose the terms. We could also change the heuristic to add unknown words of 
length 1 in order to be able to detect user words inside unknown blocks, but I 
wonder if the cost of doing that would be prohibitive.
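
For example, the user-dictionary workaround would look like this (a sketch; 
the rule registers the surface form followed by its decomposition):

{code:java}
UserDictionary userDict = UserDictionary.open(new StringReader("갊수학 갊 수학\n"));
Analyzer analyzer = new KoreanAnalyzer(userDict,
    KoreanTokenizer.DecompoundMode.NONE,
    KoreanPartOfSpeechStopFilter.DEFAULT_STOP_TAGS,
    false);
// "갊수학" now tokenizes as "갊" + "수학" instead of a single UNKNOWN token.
{code}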

> [nori]  A word that is registered in advance, but the words are not separated 
> and recognized as 'UNKNOWN'
> -
>
> Key: LUCENE-8772
> URL: https://issues.apache.org/jira/browse/LUCENE-8772
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.5, 7.6, 7.7, 7.7.1, 8.0
>Reporter: YOO JEONGIN
>Priority: Major
> Attachments: image-2019-04-19-11-32-56-310.png
>
>
> hello,
> In the case of 'nori', if no word starts from the left, the input is analyzed 
> as 'UNKNOWN' even if a registered word appears in the middle.
>  So here is the question.
>  Does nori analyze only from the left side and not from the right side?
>  Could this be solved?
>  
> ex)
> input => 갊수학
> Condition
> dictionary registered : 수학
>  dictionary Unregistered : 갊
> result => 갊수학
> !image-2019-04-19-11-32-56-310.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-04-18 Thread Jim Ferenczi (JIRA)
Jim Ferenczi created LUCENE-8770:


 Summary: BlockMaxConjunctionScorer should support two-phase scorers
 Key: LUCENE-8770
 URL: https://issues.apache.org/jira/browse/LUCENE-8770
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Jim Ferenczi


The support for two-phase scorers in BlockMaxConjunctionScorer is missing. This 
can slow down some queries that need to execute a costly second phase on more 
documents.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


