[ANNOUNCE] Apache Lucene 9.11.0 released

2024-06-06 Thread Benjamin Trent
The Lucene PMC is pleased to announce the release of Apache Lucene 9.11.0.

Apache Lucene is a high-performance, full-featured search engine library
written entirely in Java. It is a technology suitable for nearly any
application that requires structured search, full-text search, faceting,
nearest-neighbor search across high-dimensionality vectors, spell
correction or query suggestions.

This release contains numerous bug fixes, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:

https://lucene.apache.org/core/downloads.html

Lucene 9.11.0 Release Highlights:

New features:

 * Add support for posix_madvise to MMapDirectory: If running on
Linux/macOS and Java 21 or later, MMapDirectory uses IOContext to pass
suitable MADV flags to the kernel of the operating system. This may improve
paging logic especially when working with large indexes under memory
pressure.
 * Expand support for new scalar bit levels for HNSW vectors. This includes
4-bit vectors and an option to compress them to gain a 50% reduction in
memory usage.
 * Recursive graph bisection is now supported on indexes that have blocks
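
On the 4-bit vector item in the list above: the 50% figure for compressed
4-bit vectors follows from two 4-bit values fitting in one byte. The sketch
below is not Lucene's implementation; `pack_nibbles` and `unpack_nibbles` are
hypothetical helper names, and the layout (even index in the high nibble) is
an assumption made purely for illustration.

```python
def pack_nibbles(values):
    """Pack a sequence of 4-bit integers (0..15) two per byte."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("every value must fit in 4 bits (0..15)")
    # Pad odd-length input with a zero so values pair up cleanly.
    padded = list(values) + [0] * (len(values) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; count is the original number of values."""
    out = []
    for b in packed:
        out.append(b >> 4)     # high nibble holds the even-indexed value
        out.append(b & 0x0F)   # low nibble holds the odd-indexed value
    return out[:count]


quantized = [3, 15, 0, 7, 9, 1, 12, 4]  # one 4-bit value per dimension
packed = pack_nibbles(quantized)
print(len(quantized), "values ->", len(packed), "bytes")
```

Stored one value per byte, the same vector would take eight bytes; packed, it
takes four.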

Improvements:

 * MergeScheduler can now provide an executor for intra-merge parallelism.
The first implementation is the ConcurrentMergeScheduler.
 * Upgrade icu4j to version 74.2.

Optimizations:

 * Use RWLock to access LRUQueryCache to reduce contention.
 * Speed up multi-segment HNSW graph search for diversifying child kNN
queries.
 * Add a MemorySegment Vector scorer - for scoring without copying on-heap.
This can improve search latency by almost 2x for byte vectors.
 * Switch to using optimized, primitive collections where possible to
improve performance and heap utilization.

...And many more optimizations and bugfixes.

Please read CHANGES.txt for a full list of new features and changes:
https://lucene.apache.org/core/9_11_0/changes/Changes.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also applies to Maven access.


Re: [VOTE] Release Lucene 9.11.0 RC1

2024-06-06 Thread Benjamin Trent
It's been >72h since the vote was initiated and the result is:

+1  12  (11 binding)
 0  0
-1  0

This vote has PASSED
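
For readers unfamiliar with how such a tally is judged: as I understand ASF
release policy, a release vote needs at least three binding +1 votes and more
binding +1 than binding -1 votes. A minimal sketch of that check (the function
name `release_vote_passes` is hypothetical and not part of any Lucene tooling):

```python
def release_vote_passes(binding_plus_one, binding_minus_one):
    """Return True if the binding tally meets the ASF release-vote minimum."""
    return binding_plus_one >= 3 and binding_plus_one > binding_minus_one


# Tally from this thread: 11 binding +1 votes, no -1 votes.
print(release_vote_passes(binding_plus_one=11, binding_minus_one=0))
```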


Thanks!

Ben Trent

On Thu, Jun 6, 2024 at 12:27 AM Patrick Zhai  wrote:

> +1
>
> SUCCESS! [1:01:30.064666]
>
> On Wed, Jun 5, 2024 at 11:08 AM Houston Putman  wrote:
>
>> +1
>>
>> SUCCESS! [1:49:36.192513]
>>
>> - Houston Putman
>>
>> On Wed, Jun 5, 2024 at 12:58 PM Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> +1 SUCCESS! [0:24:55.332837]
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Wed, Jun 5, 2024 at 11:21 AM Adrien Grand  wrote:
>>>
>>>> +1 SUCCESS! [1:09:30.262027]
>>>>
>>>> On Wed, Jun 5, 2024 at 4:15 PM Tomás Fernández Löbbe <
>>>> tomasflo...@gmail.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> SUCCESS! [1:12:30.029470]
>>>>>
>>>>> On Wed, Jun 5, 2024 at 9:22 AM Bruno Roustant <
>>>>> bruno.roust...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> SUCCESS! [0:41:14.593265]
>>>>>>
>>>>>> Bruno
>>>>>>
>>>>
>>>> --
>>>> Adrien
>>>>
>>>


[VOTE] Release Lucene 9.11.0 RC1

2024-06-03 Thread Benjamin Trent
Please vote for release candidate 1 for Lucene 9.11.0

The artifacts can be downloaded from:
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.11.0-RC1-rev-d433394b292e3562e0bb34222f7dd4f307e2b8ca

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/lucene/lucene-9.11.0-RC1-rev-d433394b292e3562e0bb34222f7dd4f307e2b8ca

The vote will be open for at least 72 hours i.e. until 2024-06-06 12:00 UTC.
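
The closing time is plain date arithmetic. A small sketch, assuming the vote
is counted from 12:00 UTC on 2024-06-03 (the start time is inferred from the
stated deadline, not given in this message):

```python
from datetime import datetime, timedelta, timezone

vote_start = datetime(2024, 6, 3, 12, 0, tzinfo=timezone.utc)  # assumed start time
vote_close = vote_start + timedelta(hours=72)                  # minimum voting window
print(vote_close.isoformat())
```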

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1

Thanks!

Ben Trent


Re: Lucene 9.11

2024-05-29 Thread Benjamin Trent
Hey y'all,

As part of the release process, I have cut the 9.11 branch & bumped
versions. So, be aware when backporting bug fixes. I am still fighting with
Jenkins to get the periodic build jobs set up (I may not have the correct
permissions...).

I will be continuing the release process over the next day or so. It's my
first time, so I am swamped with reading :).

Thanks!

Ben


On Wed, May 29, 2024 at 9:04 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Thanks Ben!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, May 29, 2024 at 12:45 AM Stefan Vodita 
> wrote:
>
>> Ben, I just merged #13414 <https://github.com/apache/lucene/pull/13414>,
>> so it's not a blocker for the release.
>> Thanks again for volunteering to be release manager!
>>
>> Stefan
>>
>> On Tue, 28 May 2024 at 14:58, Benjamin Trent 
>> wrote:
>>
>>> Hey y'all,
>>>
>>> I am planning on starting the release process tomorrow (May 29).
>>>
>>> I am in the Eastern USA time zone, so I will start the process around
>>> noon UTC.
>>>
>>> I noticed one PR from Stefan. I can wait for that one if I need to.
>>>
>>> Did we figure out the hppc concerns? I saw some PR activity, wanted to
>>> make sure we are all still good with starting the release process this week.
>>>
>>> Anything else I should be aware of or wait for?
>>>
>>> Thanks!
>>>
>>> Ben Trent
>>>
>>> On Wed, May 15, 2024, 3:58 AM Chris Hegarty
>>>  wrote:
>>>
>>>> +1
>>>>
>>>> -Chris.
>>>>
>>>> > On 14 May 2024, at 16:10, Adrien Grand  wrote:
>>>> >
>>>> > +1 the 9.11 changelog looks great!
>>>> >
>>>> > On Tue, May 14, 2024 at 4:50 PM Benjamin Trent 
>>>> wrote:
>>>> > Hey y'all,
>>>> >
>>>> > Looking at changes for 9.11, we are building a significant list. I
>>>> > propose we do a release in the next couple of weeks.
>>>> >
>>>> > While this email is a little early (I am about to go on vacation for
>>>> > a bit), I volunteer myself as release manager.
>>>> >
>>>> > Unless there are objections, I plan on kicking off the release
>>>> > process May 28th.
>>>> >
>>>> > Thanks!
>>>> >
>>>> > Ben
>>>> >
>>>> >
>>>> > --
>>>> > Adrien
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>
>>>>


Re: Lucene 9.11

2024-05-28 Thread Benjamin Trent
Hey y'all,

I am planning on starting the release process tomorrow (May 29).

I am in the Eastern USA time zone, so I will start the process around noon
UTC.

I noticed one PR from Stefan. I can wait for that one if I need to.

Did we figure out the hppc concerns? I saw some PR activity, wanted to make
sure we are all still good with starting the release process this week.

Anything else I should be aware of or wait for?

Thanks!

Ben Trent

On Wed, May 15, 2024, 3:58 AM Chris Hegarty
 wrote:

> +1
>
> -Chris.
>
> > On 14 May 2024, at 16:10, Adrien Grand  wrote:
> >
> > +1 the 9.11 changelog looks great!
> >
> > On Tue, May 14, 2024 at 4:50 PM Benjamin Trent 
> wrote:
> > Hey y'all,
> >
> > Looking at changes for 9.11, we are building a significant list. I
> > propose we do a release in the next couple of weeks.
> >
> > While this email is a little early (I am about to go on vacation for a
> > bit), I volunteer myself as release manager.
> >
> > Unless there are objections, I plan on kicking off the release process
> > May 28th.
> >
> > Thanks!
> >
> > Ben
> >
> >
> > --
> > Adrien
>
>
>
>


Lucene 9.11

2024-05-14 Thread Benjamin Trent
Hey y'all,

Looking at changes for 9.11, we are building a significant list. I propose
we do a release in the next couple of weeks.

While this email is a little early (I am about to go on vacation for a
bit), I volunteer myself as release manager.

Unless there are objections, I plan on kicking off the release process May
28th.

Thanks!

Ben


Format metadata versioning vs. new named Formats

2024-04-12 Thread Benjamin Trent
Hey y'all,

I am confused about when we should supply a new format name (e.g.
Lucene911... vs. Lucene99) versus using a new metadata header version
(incrementing VERSION_CURRENT).

Are there general rules to follow?

At first glance, using a new Lucene format name prefix is functionally the
same as adjusting the metadata header version. Older versions won't be able
to read it. Newer versions will be able to read it and will be able to read
older formats (both named and via metadata versioning).
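
To make the comparison concrete, here is a toy model of the two mechanisms.
It is not Lucene code: `FORMATS`, `open_segment`, and the version numbers are
hypothetical, loosely modelled on the idea of a named format that accepts a
range of header versions.

```python
# Toy model: each named format accepts a range of header versions.
FORMATS = {
    # format name: (VERSION_START, VERSION_CURRENT)
    "Lucene99": (0, 1),   # version bumped in place: same name, wider range
    "Lucene911": (0, 0),  # brand-new name with its own version counter
}


def open_segment(format_name, header_version, known_formats=FORMATS):
    """Accept a segment only if its format name and header version are known."""
    if format_name not in known_formats:
        raise ValueError(f"unknown format: {format_name}")
    start, current = known_formats[format_name]
    if not start <= header_version <= current:
        raise ValueError(
            f"{format_name}: header version {header_version} "
            f"outside supported range [{start}, {current}]"
        )
    return format_name, header_version


# A hypothetical older reader that predates both changes.
old_reader = {"Lucene99": (0, 0)}

for name, version in [("Lucene99", 1), ("Lucene911", 0)]:
    open_segment(name, version)  # the newer reader accepts both
    try:
        open_segment(name, version, old_reader)
    except ValueError as err:
        print("old reader rejects:", err)
```

In both cases the older reader rejects the segment while the newer one
accepts it, which matches the observation above.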

Thanks!

Ben


Re: [apache/lucene] Run failed: Run nightly: buildAndPushRelease and smokeTestRelease.py - main (df154cd)

2024-04-05 Thread Benjamin Trent
Hmm, yeah. Honestly, I am not sure what to do about this either. I am going
to remove the 9.10.1 versioning from all branches but 9.10 (it's there to
capture the next bugfix).

I thought I was doing something helpful, but I guess I was a little too
eager.

On Fri, Apr 5, 2024 at 3:13 AM Dawid Weiss  wrote:

>
> Hi Ben,
>
> This fails in the smoke tester - failed last night too, so it reproduces.
>
> https://github.com/apache/lucene/actions/workflows/run-nightly-smoketester.yml
>
> I looked it up to getAllLuceneReleases in the smoke tester script, which
> in turn lists all releases available at:
> https://archive.apache.org/dist/lucene/java/
>
> version 9.10.1 isn't there so it complains with:
> RuntimeError: tested version=9.10.1 but it was not released?
>
> I'm not sure how to deal with yet-unreleased minor versions myself!
>
> D.
>
> On Thu, Apr 4, 2024 at 4:20 PM Benjamin Trent 
> wrote:
>
>> This seems related to us forgetting to make the back-compat indices &
>> versions when 9.10.1 was released and me adding them later.
>>
>> I have since added 9.10.1 to Version.java and version.txt in main and
>> 9x. Now, both main and 9x have the back-compat indices (these changes were
>> not made at the same time, and caused a separate build failure noticed by
>> Mike M.).
>>
>> But the failing commit on main, df154cdc2288a33747edb8849509ea5c3cbf792e,
>> contains both of my changes (the 9.10.1 version & back compat indices).
>>
>> Was there something else forgotten during this bugfix release that we
>> need to address? I am very new to the back-compat and release logic in
>> Lucene and I am eager to learn.
>>
>> On Thu, Apr 4, 2024 at 9:43 AM Dawid Weiss  wrote:
>>
>>>
>>> https://github.com/apache/lucene/actions/runs/8548297347/job/23421799032
>>>
>>> This smoketester run failed with:
>>>
>>> > RuntimeError: tested version=9.10.1 but it was not released?
>>>
>>> I guess it's not a hiccup but something recent?
>>>
>>> On Thu, Apr 4, 2024 at 3:24 AM Dawid Weiss 
>>> wrote:
>>>
>>>>
>>>> [apache/lucene] Run nightly: buildAndPushRelease and
>>>> smokeTestRelease.py workflow run
>>>>
>>>>   Run nightly: buildAndPushRelease and smokeTestRelease.py: Some jobs
>>>> were not successful
>>>>
>>>> View workflow run
>>>> <https://github.com/apache/lucene/actions/runs/8548297347>
>>>>
>>>> *Run nightly: buildAndPushRelease and smokeTestRelease.py* / Smoke
>>>> test release on jdk 21, ubuntu-latest
>>>> Failed in 9 minutes and 22 seconds
>>>>
>>>> *Run nightly: buildAndPushRelease and smokeTestRelease.py* / Smoke
>>>> test release on jdk 22-ea, ubuntu-latest
>>>> Cancelled
>>>>
>>>
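
The check Dawid describes above can be sketched roughly as follows. This is
not the actual smokeTestRelease.py code: `check_tested_versions` and both
sample version lists are made up for illustration, and only the error message
is taken from the log quoted in this thread.

```python
def check_tested_versions(tested_versions, released_versions):
    """Raise if a back-compat test references a version that was never released."""
    released = set(released_versions)
    for version in tested_versions:
        if version not in released:
            raise RuntimeError(
                f"tested version={version} but it was not released?"
            )


released = ["9.9.2", "9.10.0"]          # hypothetical listing of the release archive
tested = ["9.9.2", "9.10.0", "9.10.1"]  # hypothetical versions covered by the tests

try:
    check_tested_versions(tested, released)
except RuntimeError as err:
    print(err)
```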


Re: [apache/lucene] Run failed: Run nightly: buildAndPushRelease and smokeTestRelease.py - main (df154cd)

2024-04-04 Thread Benjamin Trent
This seems related to us forgetting to make the back-compat indices &
versions when 9.10.1 was released and me adding them later.

I have since added 9.10.1 to Version.java and version.txt in main and
9x. Now, both main and 9x have the back-compat indices (these changes were
not made at the same time, and caused a separate build failure noticed by
Mike M.).

But the failing commit on main, df154cdc2288a33747edb8849509ea5c3cbf792e,
contains both of my changes (the 9.10.1 version & back compat indices).

Was there something else forgotten during this bugfix release that we need
to address? I am very new to the back-compat and release logic in Lucene
and I am eager to learn.

On Thu, Apr 4, 2024 at 9:43 AM Dawid Weiss  wrote:

>
> https://github.com/apache/lucene/actions/runs/8548297347/job/23421799032
>
> This smoketester run failed with:
>
> > RuntimeError: tested version=9.10.1 but it was not released?
>
> I guess it's not a hiccup but something recent?
>
> On Thu, Apr 4, 2024 at 3:24 AM Dawid Weiss 
> wrote:
>
>>
>> [apache/lucene] Run nightly: buildAndPushRelease and
>> smokeTestRelease.py workflow run
>>
>>   Run nightly: buildAndPushRelease and smokeTestRelease.py: Some jobs
>> were not successful
>>
>> *Run nightly: buildAndPushRelease and smokeTestRelease.py* / Smoke test
>> release on jdk 21, ubuntu-latest
>> Failed in 9 minutes and 22 seconds
>>
>> *Run nightly: buildAndPushRelease and smokeTestRelease.py* / Smoke test
>> release on jdk 22-ea, ubuntu-latest
>> Cancelled
>>
>


Re: [JENKINS] Lucene » Lucene-NightlyTests-main - Build # 1315 - Still Unstable!

2024-04-02 Thread Benjamin Trent
This is me. We missed the 9.10.1 version in the 9x branch and the main
branch. So, I added it. But, obviously, I didn't think about generating all
the bwc indices that we didn't generate when that release was pushed.

We can remove it; I would just need to adjust some new BWC tests I added
that were built on the 9.10.1 indices to build them on 9.10.0.
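
The missing piece can be pictured with a small sketch. The pattern string and
the error message come from the failure quoted below; `find_index`,
`available`, and the listed file name are hypothetical, and Java's `%1$s`
placeholder is swapped for Python's `{}`.

```python
PATTERN = "unsupported.{}-cfs.zip"  # Python stand-in for "unsupported.%1$s-cfs.zip"


def find_index(version, available):
    """Return the back-compat index file for a version, or fail like the test did."""
    name = PATTERN.format(version)
    if name not in available:
        raise AssertionError(f"Index name {version} not found: {name}")
    return name


# Hypothetical resource listing: no index was generated for 9.10.1.
available = {"unsupported.9.10.0-cfs.zip"}

print(find_index("9.10.0", available))
try:
    find_index("9.10.1", available)
except AssertionError as err:
    print(err)
```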

On Tue, Apr 2, 2024 at 11:11 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm this failure looks not great.
>
> I tried the "Reproduce with:" for one of the failures (see below) but it
> fails to run any tests at all?  Maybe because of the cool parameterized
> testing we now have for our back compat tests?  If I remove the "{...}"
> pattern then the failures do repro.
>
> ./gradlew :lucene:backward-codecs:test --tests
> "org.apache.lucene.backward_index.TestBinaryBackwardsCompatibility.testSearchOldIndex
> {Lucene-Version:9.10.1; Pattern: unsupported.%1$s-cfs.zip}" -Ptests.jvms=4
> -Ptests.jvmargs= -Ptests.seed=AED171B21972F50D
> -Ptests.multiplier=2 -Ptests.nightly=true -Ptests.gui=true
> -Ptests.file.encoding=ISO-8859-1
> -Ptests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-NightlyTests-main/test-data/enwiki.random.lines.txt
> -Ptests.vectorsize=256
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 2, 2024 at 4:52 AM Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> Build:
>> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/1315/
>>
>> 6 tests failed.
>> FAILED:
>> org.apache.lucene.backward_index.TestBinaryBackwardsCompatibility.testSearchOldIndex
>> {Lucene-Version:9.10.1; Pattern: unsupported.%1$s-cfs.zip}
>>
>> Error Message:
>> java.lang.AssertionError: Index name 9.10.1 not found:
>> unsupported.9.10.1-cfs.zip
>>
>> Stack Trace:
>> java.lang.AssertionError: Index name 9.10.1 not found:
>> unsupported.9.10.1-cfs.zip
>> at __randomizedtesting.SeedInfo.seed([AED171B21972F50D:E4679B8937FD59F]:0)
>> at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
>> at junit@4.13.1/org.junit.Assert.assertTrue(Assert.java:42)
>> at junit@4.13.1/org.junit.Assert.assertNotNull(Assert.java:713)
>> at org.apache.lucene.backward_index.BackwardsCompatibilityTestBase.setUp(BackwardsCompatibilityTestBase.java:145)
>> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
>> at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
>> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
>> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>> at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
>> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)

Re: [Vote] Bump the Lucene main branch to Java 21

2024-02-23 Thread Benjamin Trent
+1

On Fri, Feb 23, 2024 at 8:54 AM Adrien Grand  wrote:

> +1
>
> On Fri, Feb 23, 2024 at 12:54 PM Uwe Schindler  wrote:
> >
> > Here is my +1
> >
> > Uwe
> >
> > Am 23.02.2024 um 12:24 schrieb Chris Hegarty:
> > > Hi,
> > >
> > > > Since the discussion on bumping the Lucene main branch to Java 21 is
> > > > winding down, let's hold a vote on this important change.
> > > >
> > > > Once bumped, the next major release of Lucene (whenever that will be)
> > > > will require a version of Java greater than or equal to Java 21.
> > > >
> > > > The vote will be open for at least 72 hours (and allow some additional
> > > > time for the weekend) i.e. until 2024-02-28 12:00 UTC.
> > >
> > > [ ] +1  approve
> > > [ ] +0  no opinion
> > > [ ] -1  disapprove (and reason why)
> > >
> > > Here is my +1
> > >
> > > -Chris.
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> >
>
>
> --
> Adrien
>
>
>


Re: [VOTE] Release Lucene 9.9.2 RC1

2024-01-25 Thread Benjamin Trent
+1

SUCCESS! [0:47:01.998711]

And I verified via a local monster test that this bug is fixed:
https://github.com/apache/lucene/pull/13027

I need to contribute back the monster integration test to fully exercise
that code path.

Thanks Chris!

On Thu, Jan 25, 2024 at 11:01 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> +1
>
> SUCCESS! [0:18:29.298410]
>
> Thank you Chris!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Jan 25, 2024 at 6:57 AM Chris Hegarty
>  wrote:
>
>> Please vote for release candidate 1 for Lucene 9.9.2
>>
>> The artifacts can be downloaded from:
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.2-RC1-rev-a2939784c4ca60bc28bf488b5479c02fc2e5e22c
>>
>> You can run the smoke tester directly with this command:
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.2-RC1-rev-a2939784c4ca60bc28bf488b5479c02fc2e5e22c
>>
>> The vote will be open for 96 hours (allowing some additional time for
>> the weekend) i.e. until 2024-01-29 12:00 UTC.
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1
>>
>> Draft release notes can be found at
>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_2
>>
>> -Chris.
>>
>>
>>
>>


Re: [VOTE] Release Lucene 9.9.1 RC1

2023-12-13 Thread Benjamin Trent
SUCCESS! [1:06:02.232333]

+1!

On Wed, Dec 13, 2023 at 3:26 PM Greg Miller  wrote:

> SUCCESS! [2:27:01.875939]
>
> +1
>
> Thanks!
> -Greg
>
> On Wed, Dec 13, 2023 at 3:58 AM Chris Hegarty
>  wrote:
>
>> And (short) release note:
>>
>>   https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_1
>>
>> -Chris.
>>
>> > On 13 Dec 2023, at 11:55, Chris Hegarty 
>> wrote:
>> >
>> > Hi,
>> >
>> > Please vote for release candidate 1 for Lucene 9.9.1
>> >
>> > The artifacts can be downloaded from:
>> >
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.1-RC1-rev-eee32cbf5e072a8c9d459c349549094230038308
>> >
>> > You can run the smoke tester directly with this command:
>> >
>> > python3 -u dev-tools/scripts/smokeTestRelease.py \
>> >
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.1-RC1-rev-eee32cbf5e072a8c9d459c349549094230038308
>> >
>> > The vote will be open for at least 72 hours i.e. until 2023-12-16 12:00
>> UTC.
>> >
>> > [ ] +1  approve
>> > [ ] +0  no opinion
>> > [ ] -1  disapprove (and reason why)
>> >
>> > Here is my +1
>> >
>> > -Chris.
>> >
>>
>>
>>
>>


Re: [VOTE] Release Lucene 9.9.0 RC2

2023-11-30 Thread Benjamin Trent
SUCCESS! [0:44:05.132154]

+1

On Thu, Nov 30, 2023 at 1:09 PM Chris Hegarty
 wrote:

> Please vote for release candidate 2 for Lucene 9.9.0
>
>
> The artifacts can be downloaded from:
>
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC2-rev-06070c0dceba07f0d33104192d9ac98ca16fc500
>
>
> You can run the smoke tester directly with this command:
>
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
>
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC2-rev-06070c0dceba07f0d33104192d9ac98ca16fc500
>
>
> The vote will be open for at least 72 hours, and given the weekend in
> between, let’s keep it open until 2023-12-04 12:00 UTC.
>
> [ ] +1  approve
>
> [ ] +0  no opinion
>
> [ ] -1  disapprove (and reason why)
>
>
> Here is my +1
>
>
> -Chris.
>
>


Re: [VOTE] Release Lucene 9.9.0 RC1

2023-11-30 Thread Benjamin Trent
SUCCESS! [0:47:11.013106]

+1

On Thu, Nov 30, 2023 at 7:16 AM Ignacio Vera  wrote:

> SUCCESS! [0:52:59.891964]
>
>
> +1
>
> On Thu, Nov 30, 2023 at 12:42 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> +1 to release.
>>
>> I hit a corner-case test failure and opened a PR to fix it:
>> https://github.com/apache/lucene/pull/12859
>>
>> I don't think this should block the release? -- it looks exotic.
>>
>> Thanks Chris!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Nov 30, 2023 at 1:16 AM Patrick Zhai  wrote:
>>
>>> SUCCESS! [1:03:54.880200]
>>>
>>> +1. Thank you Chris!
>>>
>>> On Wed, Nov 29, 2023 at 8:45 PM Nhat Nguyen
>>>  wrote:
>>>
>>>> SUCCESS! [1:11:30.037919]
>>>>
>>>> +1. Thanks, Chris!
>>>>
>>>> On Wed, Nov 29, 2023 at 8:53 AM Chris Hegarty
>>>>  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Please vote for release candidate 1 for Lucene 9.9.0
>>>>>
>>>>> The artifacts can be downloaded from:
>>>>>
>>>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC1-rev-92a5e5b02e0e083126c4122f2b7a02426c21a037
>>>>>
>>>>> You can run the smoke tester directly with this command:
>>>>>
>>>>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-9.9.0-RC1-rev-92a5e5b02e0e083126c4122f2b7a02426c21a037
>>>>>
>>>>> The vote will be open for at least 72 hours, and given the weekend in
>>>>> between, let’s keep it open until 2023-12-04 12:00 UTC.
>>>>>
>>>>> [ ] +1  approve
>>>>> [ ] +0  no opinion
>>>>> [ ] -1  disapprove (and reason why)
>>>>>
>>>>> Here is my +1
>>>>>
>>>>> Draft release highlights can be viewed here (comments and feedback
>>>>> welcome):
>>>>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote9_9_0
>>>>>
>>>>> -Chris.
>>>>>



Re: Lucene 9.9.0 Release

2023-11-21 Thread Benjamin Trent
+1 9.9 will be a stellar release!

Thank you Chris!

On Tue, Nov 21, 2023 at 7:31 AM Adrien Grand  wrote:

> +1 9.9 has plenty of great changes indeed! Thanks for volunteering as a
> RM, Chris.
>
> It would be good to try and fix the PKLookup regression that was
> introduced since 9.8:
> http://people.apache.org/~mikemccand/lucenebench/PKLookup.html. Is it
> just about getting #12699 
> merged?
>
> Separately, I have a PR that does a small change to the file format of
> postings and skip lists. It's certainly not a blocker for 9.9, but it would
> be convenient to get it into 9.9 since we already changed file formats for
> the switch from PFOR to FOR. Does someone have time to take a look? #12810
> 
>
> On Tue, Nov 21, 2023 at 11:16 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> +1
>>
>> Thank you for volunteering as RM, Chris!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Nov 21, 2023 at 4:52 AM Chris Hegarty
>>  wrote:
>>
>>> Hi,
>>>
>>> It's been a while since the 9.8.0 release and we’ve accumulated quite a
>>> few changes. I’d like to propose that we release 9.9.0.
>>>
>>> If there's no objections, I volunteer to be the release manager and will
>>> cut the feature branch a week from now, 12:00 28th Nov UTC.
>>>
>>> -Chris.
>>>
>>>
>
> --
> Adrien
>


Re: Quantization for vector search

2023-11-04 Thread Benjamin Trent
Hey Michael,

In short, it's being worked on :).

Could you point to the LinkedIn post? Is Nils talking about the model
producing quantized output, or about their default output being easily
compressible because of how the embeddings are built?

I have done a bad job of linking the work in progress back to that
original issue:

The initial implementation of adding int8 (really, it's int7 because of
signed bytes...): https://github.com/apache/lucene/pull/12582

A significant refactor to make adding new quantized storage easier:
https://github.com/apache/lucene/pull/12729

Lucene already supports folks just giving it signed `byte[]` values. But
this only gets us so far. The additional work should get Lucene further down
the road towards better lossy compression for vectors.
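
For context, here is a bare-bones sketch of scalar quantization into the
7-bit range mentioned above. It is a generic min/max linear quantizer written
for illustration, not the code from the linked PRs; `quantize` and
`dequantize` are hypothetical names, and the fixed `lo`/`hi` bounds are an
assumption.

```python
def quantize(vector, lo, hi, bits=7):
    """Map floats in [lo, hi] linearly onto integers in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1  # 127 for 7 bits, so every code fits a signed byte
    scale = levels / (hi - lo)
    # Clamp each component into [lo, hi] before scaling and rounding.
    return [round((min(max(x, lo), hi) - lo) * scale) for x in vector]


def dequantize(codes, lo, hi, bits=7):
    """Approximate inverse of quantize; the rounding error is the lossy part."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]


vec = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes = quantize(vec, lo=-1.0, hi=1.0)
approx = dequantize(codes, lo=-1.0, hi=1.0)
print(codes)
print(max(abs(a - b) for a, b in zip(vec, approx)))
```

Each float (4 bytes) becomes a single byte here, at the cost of an error of at
most half a quantization step per component.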

Thanks!

Ben

On Sat, Nov 4, 2023 at 4:07 AM Michael Wechner 
wrote:

> Hi
>
> If I understand correctly some devs are working on introducing
> quantization for vector search or at least considering it
>
> https://github.com/apache/lucene/issues/12497
>
> I'm just curious: what is the status of this, and is somebody actively
> working on it?
>
>
> It came to my mind, because Cohere recently made their new embedding model
> "Embed v3" available
>
> https://txt.cohere.com/introducing-embed-v3/
>
> IIUC, Cohere also intends to provide embeddings optimized for
> compression soon.
>
> Nils Reimers recently wrote on LinkedIn:
>
> 
> "... what we see on the BioASQ dataset:
> 4x - 99.99% search quality
> 16x - 99.9% search quality
> 32x - 95% search quality
> 64x - 85% search quality
> But it requires that the respective vector DB supports these modes, what
> we currently work on with partners."
> 
>
> This might be interesting for Lucene as well, though I am not sure whether
> somebody at Lucene is already working on something like this.
>
> Thanks
>
> Michael
>


Re: Squash vs merge of PRs

2023-11-04 Thread Benjamin Trent
TL;DR, forcing non-committers to squash things is a good idea. Enforcing
through some measure for committers is a bad idea.

Since this thread is now in Robert's spam, I am guessing it won't have any
impact :). I do not think Robert is actively trying to hurt the project in any
way. It seems to me that he doesn't think a clean git history is worth the
effort.

Having a clean git history makes things easier for everyone. Comparing
histories between branches with git-bisect to find bugs is just one
example. Another is simply reading commits to see when
features/bug fixes/etc. were added.

I do NOT think we should add procedures or branch protections to actively
enforce this.

Small personal sacrifices (like dealing with commit conflicts) are
necessary for a community. Being part of a community is about buying into
what the community is about and working towards a common goal. Many times
we do things we don't agree with, or make things slightly more difficult
for us, for the community as a whole. This thing being OSS shows that we
all buy into its importance and are willing to put work into the project.

Having a cultural default of "make things nice for others" is good.
Enforcing this ideology on others is the antithesis of its definition.



On Sat, Nov 4, 2023 at 9:02 AM Robert Muir  wrote:

> This isn't a community issue, it is me avoiding useless unnecessary
> merge conflicts. Word "community" is invoked here to try to make it
> out, like you can hold a vote about what git commands i should type on
> my computer? You know that isn't gonna work. have some humility.
>
> thread moved to spam.
>
> On Sat, Nov 4, 2023 at 8:36 AM Mike Drob  wrote:
> >
> > We all agree on using Java though, and using a specific version, and
> > even the style output from gradle tidy. Is that nanny state or community
> > consensus?
> >
> > On Sat, Nov 4, 2023 at 7:29 AM Robert Muir  wrote:
> >>
> >> example of a nanny state IMO, trying to dictate what git commands to
> >> use, or what editor to use. Maybe this works for you in your corporate
> >> hellholes, but I think some folks have a bit of a power issue, are
> >> accustomed to dictating this stuff to their employees and so on, but
> >> this is open-source. I don't report to you, i dont use the editor you
> >> tell me, or the git commands you tell me.
> >>
> >> On Sat, Nov 4, 2023 at 8:21 AM Uwe Schindler  wrote:
> >> >
> >> > Hi,
> >> >
> >> > I just wanted to give your attention to the following discussion:
> >> > https://github.com/apache/lucene/pull/12737#issuecomment-1793426911
> >> >
> >> >  From my knowledge the Lucene (and Solr) community decided a while back
> >> > to disable merging and only allow squashing of PRs. Robert always did
> >> > this, but because of a one-time problem with two branches he was working
> >> > on in parallel, he suddenly changed his mind and did merges on his own,
> >> > not squashing the branch and pushing to ASF Git.
> >> >
> >> > I am also not a fan of removing all history, but especially for heavy
> >> > committing branches like the given PR, I think we should invite our
> >> > committers to also adhere to community standards everyone else
> >> > practices. I would agree with merging those branches if all commit
> >> > messages in the branch would be well-formed with issue ID or PR
> number,
> >> > but in the above case you get a history of random commits which is no
> >> > longer linear and not easy readable.
> >> >
> >> > What do others think?
> >> >
> >> > Uwe
> >> >
> >> > --
> >> > Uwe Schindler
> >> > Achterdiek 19, D-28357 Bremen
> >> > https://www.thetaphi.de
> >> > eMail: u...@thetaphi.de
> >> >
> >> >
> >> > -
> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >
> >>
> >>
>
>
>


Re: Weird HNSW merge performance result

2023-10-11 Thread Benjamin Trent
Heya Patrick,

What version of Lucene Util are you using? There was a bug where
`forceMerge` was not actually using your configured maxConn & beamWidth.
See: https://github.com/mikemccand/luceneutil/pull/232

Do you have that commit, and have you rebuilt the KnnGraphTester?

On Wed, Oct 11, 2023 at 10:10 AM Patrick Zhai  wrote:

> Hi Adrien,
> I'm using the default CMS, but I doubt whether the merge will be triggered
> at all in the background. Since no merge policy is changed the default TMP
> will likely only merge the segments after they reach 10 I believe? But the
> index is about 300M and the buffer size is around 50M so I don't think we
> will have enough segments to trigger the merge when I'm building the index?
>
> On Wed, Oct 11, 2023, 02:47 Adrien Grand  wrote:
>
>> Regarding building time, did you configure a SerialMergeScheduler?
>> Otherwise merges run in separate threads, which would explain the speedup
>> as adding vectors to the graph gets more and more expensive as the size of
>> the graph increases.
>>
>> Le mer. 11 oct. 2023, 05:07, Patrick Zhai  a écrit :
>>
>>> Hi folks,
>>> I was running the HNSW benchmark today and found some weird results.
>>> Want to share it here and see whether people have any ideas.
>>>
>>> The set up is:
>>> the 384 dimension vector that's available in luceneutil, 100k documents.
>>> And lucene main branch.
>>> max_conn=64, fanout=0, beam_width=250
>>>
>>> I first tried with the default setting where we use a 1994MB writer
>>> buffer, so with 100k documents, there will be no merge happening and I will
>>> have 1 segment at the end.
>>> This gives me 0.755 recall and 101113ms index building time.
>>>
>>> Then I tried with 50MB writer buffer and then forcemerge at the last,
>>> and with 100k documents, I'll get several segments (the final index is
>>> around 300MB so I guess 5 or 6) before merge, and then merge them into 1 at
>>> last.
>>> This gives me 0.692 recall but it took only 81562ms (including 34394ms
>>> doing the merge) to index.
>>> I have also tried disabling the initialize from graph feature (such that
>>> when we merge we always rebuild the whole graph), or change the random
>>> seed, but still get the similar result.
>>>
>>> I'm wondering:
>>> 1. Why recall drops that much in the later setup?
>>> 2. Why index time is way better? I think we still need to rebuild the
>>> whole graph, or maybe it's just because we're using more off-heap memory
>>> (and less heap) when merge (do we?)?
>>>
>>> Best
>>> Patrick
>>>
>>


Re: Disconnectedness in HNSW graphs in Lucene

2023-08-24 Thread Benjamin Trent
> Can I create a github issue for this and continue updating there?

I think that would be great. If y'all are suffering from this and
discovered an issue, others are unknowingly having the same issue. Having
tools to discover it and fix it will make Lucene better. Hopefully we find
a bug and can address it!

One thing I know we do is prune connections to ensure we have at most
`maxConn` (`m` in the paper) connections, even after diverse addition of
backlinks and neighbors. The paper hints at a "keep pruned connections"
option, but I don't think any HNSW implementation actually has that option.
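
As a rough illustration of why pruning matters for reachability, here is a
minimal, self-contained sketch. It assumes a single level stored as a plain
adjacency array (`neighbors[i]` holds the outgoing links of node i) rather
than Lucene's `HnswGraph`, and the class and method names are made up for the
example. If the link from node 0 to node 3 is pruned while node 3 keeps its
link to node 0, node 3 can no longer be reached from the entry point:

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

/** Hypothetical helper, not a Lucene API: counts nodes reachable on one level. */
final class LevelReachability {

  /**
   * @param neighbors neighbors[i] holds the outgoing links of node i on a single level
   * @param entryPoint node the traversal starts from
   * @return number of nodes reachable from entryPoint, including entryPoint itself
   */
  static int countReachable(int[][] neighbors, int entryPoint) {
    BitSet visited = new BitSet(neighbors.length);
    Deque<Integer> stack = new ArrayDeque<>();
    visited.set(entryPoint);
    stack.push(entryPoint);
    while (!stack.isEmpty()) {
      int node = stack.pop();
      for (int friend : neighbors[node]) {
        // Mark on push so each node enters the stack at most once
        if (!visited.get(friend)) {
          visited.set(friend);
          stack.push(friend);
        }
      }
    }
    return visited.cardinality();
  }

  public static void main(String[] args) {
    // Nodes 0-2 link to each other; node 3 links to 0, but nothing links back to 3
    int[][] level = { {1, 2}, {0, 2}, {0, 1}, {0} };
    System.out.println(countReachable(level, 0) + " of " + level.length + " nodes reachable");
  }
}
```

Running `main` prints `3 of 4 nodes reachable`. Starting from node 3 instead
reaches all four nodes, which is why a check from the entry point alone can
look worse than the connectivity of the level as a whole.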


Another experiment would be increasing y'alls `maxConn` (default of `16`) &
`beamWidth` (default of `100`). Since this is a production system, you may
want a fix sooner rather than later. Increasing these might alleviate some
of the disconnectedness, but it would increase indexing cost.

I realize such experiments might be expensive in light of y'alls search &
indexing load.

Thanks!

Ben Trent

On Thu, Aug 24, 2023 at 4:03 AM Nitiraj Rathore  wrote:

> Thanks Benjamin for the reply and confirming that connected can be issue.(
> btw I am same Nitiraj, just using my apache.org email id from now on to
> communicate).
>
> I will do some more experiment to reproduce the issue and see the
> connectedness across the graph and not just with the Entry point. But since
> these are our production indexes which receive fast updates and re-merges
> happens frequently I am not sure if it will be easy to reproduce. In that
> case I will just work with some particular indexes that show this behaviour.
>
> I will also try to see if `hnswlib` shows such behaviour and if not what
> measures have been taken to ensure connectedness. Although, the paper does
> talk that heuristic helps in keeping the clustered graph connected (excerpt
> below), but it seems some improvement might be required. I will check the
> implementations in the `addDiverseNeighbors()`, `findWorstNonDiverse()` and
> `selectAndLinkDiverse()` method of lucene code and come up with something.
> At the same time may be just not removing some connections from the graph
> that can result in disconnectedness can help, but such a check can be very
> expensive.
>
> Can I create a github issue for this and continue updating there?
>
> -- Excerpt from Paper : https://arxiv.org/pdf/1603.09320.pdf
> ```
> The relative neighborhood graph allows easily keeping the global connected
> component, even in case of highly clustered data (see Fig. 2 for
> illustration). Note that the heuristic creates extra edges compared to the
> exact relative neighborhood graphs, allowing controlling the number of the
> connections which is important for search performance.
> ```
>
> On 2023/08/23 16:07:55 Benjamin Trent wrote:
> > Nitiraj,
> >
> > Good experimentation! Connectedness within layers is indeed important.
> The
> > algorithm itself should ensure connectedness of disjoint NSWs as it
> > mutually connects nodes (selected over diversity).
> >
> > However, if the data is extremely clustered, this can cause connectedness
> > to drop (few densely packed clusters may not connect to other densely
> > packed clusters).
> >
> > For your extreme examples, is the data densely clustered?
> >
> > What would you suggest as an improvement in Lucene regarding the
> algorithm
> > implementation?
> >
> > An interesting experiment would be to see if `hnswlib` has the same
> > connected issues if it indexes the same vectors in the same order.
> >
> > Thanks!
> >
> > Ben Trent
> >
> >
> >
> > On Wed, Aug 23, 2023 at 5:07 AM Nitiraj Singh Rathore <
> > nitiraj.rath...@gmail.com> wrote:
> >
> > > Hi Lucene developers,
> > >
> > > I work for Amazon Retail Product search and we are using Lucene KNN for
> > > semantic search of products. We index product embeddings (vectors) into
> > > lucene (hnsw graph) and search them by generating query embedding at
> > > runtime. The product embeddings also receive regular updates and the
> index
> > > geometry keeps changing because of merges.
> > > We recently noticed that the hnsw graphs generated are not always
> strongly
> > > connected and in worst case scenario some products may be
> undiscoverable.
> > > Connectedness of Hierarchical graph can be complicated, so below I am
> > > mentioning my experiment details.
> > >
> > > - Experiment:
> > > I took the Lucene indexes from our production servers and for each
> segment
> > > (hnsw graph) I did following test.
> > > At each level graph I took the same entry point, the

Re: Disconnectedness in HNSW graphs in Lucene

2023-08-23 Thread Benjamin Trent
Nitiraj,

Good experimentation! Connectedness within layers is indeed important. The
algorithm itself should ensure connectedness of disjoint NSWs as it
mutually connects nodes (selected over diversity).

However, if the data is extremely clustered, this can cause connectedness
to drop (few densely packed clusters may not connect to other densely
packed clusters).

For your extreme examples, is the data densely clustered?

What would you suggest as an improvement in Lucene regarding the algorithm
implementation?

An interesting experiment would be to see if `hnswlib` has the same
connected issues if it indexes the same vectors in the same order.

Thanks!

Ben Trent



On Wed, Aug 23, 2023 at 5:07 AM Nitiraj Singh Rathore <
nitiraj.rath...@gmail.com> wrote:

> Hi Lucene developers,
>
> I work for Amazon Retail Product search and we are using Lucene KNN for
> semantic search of products. We index product embeddings (vectors) into
> lucene (hnsw graph) and search them by generating query embedding at
> runtime. The product embeddings also receive regular updates and the index
> geometry keeps changing because of merges.
> We recently noticed that the hnsw graphs generated are not always strongly
> connected and in worst case scenario some products may be undiscoverable.
> Connectedness of Hierarchical graph can be complicated, so below I am
> mentioning my experiment details.
>
> - Experiment:
> I took the Lucene indexes from our production servers and for each segment
> (hnsw graph) I did following test.
> At each level graph I took the same entry point, the entry point of HNSW
> graph, checked how many nodes are reachable from this entrypoint. Note that
> connectedness at each level was checked independently of other levels.
> Sample code attached. My observations are as below.
>
> - Observation :
> 1. Of all the graphs across all the segments, across 100s of indexes
> that I considered, one graph for each "level" of HNSW, almost 18% of the
> graphs had some disconnectedness.
> 2. Disconnectedness was observed at all the levels of HNSW graphs. We have
> at most 3 levels in HNSW graphs.
> 3. percentage disconnectedness ranged from small fractions 0.000386% (1
> disconnected out of 259342)  to 3.7% (eg. 87 disconnected out of 2308).
> In some extreme case the entry-point in zeroth level graph was
> disconnected from rest of the graph making the %age disconnected as high as
> 99.9% (65 reachable nodes from EP out of 252275). But this does not necessarily
> mean that the 99.9% of nodes were not discoverable, it just means that if
> unluckily we end up on EP in the 0th level graph for a query, there can at
> max be 65 nodes that can be reached. But had we diverted our path from EP
> to some other node in the upper level graphs then may be more nodes be
> discoverable via that node.
>
> - What I Not Checked :
> What I have not checked till now is the connectedness for the whole HNSW
> graph including edges of all the levels.
> Also, I have not checked the number of disconnected components in a graph.
> I have just checked the number of connected nodes to the entry-point.
>
> But irrespective of that, I think graphs should be strongly connected at
> each level and disconnectedness if at all should be very very rare.
>
> Thanks Kaival Parikh for discovering the issue in the first place and the
> script for checking connectedness.
>
> What do others think about this?
>
> public class CheckHNSWConnectedness {
>   private static int getReachableNodes(HnswGraph graph, int level) throws IOException {
>     Set<Integer> visited = new HashSet<>();
>     Stack<Integer> candidates = new Stack<>();
>     candidates.push(graph.entryNode());
>
>     while (!candidates.isEmpty()) {
>       int node = candidates.pop();
>
>       if (visited.contains(node)) {
>         continue;
>       }
>
>       visited.add(node);
>       graph.seek(level, node);
>
>       int friendOrd;
>       while ((friendOrd = graph.nextNeighbor()) != NO_MORE_DOCS) {
>         candidates.push(friendOrd);
>       }
>     }
>     return visited.size();
>   }
>
>   public static void checkConnected(String index, String hnswField)
>       throws IOException, NoSuchFieldException, IllegalAccessException {
>     try (FSDirectory dir = FSDirectory.open(Paths.get(index));
>         IndexReader indexReader = DirectoryReader.open(dir)) {
>       for (LeafReaderContext ctx : indexReader.leaves()) {
>         KnnVectorsReader reader =
>             ((PerFieldKnnVectorsFormat.FieldsReader)
>                     ((SegmentReader) ctx.reader()).getVectorReader())
>                 .getFieldReader(hnswField);
>
>         if (reader != null) {
>           HnswGraph graph = ((Lucene95HnswVectorsReader) reader).getGraph(hnswField);
>           for (int l = 0; l < graph.numLevels(); l++) {
>             int reachableNodes = getReachableNodes(graph, l);
> // 

Re: [VOTE] Release PyLucene 9.7.0-rc1

2023-07-07 Thread Benjamin Trent
+1

I tested getting ann-benchmarks updated and it worked just fine. Was also
able to build locally and run some tests (non-exhaustive) on my M1 macbook.

Hope everyone else has the same success!

On Thu, Jul 6, 2023 at 3:47 AM Andi Vajda  wrote:

>
> The PyLucene 9.7.0 (rc1) release tracking the recent release of
> Apache Lucene 9.7.0 is ready.
>
> A release candidate is available from:
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/9.7.0-rc1/
>
> PyLucene 9.7.0 is built with JCC 3.13, included in these release artifacts.
>
> JCC 3.13 supports Python 3.3 up to Python 3.11.
> PyLucene may also be built with Python 2 but this configuration is no
> longer
> tested.
>
> Please vote to release these artifacts as PyLucene 9.7.0.
> Anyone interested in this release can and should vote !
>
> Thanks !
>
> Andi..
>
> ps: the KEYS file for PyLucene release signing is at:
> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>
> pps: here is my +1
>


New release for PyLucene?

2023-07-05 Thread Benjamin Trent
Lucene 9.7 was just released and contains multiple desirable improvements:
https://lucene.apache.org/core/9_7_0/changes/Changes.html

Could we kick off a new release for PyLucene?

Thanks!

Ben


Re: [VOTE] Dimension Limit for KNN Vectors

2023-05-16 Thread Benjamin Trent
My vote is for option 3. It keeps Lucene's own limit from being increased,
while allowing others who implement a different codec to set a limit of
their choosing.

Though I don't know the historical reasons for putting specific
configuration items at the codec level. This limit is performance related
and various codec implementations would have different performance concerns.
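
To make the idea behind option 3 concrete, here is a minimal sketch of a
limit that lives with the format instead of in a shared top-level class. All
class and method names below are hypothetical and are not Lucene's actual
API; the 1024 value is the current hardcoded limit from the vote, and 4096 is
just an illustrative number for a different implementation.

```java
/** Hypothetical sketch, not Lucene's API: each vector format owns its dimension limit. */
abstract class VectorFormatSketch {

  /** Largest vector dimension this format accepts. */
  abstract int maxDimensions();

  /** Rejects dimensions outside the range this particular format supports. */
  final void checkDimension(int dimension) {
    if (dimension <= 0 || dimension > maxDimensions()) {
      throw new IllegalArgumentException(
          "dimension " + dimension + " is outside (0, " + maxDimensions() + "]");
    }
  }
}

/** Stands in for the HNSW-based format, keeping today's limit. */
final class HnswFormatSketch extends VectorFormatSketch {
  @Override
  int maxDimensions() {
    return 1024; // the current hardcoded limit discussed in this vote
  }
}

/** Stands in for a different codec that picks its own limit. */
final class CustomFormatSketch extends VectorFormatSketch {
  @Override
  int maxDimensions() {
    return 4096; // illustrative value only
  }
}
```

With this shape, `new HnswFormatSketch().checkDimension(2048)` throws, while
`new CustomFormatSketch().checkDimension(2048)` passes, and neither format's
limit constrains the other.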


On Tue, May 16, 2023, 8:02 AM Michael Wechner 
wrote:

> +1 to Gus' reply.
>
> I think that Robert's veto or anyone else's veto is fair enough, but I
> also think that anyone who is vetoing should be very clear about the
> objectives / goals to be achieved, in order to get a +1.
>
> If no clear objectives / goals can be defined and agreed on, then the
> whole thing becomes arbitrary.
>
> Therefore I would also be interested to know the objectives / goals to be
> met that there will be a +1 re this vote?
>
> Thanks
>
> Michael
>
>
>
> Am 16.05.23 um 13:45 schrieb Gus Heck:
>
> Robert,
>
> Can you explain in clear technical terms the standard that must be met for
> performance? A benchmark that must run in X time on Y hardware for example
> (and why that test is suitable)? Or some other reproducible criteria? So
> far I've heard you give an *opinion* that it's unusable, but that's not a
> technical criteria, others may have a different concept of what is usable
> to them.
>
> Forgive me if I misunderstand, but the essence of your argument has seemed
> to be
>
> "Performance isn't good enough, therefore we should force anyone who wants
> to experiment with something bigger to fork the code base to do it"
>
> Thus, it is necessary to have a clear unambiguous standard that anyone can
> verify for "good enough". A clear standard would also focus efforts at
> improvement.
>
> Where are the goal posts?
>
> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit is
> fundamentally counterproductive in an open source setting, as it will lead
> to *fewer people* pushing the limits. Extremely few people are going to
> get into the nitty-gritty of optimizing things unless they are staring at
> code that they can prove does something interesting, but doesn't run fast
> enough for their purposes. If people hit a hard limit, more of them give up
> and never develop the code that will motivate them to look for
> optimizations.
>
> -Gus
>
> On Tue, May 16, 2023 at 6:04 AM Robert Muir  wrote:
>
>> i still feel -1 (veto) on increasing this limit. sending more emails does
>> not change the technical facts or make the veto go away.
>>
>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability.  Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache 

Re: Lucene 9.6 release

2023-04-19 Thread Benjamin Trent
+1 !

You rock Alan!

On Wed, Apr 19, 2023, 9:54 AM Ignacio Vera  wrote:

> +1
>
> Thanks Alan!
>
> On Wed, Apr 19, 2023 at 1:27 PM Alan Woodward 
> wrote:
>
>> Hi all,
>>
>> It’s been a while since our last release, and we have a number of nice
>> improvements and optimisations sitting in the 9x branch.  I propose that we
>> start the process for a 9.6 release, and I will volunteer to be the release
>> manager.  If there are no objections, I will cut a release branch one week
>> today, April 26th.
>>
>> - Alan
>>
>>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-07 Thread Benjamin Trent
From all I have seen when hooking up JFR when indexing a medium number of
vectors (1M+), almost all the time is spent simply comparing the vectors
(e.g. dot_product).

This indicates to me that another algorithm won't really help index build
time tremendously. Unless others do dramatically fewer vector comparisons
(from what I can tell, this is at least not true for DiskAnn, unless some
fancy footwork is done when building the PQ codebook).

I would also say that comparing vector index build time to term indexing is
comparing apples and oranges. Yeah, they both live in Lucene, but the number
of calculations required (no matter the data structure used) will be orders
of magnitude greater.
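
For a sense of where that cost comes from, here is a minimal sketch of a
single comparison. It is a plain scalar loop written for illustration, not
Lucene's implementation, and the class and method names are made up. Each
comparison costs one multiply-add per dimension, so the total work scales
with the number of comparisons times the dimension:

```java
/** Illustrative only: the per-comparison cost of a dot product. */
final class DotProductCost {

  /** Scalar dot product: one multiply-add per dimension. */
  static float dotProduct(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Total multiply-adds for a given number of vector comparisons. */
  static long multiplyAdds(long comparisons, int dimension) {
    return comparisons * dimension;
  }
}
```

Doubling the dimension doubles `multiplyAdds` for the same number of
comparisons, regardless of which graph or index structure decides what gets
compared.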


On Fri, Apr 7, 2023, 4:59 PM Robert Muir  wrote:

> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov  wrote:
> >
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
>
> My scale concerns are both space and time. What does the execution
> time look like if you don't set insanely large IW rambuffer? The
> default is 16MB. Just concerned we're shoving some problems under the
> rug :)
>
> Even with the yuge RAMbuffer, we're still talking about almost 2 hours
> to index 4M documents with these 2k vectors. Whereas you'd measure
> this in seconds with typical lucene indexing, its nothing.
>
>
>


Re: Welcome Ben Trent as Lucene committer

2023-01-27 Thread Benjamin Trent
Hey y'all!

This is truly an honor!

Well, I am Ben Trent and have been writing code for over a decade now.
Which I know is not a very long time compared to most folks. I originally
wanted to do research and work in pure mathematics (my baccalaureate), but
quickly realized I am nowhere near smart enough to make money at that. So,
like many folks, I switched to computing and haven't looked back.

In my spare time (when not wrangling one of my children), I enjoy movies
(especially old kung fu, anything Golden Harvest or Shaw Brothers), good
beer, reading, playing guitar, and hiking.

Thank you all for the warm welcome! See you online!

Ben

On Fri, Jan 27, 2023 at 10:26 AM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Welcome and congratulations, Ben!
>
> On Fri, Jan 27, 2023 at 8:48 PM Adrien Grand  wrote:
> >
> > I'm pleased to announce that Ben Trent has accepted the PMC's
> > invitation to become a committer.
> >
> > Ben, the tradition is that new committers introduce themselves with a
> > brief bio.
> >
> > Congratulations and welcome!
> >
> > --
> > Adrien
> >
> >
>


Adding new extension point

2022-11-10 Thread Benjamin Trent
Hey y'all,

I am new to this type of workflow; I am used to GitHub and pull requests.

What is the process for adding a new extension point? With a recent foray
into getting Lucene KNN into the ann-benchmarks repository, I found the
need to adjust the current codec (Lucene94Codec) to allow us to adjust some
values.

I think the extension will only need to allow us to optionally override the
`getFormat...ForField` parameters.

Thanks!

Ben Trent

p.s. In the meantime, I am manually copying a file before calling `make` in
PyLucene.


Re: [VOTE] Release PyLucene 9.4.1-rc3

2022-11-01 Thread Benjamin Trent
+1 from me

On Tue, Nov 1, 2022 at 4:37 PM Andi Vajda  wrote:

>
> The PyLucene 9.4.1 (rc3) release tracking the recent release of
> Apache Lucene 9.4.1 is ready.
>
> A release candidate is available from:
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/9.4.1-rc3/
>
> PyLucene 9.4.1 is built with JCC 3.13, included in these release artifacts.
>
> JCC 3.13 supports Python 3.3 up to Python 3.11.
> PyLucene may also be built with Python 2, although Python 2 support is now
> untested.
>
> Please vote to release these artifacts as PyLucene 9.4.1.
> Anyone interested in this release can and should vote !
>
> Thanks !
>
> Andi..
>
> ps: the KEYS file for PyLucene release signing is at:
> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>
> pps: here is my +1
>


Re: [VOTE] Release PyLucene 9.4.1-rc2

2022-11-01 Thread Benjamin Trent
Andi,

I pulled down rc2 and tested it in Docker on Ubuntu 18.04.6 LTS (Bionic
Beaver) with Python 3.6.9. I get the following error when attempting to
build JCC:

jcc3/sources/functions.cpp: In function 'void installType(PyTypeObject**, PyType_Def*, PyObject*, char*, int)':
jcc3/sources/functions.cpp:1742:13: error: 'Py_SET_TYPE' was not declared in this scope
 Py_SET_TYPE(*type, PY_TYPE(FinalizerClass));
 ^~~
jcc3/sources/functions.cpp:1742:13: note: suggested alternative: '__S64_TYPE'
 Py_SET_TYPE(*type, PY_TYPE(FinalizerClass));
 ^~~
 __S64_TYPE
error: command 'aarch64-linux-gnu-gcc' failed with exit status 1

I am thinking https://issues.apache.org/jira/browse/PYLUCENE-66 is related.

Thanks!

On Tue, Nov 1, 2022 at 2:13 PM Andi Vajda  wrote:

>
> The PyLucene 9.4.1 (rc2) release tracking the recent release of
> Apache Lucene 9.4.1 is ready.
>
> A release candidate is available from:
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/9.4.1-rc2/
>
> PyLucene 9.4.1 is built with JCC 3.13, included in these release artifacts.
>
> JCC 3.13 supports Python 3.3 up to Python 3.11.
> PyLucene may also be built with Python 2, although Python 2 support is now
> untested.
>
> Please vote to release these artifacts as PyLucene 9.4.1.
> Anyone interested in this release can and should vote !
>
> Thanks !
>
> Andi..
>
> ps: the KEYS file for PyLucene release signing is at:
> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>
> pps: here is my +1
>


Re: [VOTE] Release PyLucene 9.4.1

2022-11-01 Thread Benjamin Trent
+1

On Mon, Oct 31, 2022 at 6:50 PM Jeff Breidenbach 
wrote:

> +1
>
> On Mon, Oct 31, 2022, 3:50 PM Andi Vajda  wrote:
>
> >
> > The PyLucene 9.4.1 (rc1) release tracking the recent release of
> > Apache Lucene 9.4.1 is ready.
> >
> > A release candidate is available from:
> > https://dist.apache.org/repos/dist/dev/lucene/pylucene/9.4.1-rc1/
> >
> > PyLucene 9.4.1 is built with JCC 3.12, included in these release
> artifacts.
> >
> > JCC 3.12 supports Python 3.3 up to Python 3.9 (in addition to Python
> 2.3+).
> > PyLucene may be built with Python 2 or Python 3, although Python 2
> support
> > is
> > now untested.
> >
> > Please vote to release these artifacts as PyLucene 9.4.1.
> > Anyone interested in this release can and should vote !
> >
> > Thanks !
> >
> > Andi..
> >
> > ps: the KEYS file for PyLucene release signing is at:
> > https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> > https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
> >
> > pps: here is my +1
> >
>


Re: New release of PyLucene?

2022-10-31 Thread Benjamin Trent
Thank you very much Andi!

Once I wrapped my head around the basics, I found PyLucene
excellent, intuitive, and just like writing in Lucene, but with Python!

It would be marvelous if we could automate the process or document the
steps for building newer versions.

Thanks!

Ben

On Mon, Oct 31, 2022 at 11:50 AM Andi Vajda  wrote:

>
> > On Oct 31, 2022, at 07:55, Benjamin Trent  wrote:
> >
> > Lucene 9.4.1 was recently released. The last version of PyLucene
> released
> > was 9.1.0. There have been some improvements to Lucene since then. My
> > particular concern is around KNN search.
> >
> > What is the process to start a new release of PyLucene?
>
> Showing interest, which you just did !
> Let me get one going...
>
> Andi..
>
> >
> >
> > Thank you!
> >
> > Ben
>


New release of PyLucene?

2022-10-31 Thread Benjamin Trent
Lucene 9.4.1 was recently released. The last version of PyLucene released
was 9.1.0. There have been some improvements to Lucene since then. My
particular concern is around KNN search.

What is the process to start a new release of PyLucene?


Thank you!

Ben