PackedInts functionalities

2023-10-16 Thread Dongyu Xu
Hi devs,

As I was working on https://github.com/apache/lucene/issues/12513 I needed to 
compress positive integers which are used to locate postings etc.

To put it concretely, I need to pack a few values per term contiguously, 
and those values can have different bit widths. For example, suppose we 
need to encode docFreq and postingsStartOffset per term, where docFreq takes 
4 bits and postingsStartOffset takes 6 bits. We expect to write the following 
for two terms.

```
Term1                                     | Term2
docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) | postingsStartOffset(6bit)
```

On the read path, I expect to locate the offset for a term first and then 
read two values that have different bit widths.
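
A minimal sketch of the write and read paths described above, assuming a plain long[] as the backing store. MixedWidthBitPacker and its methods are hypothetical names used only for illustration; they are not part of Lucene's PackedInts utilities, and the sample values are made up. Because every term uses the same widths here (4 + 6 bits), the bit offset of a term is simply its index times 10.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Lucene class): packs values of differing bit widths back to back. */
final class MixedWidthBitPacker {
  private long[] words = new long[1];
  private long bitCount = 0; // total number of bits written so far

  /** Appends the low {@code bits} bits of {@code value}; {@code bits} must be in [1, 64]. */
  void write(long value, int bits) {
    if (bits < 1 || bits > 64) {
      throw new IllegalArgumentException("bits must be in [1, 64]");
    }
    if (bits < 64 && (value >>> bits) != 0) {
      throw new IllegalArgumentException("value does not fit in " + bits + " bits");
    }
    int word = (int) (bitCount >>> 6);  // index of the 64-bit word to start in
    int shift = (int) (bitCount & 63);  // bit position inside that word
    if (word + 1 >= words.length) {
      // Keep one spare word so a value may spill over the word boundary.
      words = Arrays.copyOf(words, Math.max(words.length * 2, word + 2));
    }
    words[word] |= value << shift;
    if (shift + bits > 64) {
      // The value straddles two words: store its high bits in the next word.
      words[word + 1] |= value >>> (64 - shift);
    }
    bitCount += bits;
  }

  /** Returns a copy of the packed words, trimmed to the bits actually written. */
  long[] toArray() {
    return Arrays.copyOf(words, (int) ((bitCount + 63) >>> 6));
  }

  /** Reads {@code bits} bits starting at absolute bit position {@code bitOffset}. */
  static long read(long[] words, long bitOffset, int bits) {
    int word = (int) (bitOffset >>> 6);
    int shift = (int) (bitOffset & 63);
    long result = words[word] >>> shift;
    if (shift + bits > 64) {
      result |= words[word + 1] << (64 - shift);
    }
    return bits == 64 ? result : result & ((1L << bits) - 1);
  }

  public static void main(String[] args) {
    int docFreqBits = 4;
    int offsetBits = 6;
    long[][] terms = {{5, 33}, {9, 63}}; // {docFreq, postingsStartOffset}, made-up values
    MixedWidthBitPacker packer = new MixedWidthBitPacker();
    for (long[] term : terms) {
      packer.write(term[0], docFreqBits);
      packer.write(term[1], offsetBits);
    }
    long[] packed = packer.toArray();
    long term2Start = docFreqBits + offsetBits; // bit offset of the second term
    System.out.println(read(packed, term2Start, docFreqBits));
    System.out.println(read(packed, term2Start + docFreqBits, offsetBits));
  }
}
```

The reader has to know each width in advance, so in practice the widths would need to be stored somewhere (for example once per block); that part is left out of the sketch.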

In the spirit of not reinventing the wheel unnecessarily, I explored the 
existing PackedInts utility classes, and I believe there is no support for 
this at the moment. The biggest gap I found is that the existing classes 
expect to write/read values of the same bit width.

I'm writing to get feedback from y'all to see if I missed anything.

Cheers,
Tony X


Re: Can we get rid of "Approve & Run" on GitHub PRs by new contributors (non-committers)?

2023-10-16 Thread Dawid Weiss
I filed a PR here -
https://github.com/apache/lucene/pull/12687

Dawid



Re: Can we get rid of "Approve & Run" on GitHub PRs by new contributors (non-committers)?

2023-10-16 Thread Dawid Weiss
It's actually as simple as adding:

timeout-minutes: xyz

to workflows.

https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

I use it elsewhere for jobs on Windows because they tend to hang sometimes
(for reasons unknown to me).

Dawid




Re: Can we get rid of "Approve & Run" on GitHub PRs by new contributors (non-committers)?

2023-10-16 Thread Robert Muir
I think running the builds with a timeout is a good thing to do
anyway, for any CI build. I'm sure github actions has some fancy yaml
for that, but you can just do "timeout -k 1m 1h ./gradlew..." instead
of "./gradlew" too.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Can we get rid of "Approve & Run" on GitHub PRs by new contributors (non-committers)?

2023-10-16 Thread Uwe Schindler

Hi,

this seems to be a safety feature and is enabled in general on 
GitHub. I found no option in .asf.yaml to enable or disable it:


https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-GitHubsettings

You can only add some users to a whitelist of "collaborators" through 
.asf.yaml. Nevertheless, I see no problem with pressing the button. When I 
quickly review a PR, I generally press the button. For safety reasons 
this is also required in most projects I have contributed to (not only 
at the ASF). What's the problem with pressing the button? Of course you 
take responsibility if a crypto miner starts, but if there is a huge 
PR by an external contributor, I would first ask if they could split it 
into smaller pieces. At some point we have to review it, and most 
external people who create huge PRs have done something bad like pressing 
the format button in their IDE.


I think running "./gradlew precommit" is a must for new contributors. 
The online checks on GitHub are more for me as a reviewer/committer, to 
make sure all is fine before I press the merge button (for many PRs I 
don't even check out the code after review). So it is fine that end 
users cannot trigger them.


-1 to asking INFRA to enable this.

Uwe


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de


Can we get rid of "Approve & Run" on GitHub PRs by new contributors (non-committers)?

2023-10-16 Thread Michael McCandless
When a non-committer (I think?) opens a PR, one of the committers must
notice it and click Approve & Run so the contributor can find out if
something broke in our automated tests/precommit/linting.

This seems like a waste, and a friction in the worst possible place for our
community: new contributor onboarding experience.

I think we have it to prevent e.g. a crypto mining bot of a PR sneaking in
and taking tons of resources to mine dogecoin or so?

But 1) that doesn't seem to be happening so far, 2) when I hit "Approve &
Run" I never look closely to see if there is in fact a hidden crypto miner
in there, and 3) can't we just put some reasonable timeout on the GitHub
actions to block such abuse?

Is this some sort of requirement by GitHub, or did we choose to turn on
this silly step?

Mike McCandless

http://blog.mikemccandless.com


Re: Multimodal search

2023-10-16 Thread Michael Wechner
btw, here are some other examples of hybrid search implementations, 
using RRF


https://weaviate.io/blog/hybrid-search-explained
https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking
https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

but as written below, I don't think this really addresses the problem of 
accuracy at its core.
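
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of the idea: each document receives 1 / (k + rank) from every ranked list it appears in, and the fused order sorts by the summed score. The class and method names are hypothetical, the document ids are made up, and k = 60 is only a commonly used default; the linked systems may differ in details such as windowing or tie-breaking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of reciprocal rank fusion; not taken from any of the linked systems. */
final class ReciprocalRankFusion {

  /**
   * Fuses several ranked lists of document ids (best first) into one list.
   * A document at 1-based rank r in a list contributes 1 / (k + r) to its score.
   */
  static List<String> fuse(List<List<String>> rankings, int k) {
    Map<String, Double> scores = new HashMap<>();
    for (List<String> ranking : rankings) {
      for (int i = 0; i < ranking.size(); i++) {
        int rank = i + 1;
        scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
      }
    }
    List<String> fused = new ArrayList<>(scores.keySet());
    // Highest fused score first; ties are broken by id to keep the order deterministic.
    fused.sort((a, b) -> {
      int byScore = Double.compare(scores.get(b), scores.get(a));
      return byScore != 0 ? byScore : a.compareTo(b);
    });
    return fused;
  }

  public static void main(String[] args) {
    List<String> textHits = Arrays.asList("a", "b", "c");   // e.g. from a keyword query
    List<String> vectorHits = Arrays.asList("c", "a", "d"); // e.g. from a kNN query
    System.out.println(fuse(Arrays.asList(textHits, vectorHits), 60));
  }
}
```

Note that RRF only looks at ranks, not at the raw scores, which is why it can merge text and vector results without score normalization, and also why it cannot make either individual ranking more accurate.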


Thanks

Michael


On 15.10.23 at 21:05, Michael Wechner wrote:

Hi Navneet

I also observe that various "vector search DBs" are implementing 
hybrid search, because the accuracy with embeddings is often not good 
enough.
Vectors are often too "mushy" and hybrid search can help to improve 
accuracy, just as re-ranking does, but I think there is a better way.


Depending on the dataset and the expertise of a human, answers by 
"humans" are much more accurate, because I think "humans" extract 
"features" from the input and then operate on these "features". 
See for example


https://medium.com/aleph-alpha-blog/multimodality-attention-is-all-you-need-is-all-we-needed-526c45abdf0

and see the principles behind DALL-E and CLIP.

I think the same or similar principles could be re-used to implement a 
more accurate search.


I have built a very simple PoC, and it looks promising: this approach 
seems to provide much higher accuracy, because the similarity scores 
are much more distinct.


Of course there are various challenges, but I think it is worth exploring.

I also understand that within an existing "ecosystem", change, or 
rather trying something new, can be difficult, but I guess I am not the 
only one seeing low accuracy as a fundamental problem, right?


Thanks

Michael





On 14.10.23 at 09:38, Navneet Verma wrote:

Hi Michael,
Please correct me if I am wrong, but I think what you mean by 
multimodal search is combining text search and vector search to 
improve the accuracy of search results. As I understand the search 
space, people are calling this hybrid search. We recently launched a 
query clause in OpenSearch called "hybrid", which takes this hybrid 
approach and combines the scores of text and vector search globally 
(https://opensearch.org/blog/hybrid-search/). In our experiments we 
saw better accuracy than with text search or vector search alone. 
Just curious whether you are thinking of something like this or have 
a completely different idea.


I agree that techniques like re-ranking are currently very popular 
for improving the accuracy of search results.



Thanks
Navneet

On Fri, Oct 13, 2023 at 8:53 AM Michael Wechner wrote:


Thanks for your feedback and the link to the OpenSearch
implementation!

I think the embedding approach as it exists today is not and will
not be able to provide good enough accuracy.
Many people try to fix this with re-ranking, which helps, but
does not really fix the actual problem.

I think we focus too much on text, because text/language is
actually just a representation of the "models" we create in our
minds from the reality we perceive via our senses.

When you take multimodality into account from the very beginning,
then you will be forced to approach search differently
and I would argue that this will lead to a much more powerful
search implementation, which is able to provide better accuracy
and also the capability that the implementation knows much better
what it does not know.

I do not mean to sound philosophical; I actually have a quite
clear implementation in my mind, or rather on paper, but I would be
interested to know whether the Lucene community is interested in
reconsidering search from the ground up?

I think the Lucene community has a fantastic knowledge /
expertise, but I think it is time to evolve quite radically, and
not just do another vector search implementation.

WDYT?

Thanks

Michael







On 13.10.23 at 00:49, Michael Froh wrote:

We recently added multimodal search in OpenSearch:
https://github.com/opensearch-project/neural-search/pull/359

Since Lucene ultimately just cares about embeddings, does Lucene
itself really need to be multimodal? Wherever the embeddings
come from, Lucene can index the vectors and combine with textual
queries, right?

Thanks,
Froh

On Thu, Oct 12, 2023 at 12:59 PM Michael Wechner wrote:

Hi

Has any of the Lucene committers considered making Lucene
multimodal?

With a quick Google search I found for example

https://dl.acm.org/doi/abs/10.1145/3503161.3548768

https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper7.pdf

Thanks

Michael


