[ 
https://issues.apache.org/jira/browse/SOLR-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Puneet Ahuja updated SOLR-17942:
--------------------------------
    Description: 
In Lucene, the parameter ramPerThreadHardLimitMB must be less than 2048 MB 
(2GB), so a single indexing thread is forced to flush before its buffer 
reaches 2GB and cannot write larger segments.
Refer: 
[https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)]
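
For context, the linked Javadoc describes this bound as a safety limit tied to 
32-bit signed integer memory addressing inside the per-thread writer. The 
arithmetic behind that can be sketched as follows. The class and method names 
here are hypothetical and are not part of Lucene or Solr; this is an 
illustration of the size calculation only, not of Lucene's internals.

```java
// Illustration only: HardLimitArithmetic and fitsInSignedInt are hypothetical names.
class HardLimitArithmetic {
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Returns true if a buffer of the given size (in MB) can be addressed by a signed 32-bit int. */
    static boolean fitsInSignedInt(int megabytes) {
        // Compute in long so the multiplication itself cannot overflow.
        long bytes = megabytes * BYTES_PER_MB;
        return bytes <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(fitsInSignedInt(2047)); // true:  2,146,435,072 bytes
        System.out.println(fitsInSignedInt(2048)); // false: 2,147,483,648 bytes = 2^31
    }
}
```

2047 MB is the largest whole-megabyte value whose byte count still fits in a 
signed int, which matches the "less than 2048 MB" rule on the setter.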

This issue proposes to make this parameter configurable above the 2GB limit, so 
that each thread can write a bigger segment. I plan to use reflection to bypass 
this hard-coded limit in Lucene.
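
A minimal sketch of what such a reflection bypass could look like is below. It 
does not use Lucene: GuardedConfig is a hypothetical stand-in with a 
range-checked setter, and the field name perThreadHardLimitMB is an assumption 
made for illustration. The actual Solr change may look quite different.

```java
import java.lang.reflect.Field;

// Sketch only: all class, method, and field names here are hypothetical stand-ins.
class ReflectionBypassSketch {

    /** Stand-in for a config object whose setter enforces an upper bound. */
    static class GuardedConfig {
        private int perThreadHardLimitMB = 1945; // starting value chosen for illustration

        void setRAMPerThreadHardLimitMB(int mb) {
            if (mb <= 0 || mb >= 2048) {
                throw new IllegalArgumentException(
                    "limit must be greater than 0 and less than 2048MB");
            }
            this.perThreadHardLimitMB = mb;
        }

        int getRAMPerThreadHardLimitMB() {
            return perThreadHardLimitMB;
        }
    }

    /** Writes the private field directly, skipping the setter's range check. */
    static void forceLimit(GuardedConfig config, int mb) {
        try {
            Field field = GuardedConfig.class.getDeclaredField("perThreadHardLimitMB");
            field.setAccessible(true); // allow access to the private field
            field.setInt(config, mb);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not override limit via reflection", e);
        }
    }

    public static void main(String[] args) {
        GuardedConfig config = new GuardedConfig();
        try {
            config.setRAMPerThreadHardLimitMB(4096); // rejected by the range check
        } catch (IllegalArgumentException e) {
            System.out.println("setter rejected 4096: " + e.getMessage());
        }
        forceLimit(config, 4096); // bypasses the check
        System.out.println("limit is now " + config.getRAMPerThreadHardLimitMB());
    }
}
```

The sketch changes only the stored value. Whether the code that later reads 
the value behaves correctly above the original bound is a separate question 
that it does not address.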

 

When indexing high-dimensional vector data, each segment has its own HNSW 
graph, so more segments mean more graphs to search per shard and more graph 
rebuild work during merges. With this change, a single indexing thread can 
flush fewer, larger segments, which is generally more resource-efficient 
for vector-heavy workloads.

Lucene issue: https://github.com/apache/lucene/issues/15296

  was:
In Lucene, the parameter ramPerThreadHardLimitMB must be less than 2048 MB 
(2GB), so a single indexing thread is forced to flush before its buffer 
reaches 2GB and cannot write larger segments.
Refer: 
[https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)]

This issue proposes to make this parameter configurable above the 2GB limit, so 
that each thread can write a bigger segment. I plan to use reflection to bypass 
this hard-coded limit in Lucene.

 

When indexing high-dimensional vector data, each segment has its own HNSW 
graph, so more segments mean more graphs to search per shard and more graph 
rebuild work during merges. With this change, a single indexing thread can 
flush fewer, larger segments, which is generally more resource-efficient 
for vector-heavy workloads.


> Raising the hardcoded limit of lucene parameter ramPerThreadHardLimitMB using 
> reflection
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-17942
>                 URL: https://issues.apache.org/jira/browse/SOLR-17942
>             Project: Solr
>          Issue Type: Task
>    Affects Versions: main (10.0)
>            Reporter: Puneet Ahuja
>            Priority: Major
>
> In Lucene, the parameter ramPerThreadHardLimitMB must be less than 2048 MB 
> (2GB), so a single indexing thread is forced to flush before its buffer 
> reaches 2GB and cannot write larger segments.
> Refer: 
> [https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMPerThreadHardLimitMB(int)]
> This issue proposes to make this parameter configurable above the 2GB limit, 
> so that each thread can write a bigger segment. I plan to use reflection to 
> bypass this hard-coded limit in Lucene.
>  
> When indexing high-dimensional vector data, each segment has its own HNSW 
> graph, so more segments mean more graphs to search per shard and more graph 
> rebuild work during merges. With this change, a single indexing thread can 
> flush fewer, larger segments, which is generally more resource-efficient 
> for vector-heavy workloads.
> Lucene issue: https://github.com/apache/lucene/issues/15296



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
