[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-11 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Attachment: (was: results.png)

> Boolean DocValues Codec Implementation
> --
>
> Key: LUCENE-8689
> URL: https://issues.apache.org/jira/browse/LUCENE-8689
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Ivan Mamontov
> Priority: Minor
> Labels: patch, performance
> Attachments: LUCENE-8689.patch, SynteticDocValuesBench70.java, results2.png
>
>
> Products can become available or unavailable at any time, so e-commerce
> search system designers need to embed up-to-date inventory availability
> information directly into the search engine. The key requirement is to
> accurately filter out unavailable products and to use availability as one of
> the ranking signals. However, keeping availability data up to date is a
> non-trivial task. A straightforward implementation based on partial updates
> of Lucene documents causes Solr cache thrashing, which hurts query
> performance and resource utilization.
> As an alternative, we can use DocValues and the built-in in-place updates,
> where field values can be updated independently without touching the
> inverted index. Filtering by DocValues is a bit slower, but the overall
> performance gain is better. However, the existing long-based DocValues are
> not sufficiently optimized for carrying boolean inventory availability data:
>  * All DocValues queries are internally rewritten into
> org.apache.lucene.search.DocValuesNumbersQuery, which is based on direct
> iteration over all column values and is typically much slower than
> TermsQuery.
>  * On every commit/merge the codec has to iterate over the DocValues a couple
> of times in order to choose the best compression algorithm for the given
> data. As a result, for 4K fields and 3M max doc a merge takes more than 10
> minutes.
> This issue is intended to solve these limitations via a special bitwise doc
> values format that uses the internal representation of
> org.apache.lucene.util.FixedBitSet to store indexed values and loads them at
> search time as a simple long array without additional decoding. There are
> several reasons for this:
>  * At index time, encoding is very fast, with no extra iterations over all
> values to choose the best compression algorithm for the given data.
>  * At query time, decoding is also simple and fast, with no GC pressure and
> no extra steps.
>  * The internal representation allows random access in constant time.
> Limitations are:
>  * Non-boolean fields are not supported.
>  * Boolean fields must be represented as long values: 1 for true and 0 for
> false.
>  * The current implementation does not support advanced bit set formats like
> org.apache.lucene.util.SparseFixedBitSet or
> org.apache.lucene.util.RoaringDocIdSet.
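
The storage idea described above can be illustrated with a minimal sketch in plain Java. It is not taken from LUCENE-8689.patch: the class and method names (BitwiseColumnSketch, encode, get) are hypothetical, and it only mimics the long[] word layout that FixedBitSet uses, without depending on Lucene.

```java
final class BitwiseColumnSketch {

    /** Packs one 0/1 value per document into 64-bit words (one bit per doc). */
    static long[] encode(long[] values) {
        long[] words = new long[(values.length + 63) >>> 6];
        for (int doc = 0; doc < values.length; doc++) {
            long v = values[doc];
            if (v != 0L && v != 1L) {
                // Mirrors the stated limitation: only 0 (false) and 1 (true) are allowed.
                throw new IllegalArgumentException("doc " + doc + " has non-boolean value " + v);
            }
            if (v == 1L) {
                // For long shifts Java uses only the low 6 bits of the shift count,
                // so this sets bit (doc % 64) of word (doc / 64).
                words[doc >>> 6] |= 1L << doc;
            }
        }
        return words;
    }

    /** Constant-time random access: one array read, one shift, one mask. */
    static boolean get(long[] words, int doc) {
        return (words[doc >>> 6] & (1L << doc)) != 0;
    }

    public static void main(String[] args) {
        long[] values = new long[130];
        values[0] = 1L;
        values[64] = 1L;
        values[129] = 1L;
        long[] words = encode(values);
        System.out.println(words.length + " words, doc 64 set: " + get(words, 64));
    }
}
```

With this layout a column for 2 000 000 documents needs 31 250 words (250 000 bytes), and a single pass over the values is enough to write it.
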
> In order to evaluate the performance gain I wrote a simple JMH-based
> benchmark [^SynteticDocValuesBench70.java] which estimates the relative cost
> of DF filters. This benchmark creates 2 000 000 documents with 5 boolean
> columns of different density, where 10, 35, 50, 60 and 90 is the amount of
> documents with value 1. Each method enumerates all values in the synthetic
> store field in one of the available ways:
>  * baseline – in almost all cases Solr uses a FixedBitSet in the filter cache
> to keep store availability. This test just iterates over all bits.
>  * docValuesRaw – iterates over all values of the DV column; the same code is
> used in "post filtering", sorting and faceting.
>  * docValuesNumbersQuery – iterates over all values produced by the
> query/filter store:1. DocValuesNumbersQuery is currently the only query
> implementation for DV-based fields, which means that Lucene rewrites all
> term, range and filter queries on non-indexed fields into this fallback
> implementation.
>  * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery
> which supports only two values, 0 and 1.
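
For readers without the attachment, here is a rough sketch of how columns of a given density could be generated and how a baseline-style enumeration walks the set bits. It is not the attached benchmark: it uses java.util.BitSet as a stand-in for FixedBitSet, omits JMH entirely, assumes the density figures are percentages, and all names (SyntheticColumnsSketch, randomColumn, countByIteration) are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

final class SyntheticColumnsSketch {

    /** Builds a column in which roughly densityPercent of the docs have value 1. */
    static BitSet randomColumn(int maxDoc, int densityPercent, long seed) {
        Random random = new Random(seed);
        BitSet bits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (random.nextInt(100) < densityPercent) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Baseline-style enumeration: visit every set bit once and count it. */
    static int countByIteration(BitSet bits) {
        int count = 0;
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int maxDoc = 2_000_000;
        int[] densities = {10, 35, 50, 60, 90}; // assumed to be percentages
        for (int density : densities) {
            BitSet column = randomColumn(maxDoc, density, 42L);
            System.out.println("density " + density + ": " + countByIteration(column) + " docs with value 1");
        }
    }
}
```

A real measurement would wrap the enumeration in JMH benchmark methods, as the attachment does, instead of timing a plain main method.
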
> !results2.png!
> Query latency is similar to FixedBitSet, with a negligible overhead of 1-2
> ms. DocValuesNumbersQuery is 6-7 times slower than the boolean query. The
> raw doc values iterator is also slower because it performs on-the-fly
> decoding.
> The attached patch contains two parts:
>  * a bitwise codec and all required structures and producers/consumers
>  * a boolean query which removes TwoPhaseIterator, the AllBits approximation
> and the missing docs lookup
> The docValues codec tests are green except for the cases with non-long
> values.
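
To show why a bit-per-document column needs neither an approximation nor a separate verification step, the sketch below finds the next matching document directly in the long[] words. It is an illustration only, not the query from LUCENE-8689.patch. The class and method names are hypothetical, target is assumed to be non-negative, and the words array is assumed to cover maxDoc bits. NO_MORE_DOCS copies the value of Lucene's DocIdSetIterator.NO_MORE_DOCS (Integer.MAX_VALUE).

```java
final class WordScanSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Returns the first doc >= target whose bit is set, or NO_MORE_DOCS. */
    static int advance(long[] words, int maxDoc, int target) {
        if (target >= maxDoc) {
            return NO_MORE_DOCS;
        }
        int i = target >>> 6;
        // Drop the bits below target; the shift count is taken modulo 64 for longs.
        long word = words[i] >>> target;
        if (word != 0) {
            int doc = target + Long.numberOfTrailingZeros(word);
            return doc < maxDoc ? doc : NO_MORE_DOCS;
        }
        // Skip empty words 64 documents at a time.
        while (++i < words.length) {
            word = words[i];
            if (word != 0) {
                int doc = (i << 6) + Long.numberOfTrailingZeros(word);
                return doc < maxDoc ? doc : NO_MORE_DOCS;
            }
        }
        return NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        long[] words = {0b1001L, 0L, 1L << 5}; // docs 0, 3 and 133 have value 1
        int maxDoc = 192;
        for (int doc = advance(words, maxDoc, 0); doc != NO_MORE_DOCS; doc = advance(words, maxDoc, doc + 1)) {
            System.out.println("match: " + doc);
        }
    }
}
```

Every document returned by this scan is already a confirmed match, so no second phase is needed to check the value.
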



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-11 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 

[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-11 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 

[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 

[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Attachment: SynteticDocValuesBench70.java







[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 

[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 
To avoid issues where some products become available/unavailable at some point 
in time after being out-of-stock, e-commerce search system designers need to 
embed up-to-date information about inventory availability right into the search 
engines. Key requirement is to be able to accurately filter out unavailable 
products and use availability as one of ranking signals. However, keeping 
availability data up-to-date is a non-trivial task. Straightforward 
implementation based on a partial updates of Lucene documents causes Solr cache 
trashing with negatively affected query performance and resource utilization.
 As an alternative solution we can use DocValues and build-in in-place updates 
where field values can be independently updated without touching inverted 
index, and while filtering by DocValues is a bit slower, overall performance 
gain is better. However existing long based docValues are not sufficiently 
optimized for carrying boolean inventory availability data:
 * All DocValues queries are internally rewritten into 
org.apache.lucene.search.DocValuesNumbersQuery which is based on direct 
iteration over all column values and typically much slower than using 
TermsQuery.
 * On every commit/merge codec has to iterate over DocValues a couple times in 
order to choose ths best compression algorithm suitable for given data. As a 
result for 4K fields and 3M max doc merge takes more than 10 minutes

This issue is intended to solve these limitations via special bitwise doc 
values format that uses internal representation of 
org.apache.lucene.util.FixedBitSet in order to store indexed values and load 
them at search time as a simple long array without additional decoding. There 
are several reasons for this:
 * At index time encoding is super fast without superfluous iterations over all 
values to choose ths best compression algorithm suitable for given data.
 * At query time decoding is also simple and fast, no GC pressure and extra 
steps
 * Internal representation allows to perform random access in constant time

Limitations are:
 * Non-boolean fields are not supported.
 * Boolean fields must be represented as long values: 1 for true and 0 for 
false.
 * The current implementation does not support advanced bit set formats like 
org.apache.lucene.util.SparseFixedBitSet or 
org.apache.lucene.util.RoaringDocIdSet.
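
To make the second limitation concrete, here is a hedged sketch of how long values could be packed into bits at index time. BooleanEncoder and pack are hypothetical names, and rejecting values other than 0 and 1 with an exception is an assumption; the patch may handle such input differently.

```java
// Minimal sketch, NOT the patch: packs per-document long values (0 or 1) into bits.
// BooleanEncoder and pack are hypothetical names used for illustration only.
final class BooleanEncoder {

    // Single pass over the values; no statistics are gathered to pick a compression scheme.
    static long[] pack(long[] values) {
        long[] words = new long[(values.length + 63) >>> 6];
        for (int doc = 0; doc < values.length; doc++) {
            long v = values[doc];
            if (v == 1L) {
                words[doc >>> 6] |= 1L << (doc & 63);
            } else if (v != 0L) {
                // Assumption: anything other than 0/1 is treated as a caller error.
                throw new IllegalArgumentException(
                        "doc " + doc + ": expected 0 or 1, got " + v);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        long[] words = pack(new long[] {1, 0, 1, 1});
        System.out.println(Long.toBinaryString(words[0])); // 1101
    }
}
```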

To evaluate the performance gain I've written a simple JMH-based benchmark 
[^SynteticDocValuesBench70.java] which allows estimating the relative cost of 
DF filters. This benchmark creates 2 000 000 documents with 5 boolean columns 
of different density, where 10, 35, 50, 60 and 90 is the amount of documents 
with value 1. Each method enumerates all values in a synthetic store 
field in one of the available ways:
 * baseline – in almost all cases Solr uses FixedBitSet in the filter cache to 
keep store availability. This test just iterates over all bits.
 * docValuesRaw – iterates over all values of a DV column; the same code is 
used in "post filtering", sorting and faceting.
 * docValuesNumbersQuery – iterates over all values produced by the 
query/filter store:1. DocValuesNumbersQuery is the only query implementation 
for DV-based fields, which means that Lucene rewrites all term, range and 
filter queries for non-indexed fields into this fallback implementation.
 * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery which 
supports only two values – 0/1.
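
For readers without the attachment at hand, the baseline case can be approximated by the sketch below. It is not taken from the benchmark file: DensityColumn, random and forEachSetBit are hypothetical names, the density figures are read as percentages (an assumption), and a plain main method stands in for the JMH harness.

```java
import java.util.Random;
import java.util.function.IntConsumer;

// Minimal sketch, NOT the attached benchmark: builds a bit-packed column of a given
// density and walks its set bits, roughly what the "baseline" case measures.
final class DensityColumn {

    // Each document gets value 1 with probability percent/100.
    // Reading the density as a percentage is an assumption.
    static long[] random(int maxDoc, int percent, long seed) {
        Random rnd = new Random(seed);
        long[] words = new long[(maxDoc + 63) >>> 6];
        for (int doc = 0; doc < maxDoc; doc++) {
            if (rnd.nextInt(100) < percent) {
                words[doc >>> 6] |= 1L << (doc & 63);
            }
        }
        return words;
    }

    // Visits every set bit in increasing document order and returns how many were seen.
    static int forEachSetBit(long[] words, IntConsumer consumer) {
        int count = 0;
        for (int i = 0; i < words.length; i++) {
            long word = words[i];
            while (word != 0) {
                int bit = Long.numberOfTrailingZeros(word);
                consumer.accept((i << 6) + bit);
                word &= word - 1; // clear the lowest set bit
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] column = random(2_000_000, 35, 42L);
        int hits = forEachSetBit(column, doc -> { });
        System.out.println("documents with value 1: " + hits);
    }
}
```

Skipping whole zero words and clearing one bit at a time keeps the cost proportional to the number of words plus the number of matching documents, which is why a bit set makes a natural lower bound for the other variants.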

!results2.png!

Query latency is similar to FixedBitSet, with a negligible overhead of 1-2 ms.


[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 

[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Attachment: results2.png




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Attachment: (was: results2.png)




[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 

[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 

!results2.png|thumbnail!


[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Attachment: results2.png




[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 
To avoid issues where some products become available/unavailable at some point 
in time after being out-of-stock, e-commerce search system designers need to 
embed up-to-date information about inventory availability right into the search 
engines. Key requirement is to be able to accurately filter out unavailable 
products and use availability as one of ranking signals. However, keeping 
availability data up-to-date is a non-trivial task. Straightforward 
implementation based on a partial updates of Lucene documents causes Solr cache 
trashing with negatively affected query performance and resource utilization.
 As an alternative solution we can use DocValues and build-in in-place updates 
where field values can be independently updated without touching inverted 
index, and while filtering by DocValues is a bit slower, overall performance 
gain is better. However existing long based docValues are not sufficiently 
optimized for carrying boolean inventory availability data:
 * All DocValues queries are internally rewritten into 
org.apache.lucene.search.DocValuesNumbersQuery which is based on direct 
iteration over all column values and typically much slower than using 
TermsQuery.
 * On every commit/merge the codec has to iterate over DocValues a couple of 
times in order to choose the best compression algorithm suitable for the given 
data. As a result, for 4K fields and 3M max doc, a merge takes more than 10 
minutes.

This issue is intended to solve these limitations via special bitwise doc 
values format that uses internal representation of 
org.apache.lucene.util.FixedBitSet in order to store indexed values and load 
them at search time as a simple long array without additional decoding. There 
are several reasons for this:
 * At index time encoding is very fast, with no superfluous iterations over all 
values to choose the best compression algorithm suitable for the given data.
 * At query time decoding is also simple and fast, with no GC pressure and no 
extra steps.
 * The internal representation allows random access in constant time.
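
The layout described above can be sketched with plain Java arrays. This is only an illustration and not the code from the attached patch: the class BooleanColumn and its methods are hypothetical names, and the sketch merely mimics the idea of packing 0/1 values into 64-bit words the way a bit set does.

```java
import java.util.Arrays;

/** Hypothetical sketch: packs one boolean (stored as long 0/1) per document into 64-bit words. */
final class BooleanColumn {
    private final long[] words; // 64 documents per word
    private final int maxDoc;

    private BooleanColumn(long[] words, int maxDoc) {
        this.words = words;
        this.maxDoc = maxDoc;
    }

    /** Single pass over the values: no statistics, no choice of compression scheme. */
    static BooleanColumn encode(long[] values) {
        long[] words = new long[(values.length + 63) >>> 6];
        for (int doc = 0; doc < values.length; doc++) {
            long v = values[doc];
            if (v != 0L && v != 1L) {
                throw new IllegalArgumentException("doc " + doc + " has non-boolean value " + v);
            }
            if (v == 1L) {
                words[doc >>> 6] |= 1L << (doc & 63);
            }
        }
        return new BooleanColumn(words, values.length);
    }

    /** Constant-time random access: one array read, one shift, one mask. */
    long get(int doc) {
        if (doc < 0 || doc >= maxDoc) {
            throw new IndexOutOfBoundsException("doc=" + doc);
        }
        return (words[doc >>> 6] >>> (doc & 63)) & 1L;
    }

    /** Returns a copy of the packed words; no per-value decoding is involved. */
    long[] rawWords() {
        return Arrays.copyOf(words, words.length);
    }
}
```

A caller would build the column once with BooleanColumn.encode(values) and then call get(doc) for any document in any order.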

Limitations are:
 * Does not support non boolean fields
 * Boolean fields must be represented as long values 1 for true and 0 for false
 * Current implementation does not support advanced bit set formats like 
org.apache.lucene.util.SparseFixedBitSet or 
org.apache.lucene.util.RoaringDocIdSet

In order to evaluate the performance gain I've written a simple JMH-based 
benchmark [^SynteticDocValuesBench70.java] which allows estimating the relative cost of 
DF filters. This benchmark creates 2 000 000 documents with 5 boolean columns 
with different density, where 10, 35, 50, 60 and 90 is an amount of documents 
with value 1. Each method tries to enumerate over all values in synthetic store 
field in all available ways:
 * baseline – in almost all cases Solr uses FixedBitSet in filter cache to keep 
store availability. This test just iterates over all bits.
 * docValuesRaw – iterates over all values of DV column, the same code is used 
in "post filtering", sorting and faceting.
 * docValuesNumbersQuery – iterates over all values produced by the query/filter 
store:1. DocValuesNumbersQuery is currently the only query implementation for 
DV-based fields, which means that Lucene rewrites all term, range and filter 
queries on non-indexed fields into this fallback implementation.
 * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery that 
supports only two values – 0/1
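
The difference between scanning every value of a column and walking only the set bits of a bit set can be illustrated without Lucene. The sketch below is not the attached benchmark: it uses java.util.BitSet as a stand-in for FixedBitSet, all class and method names are made up, and treating the density figures as percentages is an assumption.

```java
import java.util.BitSet;

final class BooleanFilterSketch {
    /** Scans every document's value, as a generic numeric doc-values filter has to. */
    static int countByScanningValues(long[] values, long wanted) {
        int hits = 0;
        for (long v : values) {
            if (v == wanted) {
                hits++;
            }
        }
        return hits;
    }

    /** Walks only the set bits, skipping runs of zeros via nextSetBit. */
    static int countByWalkingBits(BitSet bits) {
        int hits = 0;
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            hits++;
        }
        return hits;
    }

    /** Builds a column in which densityPercent out of every 100 documents have value 1. */
    static long[] syntheticColumn(int maxDoc, int densityPercent) {
        long[] values = new long[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            values[doc] = (doc % 100) < densityPercent ? 1L : 0L;
        }
        return values;
    }

    /** Converts a 0/1 column into a bit set with one bit per document. */
    static BitSet toBitSet(long[] values) {
        BitSet bits = new BitSet(values.length);
        for (int doc = 0; doc < values.length; doc++) {
            if (values[doc] == 1L) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

Both counting methods return the same number for the same column; only the amount of work per document differs, which is what a benchmark of this kind would measure.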

!results.png!


[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Attachment: LUCENE-8689.patch

> Boolean DocValues Codec Implementation
> --
>
> Key: LUCENE-8689
> URL: https://issues.apache.org/jira/browse/LUCENE-8689
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Ivan Mamontov
>Priority: Minor
>  Labels: patch, performance
> Attachments: LUCENE-8689.patch, results.png
>
>
> To avoid issues where some products become available/unavailable at some 
> point in time after being out-of-stock, e-commerce search system designers 
> need to embed up-to-date information about inventory availability right into 
> the search engines. Key requirement is to be able to accurately filter out 
> unavailable products and use availability as one of ranking signals. However, 
> keeping availability data up-to-date is a non-trivial task. A straightforward 
> implementation based on partial updates of Lucene documents causes Solr 
> cache thrashing, which degrades query performance and resource 
> utilization.
>  As an alternative solution we can use DocValues and built-in in-place 
> updates where field values can be independently updated without touching 
> inverted index, and while filtering by DocValues is a bit slower, overall 
> performance gain is better. However existing long based docValues are not 
> sufficiently optimized for carrying boolean inventory availability data:
>  * All DocValues queries are internally rewritten into 
> org.apache.lucene.search.DocValuesNumbersQuery which is based on direct 
> iteration over all column values and typically much slower than using 
> TermsQuery.
>  * On every commit/merge the codec has to iterate over DocValues a couple of 
> times in order to choose the best compression algorithm suitable for the 
> given data. As a result, for 4K fields and 3M max doc, a merge takes more 
> than 10 minutes.
> This issue is intended to solve these limitations via special bitwise doc 
> values format that uses internal representation of 
> org.apache.lucene.util.FixedBitSet in order to store indexed values and load 
> them at search time as a simple long array without additional decoding. There 
> are several reasons for this:
>  * At index time encoding is very fast, with no superfluous iterations over 
> all values to choose the best compression algorithm suitable for the given 
> data.
>  * At query time decoding is also simple and fast, with no GC pressure and no 
> extra steps.
>  * The internal representation allows random access in constant time.
> Limitations are:
>  * Does not support non boolean fields
>  * Boolean fields must be represented as long values 1 for true and 0 for 
> false
>  * Current implementation does not support advanced bit set formats like 
> org.apache.lucene.util.SparseFixedBitSet or 
> org.apache.lucene.util.RoaringDocIdSet
> In order to evaluate the performance gain I've written a simple JMH-based 
> benchmark which allows estimating the relative cost of DF filters. This benchmark 
> creates 2 000 000 documents with 5 boolean columns with different density, 
> where 10, 35, 50, 60 and 90 is an amount of documents with value 1. Each 
> method tries to enumerate over all values in synthetic store field in all 
> available ways:
>  * baseline – in almost all cases Solr uses FixedBitSet in filter cache to 
> keep store availability. This test just iterates over all bits.
>  * docValuesRaw – iterates over all values of DV column, the same code is 
> used in "post filtering", sorting and faceting.
>  * docValuesNumbersQuery – iterates over all values produced by the 
> query/filter store:1. DocValuesNumbersQuery is currently the only query 
> implementation for DV-based fields, which means that Lucene rewrites all 
> term, range and filter queries on non-indexed fields into this fallback 
> implementation.
>  * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery 
> that supports only two values – 0/1
> !results.png!






[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 
To avoid issues where some products become available/unavailable at some point 
in time after being out-of-stock, e-commerce search system designers need to 
embed up-to-date information about inventory availability right into the search 
engines. Key requirement is to be able to accurately filter out unavailable 
products and use availability as one of ranking signals. However, keeping 
availability data up-to-date is a non-trivial task. A straightforward 
implementation based on partial updates of Lucene documents causes Solr cache 
thrashing, which degrades query performance and resource utilization.
 As an alternative solution we can use DocValues and built-in in-place updates 
where field values can be independently updated without touching inverted 
index, and while filtering by DocValues is a bit slower, overall performance 
gain is better. However existing long based docValues are not sufficiently 
optimized for carrying boolean inventory availability data:
 * All DocValues queries are internally rewritten into 
org.apache.lucene.search.DocValuesNumbersQuery which is based on direct 
iteration over all column values and typically much slower than using 
TermsQuery.
 * On every commit/merge the codec has to iterate over DocValues a couple of 
times in order to choose the best compression algorithm suitable for the given 
data. As a result, for 4K fields and 3M max doc, a merge takes more than 10 
minutes.

This issue is intended to solve these limitations via special bitwise doc 
values format that uses internal representation of 
org.apache.lucene.util.FixedBitSet in order to store indexed values and load 
them at search time as a simple long array without additional decoding. There 
are several reasons for this:
 * At index time encoding is very fast, with no superfluous iterations over all 
values to choose the best compression algorithm suitable for the given data.
 * At query time decoding is also simple and fast, with no GC pressure and no 
extra steps.
 * The internal representation allows random access in constant time.

Limitations are:
 * Does not support non boolean fields
 * Boolean fields must be represented as long values 1 for true and 0 for false
 * Current implementation does not support advanced bit set formats like 
org.apache.lucene.util.SparseFixedBitSet or 
org.apache.lucene.util.RoaringDocIdSet

In order to evaluate the performance gain I've written a simple JMH-based benchmark 
which allows estimating the relative cost of DF filters. This benchmark creates 
2 000 000 documents with 5 boolean columns with different density, where 10, 
35, 50, 60 and 90 is an amount of documents with value 1. Each method tries to 
enumerate over all values in synthetic store field in all available ways:
 * baseline – in almost all cases Solr uses FixedBitSet in filter cache to keep 
store availability. This test just iterates over all bits.
 * docValuesRaw – iterates over all values of DV column, the same code is used 
in "post filtering", sorting and faceting.
 * docValuesNumbersQuery – iterates over all values produced by the query/filter 
store:1. DocValuesNumbersQuery is currently the only query implementation for 
DV-based fields, which means that Lucene rewrites all term, range and filter 
queries on non-indexed fields into this fallback implementation.
 * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery that 
supports only two values – 0/1

!results.png!


[jira] [Updated] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-8689:
--
Description: 
To avoid issues where some products become available/unavailable at some point 
in time after being out-of-stock, e-commerce search system designers need to 
embed up-to-date information about inventory availability right into the search 
engines. Key requirement is to be able to accurately filter out unavailable 
products and use availability as one of ranking signals. However, keeping 
availability data up-to-date is a non-trivial task. A straightforward 
implementation based on partial updates of Lucene documents causes Solr cache 
thrashing, which degrades query performance and resource utilization.
 As an alternative solution we can use DocValues and built-in in-place updates 
where field values can be independently updated without touching inverted 
index, and while filtering by DocValues is a bit slower, overall performance 
gain is better. However existing long based docValues are not sufficiently 
optimized for carrying boolean inventory availability data:
 * All DocValues queries are internally rewritten into 
org.apache.lucene.search.DocValuesNumbersQuery which is based on direct 
iteration over all column values and typically much slower than using 
TermsQuery.
 * On every commit/merge the codec has to iterate over DocValues a couple of 
times in order to choose the best compression algorithm suitable for the given 
data. As a result, for 4K fields and 3M max doc, a merge takes more than 10 
minutes.

This issue is intended to solve these limitations via special bitwise doc 
values format that uses internal representation of 
org.apache.lucene.util.FixedBitSet in order to store indexed values and load 
them at search time as a simple long array without additional decoding. There 
are several reasons for this:
 * At index time encoding is very fast, with no superfluous iterations over all 
values to choose the best compression algorithm suitable for the given data.
 * At query time decoding is also simple and fast, with no GC pressure and no 
extra steps.
 * The internal representation allows random access in constant time.

Limitations are:
 * Does not support non boolean fields
 * Boolean fields must be represented as long values 1 for true and 0 for false
 * Current implementation does not support advanced bit set formats like 
org.apache.lucene.util.SparseFixedBitSet or 
org.apache.lucene.util.RoaringDocIdSet

In order to evaluate the performance gain I've written a simple JMH-based benchmark 
which allows estimating the relative cost of DF filters. This benchmark creates 
2 000 000 documents with 5 boolean columns with different density, where 10, 
35, 50, 60 and 90 is an amount of documents with value 1. Each method tries to 
enumerate over all values in synthetic store field in all available ways:
 * baseline – in almost all cases Solr uses FixedBitSet in filter cache to keep 
store availability. This test just iterates over all bits.
 * docValuesRaw – iterates over all values of DV column, the same code is used 
in "post filtering", sorting and faceting.
 * docValuesNumbersQuery – iterates over all values produced by the query/filter 
store:1. DocValuesNumbersQuery is currently the only query implementation for 
DV-based fields, which means that Lucene rewrites all term, range and filter 
queries on non-indexed fields into this fallback implementation.
 * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery that 
supports only two values – 0/1

!results.png|thumbnail!


[jira] [Created] (LUCENE-8689) Boolean DocValues Codec Implementation

2019-02-10 Thread Ivan Mamontov (JIRA)
Ivan Mamontov created LUCENE-8689:
-

 Summary: Boolean DocValues Codec Implementation
 Key: LUCENE-8689
 URL: https://issues.apache.org/jira/browse/LUCENE-8689
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Ivan Mamontov
 Attachments: results.png

To avoid issues where some products become available/unavailable at some point 
in time after being out-of-stock, e-commerce search system designers need to 
embed up-to-date information about inventory availability right into the search 
engines. Key requirement is to be able to accurately filter out unavailable 
products and use availability as one of ranking signals. However, keeping 
availability data up-to-date is a non-trivial task. A straightforward 
implementation based on partial updates of Lucene documents causes Solr cache 
thrashing, which degrades query performance and resource utilization.
 As an alternative solution we can use DocValues and built-in in-place updates 
where field values can be independently updated without touching inverted 
index, and while filtering by DocValues is a bit slower, overall performance 
gain is better. However existing long based docValues are not sufficiently 
optimized for carrying boolean inventory availability data:
 * All DocValues queries are internally rewritten into 
org.apache.lucene.search.DocValuesNumbersQuery which is based on direct 
iteration over all column values and typically much slower than using 
TermsQuery.
 * On every commit/merge the codec has to iterate over DocValues a couple of 
times in order to choose the best compression algorithm suitable for the given 
data. As a result, for 4K fields and 3M max doc, a merge takes more than 10 
minutes.

This issue is intended to solve these limitations via special bitwise doc 
values format that uses internal representation of 
org.apache.lucene.util.FixedBitSet in order to store indexed values and load 
them at search time as a simple long array without additional decoding. There 
are several reasons for this:
 * At index time encoding is very fast, with no superfluous iterations over all 
values to choose the best compression algorithm suitable for the given data.
 * At query time decoding is also simple and fast, with no GC pressure and no 
extra steps.
 * The internal representation allows random access in constant time.

Limitations are:
 * Does not support non boolean fields
 * Boolean fields must be represented as long values 1 for true and 0 for false
 * Current implementation does not support advanced bit set formats like 
org.apache.lucene.util.SparseFixedBitSet or 
org.apache.lucene.util.RoaringDocIdSet

In order to evaluate the performance gain I've written a simple JMH-based benchmark 
which allows estimating the relative cost of DF filters. This benchmark creates 
2 000 000 documents with 5 boolean columns with different density, where 10, 
35, 50, 60 and 90 is an amount of documents with value 1. Each method tries to 
enumerate over all values in synthetic store field in all available ways:
 * baseline – in almost all cases Solr uses FixedBitSet in filter cache to keep 
store availability. This test just iterates over all bits.
 * docValuesRaw – iterates over all values of DV column, the same code is used 
in "post filtering", sorting and faceting.
 * docValuesNumbersQuery – iterates over all values produced by the query/filter 
store:1. DocValuesNumbersQuery is currently the only query implementation for 
DV-based fields, which means that Lucene rewrites all term, range and filter 
queries on non-indexed fields into this fallback implementation.
 * docValuesBooleanQuery – an optimized variant of DocValuesNumbersQuery that 
supports only two values – 0/1

!results.png!






[jira] [Updated] (SOLR-10691) Allow to not commit index on core close

2017-05-18 Thread Ivan Mamontov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated SOLR-10691:
-
Attachment: SOLR-10691.patch

> Allow to not commit index on core close
> ---
>
> Key: SOLR-10691
> URL: https://issues.apache.org/jira/browse/SOLR-10691
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ivan Mamontov
>Priority: Trivial
> Attachments: SOLR-10691.patch
>
>
> As a Solr user I would like to avoid unnecessary commits into Solr/Lucene 
> index on {{org.apache.solr.update.SolrIndexWriter#close}} in case IW has 
> uncommitted changes.
> In {{org.apache.lucene.index.IndexWriterConfig}}(LUCENE-5871) there is a 
> property which is currently used to decide whether to commit or discard 
> uncommitted changes  when you call close(). Unfortunately Solr does not 
> support this property in {{org.apache.solr.update.SolrIndexConfig}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10691) Allow to not commit index on core close

2017-05-18 Thread Ivan Mamontov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated SOLR-10691:
-
Attachment: (was: SOLR-10691.patch)

> Allow to not commit index on core close
> ---
>
> Key: SOLR-10691
> URL: https://issues.apache.org/jira/browse/SOLR-10691
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ivan Mamontov
>Priority: Trivial
>
> As a Solr user I would like to avoid unnecessary commits into Solr/Lucene 
> index on {{org.apache.solr.update.SolrIndexWriter#close}} in case IW has 
> uncommitted changes.
> In {{org.apache.lucene.index.IndexWriterConfig}}(LUCENE-5871) there is a 
> property which is currently used to decide whether to commit or discard 
> uncommitted changes  when you call close(). Unfortunately Solr does not 
> support this property in {{org.apache.solr.update.SolrIndexConfig}}






[jira] [Commented] (SOLR-10691) Allow to not commit index on core close

2017-05-18 Thread Ivan Mamontov (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015828#comment-16015828
 ] 

Ivan Mamontov commented on SOLR-10691:
--

Erick,
 
In our case we have an old-school master-slave cluster with a dedicated master. 
We also have very strict consistency requirements: no implicit commits or 
rollbacks are allowed, in order to maintain the integrity of the data 
(all-or-nothing principle).

* begin the transaction
* delete all documents from index
* index a huge set of documents/collections 
* if no errors occur then commit the transaction else roll back the transaction
* replicate index to slave nodes
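
The all-or-nothing steps above can be sketched roughly as follows. This is a toy model only: ToyIndex and FullRebuild are made-up names that stand in for the real index and indexing job, and replication is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an index that separates pending changes from the committed state. */
final class ToyIndex {
    private List<String> committed = new ArrayList<>();
    private List<String> pending = new ArrayList<>();

    void deleteAll() {
        pending.clear();
    }

    void add(String doc) {
        if (doc == null) {
            throw new IllegalArgumentException("null document");
        }
        pending.add(doc);
    }

    /** Makes the pending state visible. */
    void commit() {
        committed = new ArrayList<>(pending);
    }

    /** Discards pending changes and returns to the last committed state. */
    void rollback() {
        pending = new ArrayList<>(committed);
    }

    List<String> visibleDocs() {
        return new ArrayList<>(committed);
    }
}

final class FullRebuild {
    /** Returns true if the new data set was committed, false if it was rolled back. */
    static boolean rebuild(ToyIndex index, List<String> docs) {
        try {
            index.deleteAll();
            for (String doc : docs) {
                index.add(doc);
            }
            index.commit();
            return true;
        } catch (RuntimeException e) {
            index.rollback();
            return false;
        }
    }
}
```

The point of the sketch is that nothing becomes visible unless the explicit commit() is reached; an implicit commit anywhere else would break that guarantee.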

So now, in order to support transactions during a maintenance window or in case 
of any issue on the master, we have to:
* make a backup of all cores
* restart server as fast as possible
* restore index

As you can see, this process is very fragile and has a high maintenance cost.

Regarding the change: from my point of view it is safe and backward 
compatible, since it just allows configuring an existing option of the Lucene 
index writer. This option is fairly well tested in Lucene, and in my 
understanding a happy-path test with a simple scenario is enough.

> Allow to not commit index on core close
> ---
>
> Key: SOLR-10691
> URL: https://issues.apache.org/jira/browse/SOLR-10691
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ivan Mamontov
>Priority: Trivial
>
> As a Solr user I would like to avoid unnecessary commits into Solr/Lucene 
> index on {{org.apache.solr.update.SolrIndexWriter#close}} in case IW has 
> uncommitted changes.
> In {{org.apache.lucene.index.IndexWriterConfig}}(LUCENE-5871) there is a 
> property which is currently used to decide whether to commit or discard 
> uncommitted changes  when you call close(). Unfortunately Solr does not 
> support this property in {{org.apache.solr.update.SolrIndexConfig}}






[jira] [Updated] (SOLR-10691) Allow to not commit index on core close

2017-05-15 Thread Ivan Mamontov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated SOLR-10691:
-
Attachment: SOLR-10691.patch

Here is a patch without full test coverage, I'll update it later.

> Allow to not commit index on core close
> ---
>
> Key: SOLR-10691
> URL: https://issues.apache.org/jira/browse/SOLR-10691
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ivan Mamontov
>Priority: Trivial
> Attachments: SOLR-10691.patch
>
>
> As a Solr user I would like to avoid unnecessary commits into Solr/Lucene 
> index on {{org.apache.solr.update.SolrIndexWriter#close}} in case IW has 
> uncommitted changes.
> In {{org.apache.lucene.index.IndexWriterConfig}}(LUCENE-5871) there is a 
> property which is currently used to decide whether to commit or discard 
> uncommitted changes  when you call close(). Unfortunately Solr does not 
> support this property in {{org.apache.solr.update.SolrIndexConfig}}






[jira] [Updated] (SOLR-10691) Allow to not commit index on core close

2017-05-15 Thread Ivan Mamontov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated SOLR-10691:
-
Security: (was: Public)

> Allow to not commit index on core close
> ---
>
> Key: SOLR-10691
> URL: https://issues.apache.org/jira/browse/SOLR-10691
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ivan Mamontov
>Priority: Trivial
>
> As a Solr user I would like to avoid unnecessary commits into Solr/Lucene 
> index on {{org.apache.solr.update.SolrIndexWriter#close}} in case IW has 
> uncommitted changes.
> In {{org.apache.lucene.index.IndexWriterConfig}}(LUCENE-5871) there is a 
> property which is currently used to decide whether to commit or discard 
> uncommitted changes  when you call close(). Unfortunately Solr does not 
> support this property in {{org.apache.solr.update.SolrIndexConfig}}






[jira] [Created] (SOLR-10691) Allow to not commit index on core close

2017-05-15 Thread Ivan Mamontov (JIRA)
Ivan Mamontov created SOLR-10691:


 Summary: Allow to not commit index on core close
 Key: SOLR-10691
 URL: https://issues.apache.org/jira/browse/SOLR-10691
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Ivan Mamontov
Priority: Trivial


As a Solr user I would like to avoid unnecessary commits into Solr/Lucene index 
on {{org.apache.solr.update.SolrIndexWriter#close}} in case IW has uncommitted 
changes.
In {{org.apache.lucene.index.IndexWriterConfig}}(LUCENE-5871) there is a 
property which is currently used to decide whether to commit or discard 
uncommitted changes  when you call close(). Unfortunately Solr does not support 
this property in {{org.apache.solr.update.SolrIndexConfig}}
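
To make the requested behavior concrete, here is a toy model of the commit-or-discard decision taken in close(). ToyWriter is a made-up class and not Lucene's IndexWriter; as far as I know the corresponding Lucene setting is IndexWriterConfig.setCommitOnClose(boolean), but the sketch does not depend on it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy writer: models only what happens to uncommitted changes when close() is called. */
final class ToyWriter implements AutoCloseable {
    private final List<String> committed;            // stands in for the on-disk index
    private final List<String> uncommitted = new ArrayList<>();
    private final boolean commitOnClose;

    ToyWriter(List<String> committed, boolean commitOnClose) {
        this.committed = committed;
        this.commitOnClose = commitOnClose;
    }

    void add(String doc) {
        uncommitted.add(doc);
    }

    void commit() {
        committed.addAll(uncommitted);
        uncommitted.clear();
    }

    @Override
    public void close() {
        if (commitOnClose) {
            commit();            // changes are committed implicitly on close
        } else {
            uncommitted.clear(); // uncommitted changes are discarded on close
        }
    }
}
```

With commitOnClose set to false, closing the writer after a failed or interrupted indexing run leaves the committed state untouched, which is the behavior this issue asks Solr to make configurable.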






[jira] [Issue Comment Deleted] (LUCENE-7260) StandardQueryParser is over 100 times slower in v5 compared to v3

2016-04-28 Thread Ivan Mamontov (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mamontov updated LUCENE-7260:
--
Comment: was deleted

(was: I do not recommend using YourKit anywhere, especially in microbenchmarks. 
According to JMC (-XX:+UnlockCommercialFeatures -XX:+UnlockDiagnosticVMOptions 
-XX:+DebugNonSafepoints -XX:+FlightRecorder 
-XX:StartFlightRecording=duration=60s,filename=myrecording.jfr) the hottest 
method is 
org.apache.lucene.queryparser.flexible.core.nodes.QueryNodeImpl.removeChildren(QueryNode)

See details here https://issues.apache.org/jira/browse/LUCENE-5099)

> StandardQueryParser is over 100 times slower in v5 compared to v3
> -
>
> Key: LUCENE-7260
> URL: https://issues.apache.org/jira/browse/LUCENE-7260
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/queryparser
>Affects Versions: 5.4.1
> Environment: Java 8u51
>Reporter: Trejkaz
>  Labels: performance
>
> The following test code times parsing a large query.
> {code}
> import org.apache.lucene.analysis.KeywordAnalyzer;
> //import org.apache.lucene.analysis.core.KeywordAnalyzer;
> import org.apache.lucene.queryParser.standard.StandardQueryParser;
> //import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
> import org.apache.lucene.search.BooleanQuery;
>
> public class LargeQueryTest {
>     public static void main(String[] args) throws Exception {
>         BooleanQuery.setMaxClauseCount(50_000);
>         StringBuilder builder = new StringBuilder(50_000*10);
>         builder.append("id:( ");
>         boolean first = true;
>         for (int i = 0; i < 50_000; i++) {
>             if (first) {
>                 first = false;
>             } else {
>                 builder.append(" OR ");
>             }
>             builder.append(String.valueOf(i));
>         }
>         builder.append(" )");
>         String queryString = builder.toString();
>         StandardQueryParser parser2 = new StandardQueryParser(new KeywordAnalyzer());
>         for (int i = 0; i < 10; i++) {
>             long t0 = System.currentTimeMillis();
>             parser2.parse(queryString, "nope");
>             long t1 = System.currentTimeMillis();
>             System.out.println(t1-t0);
>         }
>     }
> }
> {code}
> For Lucene 3.6.2, the timings settle down to 200~300 ms with the fastest being 
> 207 ms.
> For Lucene 5.4.1, the timings settle down to 20000~30000 ms with the fastest 
> being 22444 ms.
> So at some point, some change made the query parser 100 times slower. I would 
> suspect that it has something to do with how the list of children is now 
> handled. Every time someone gets the children, it copies the list. Every time 
> someone sets the children, it walks through to detach parent references and 
> then reattaches them all again.
> If it were me, I would probably make these collections immutable so that I 
> didn't have to defensively copy them.
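
The cost pattern suspected in the description above can be illustrated with a toy model. The sketch below is not Lucene's QueryNodeImpl: the names ChildListSketch, CopyingNode, AppendingNode and touched are hypothetical, and it assumes children are added one at a time through a get-copy/set cycle, which may not match what the parser actually does. It only counts per-child operations to show why copy-on-get plus detach-and-reattach-on-set can grow quadratically, while a read-only view with a plain append stays linear.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT Lucene's QueryNodeImpl.
class ChildListSketch {

    /** Copies on every read and detaches/reattaches every child on every write. */
    static final class CopyingNode {
        CopyingNode parent;
        private final List<CopyingNode> children = new ArrayList<>();
        long touched; // number of per-child operations performed on this node

        List<CopyingNode> getChildren() {
            touched += children.size();          // the defensive copy visits every child
            return new ArrayList<>(children);
        }

        void set(List<CopyingNode> newChildren) {
            for (CopyingNode c : children) {     // detach all current children
                c.parent = null;
                touched++;
            }
            children.clear();
            for (CopyingNode c : newChildren) {  // reattach the full new list
                c.parent = this;
                children.add(c);
                touched++;
            }
        }
    }

    /** Exposes a read-only view, so reads need no copy and an append touches one child. */
    static final class AppendingNode {
        AppendingNode parent;
        private final List<AppendingNode> children = new ArrayList<>();
        long touched;

        List<AppendingNode> getChildren() {
            return Collections.unmodifiableList(children); // a view, not a copy
        }

        void add(AppendingNode child) {
            child.parent = this;
            children.add(child);
            touched++;
        }
    }

    /** Adds n children via get-copy, append, set; returns the per-child operation count. */
    static long buildByCopying(int n) {
        CopyingNode root = new CopyingNode();
        for (int i = 0; i < n; i++) {
            List<CopyingNode> kids = root.getChildren();
            kids.add(new CopyingNode());
            root.set(kids);
        }
        return root.touched;
    }

    /** Adds n children one at a time; returns the per-child operation count. */
    static long buildByAppending(int n) {
        AppendingNode root = new AppendingNode();
        for (int i = 0; i < n; i++) {
            root.add(new AppendingNode());
        }
        return root.touched;
    }

    public static void main(String[] args) {
        int n = 5_000; // a tenth of the clause count in the test query, to keep the run short
        System.out.println("copying:   " + buildByCopying(n));
        System.out.println("appending: " + buildByAppending(n));
    }
}
```

In this model the copying variant performs 3n(n-1)/2 + n per-child operations for n children, against n for the appending variant. The counts say nothing about the real parser's timings; they only show the shape of the growth.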






[jira] [Commented] (LUCENE-7260) StandardQueryParser is over 100 times slower in v5 compared to v3

2016-04-28 Thread Ivan Mamontov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262614#comment-15262614
 ] 

Ivan Mamontov commented on LUCENE-7260:
---

I do not recommend using YourKit anywhere, especially in microbenchmarks. 
According to JMC (-XX:+UnlockCommercialFeatures -XX:+UnlockDiagnosticVMOptions 
-XX:+DebugNonSafepoints -XX:+FlightRecorder 
-XX:StartFlightRecording=duration=60s,filename=myrecording.jfr), the hottest 
method is 
org.apache.lucene.queryparser.flexible.core.nodes.QueryNodeImpl.removeChildren(QueryNode).

See details here: https://issues.apache.org/jira/browse/LUCENE-5099



