[jira] [Commented] (LUCENE-7253) Sparse data in doc values and segments merging

David Smiley (JIRA) Mon, 02 May 2016 07:49:39 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266716#comment-15266716
 ]


David Smiley commented on LUCENE-7253:
--------------------------------------

bq. The current "sparse" optimizations make the index smaller, but make access 
slower (log N). We shouldn't do things like this. We should make things faster.

This issue is not about access time or index size; indeed, access-time speed is 
something we hold sacred.  LUCENE-6863, which you refer to, was a delicate 
balancing act.

I'm trying to understand your point of view better... which is hard because I'm 
being told something that I think is preposterous. Maybe I'm misunderstanding 
what you even mean by the term "abuse case".  Let's try to communicate with 
words we hopefully both understand...

Do you believe that it's very _rare_ to populate fields sparsely, even those 
flagged as DocValues?  Without much thought, I think probably half the search 
apps I know have at least one docValues field that isn't fully dense.  Yonik is 
basically saying the same.  It isn't rare; I think it's dubious to claim doing 
this is an abuse if it's popular to do it.  So I don't think you mean that.  

Maybe you simply mean that the DocValues API itself shouldn't/doesn't _cater_ 
to sparsity, even though sparsity is allowed and you understand that it is 
popular and useful for some use-cases nonetheless? After all, 
{{DocValues.docsWithValue}} is part of our API, as are other methods that 
return -1 when there's no value -- it's _supported_.  I agree with this; do you 
agree with that statement too?  Pawel proposes a change to an internal class 
that, assuming benchmarks bear it out, will benefit a _supported_ 
capability of DocValues.  If it has no technical down-sides (these are 
hypotheticals to be proven out first), isn't vetoing now premature?

bq. This means: {{if (abusecase)}}

What is the abusecase condition?  A condition that is true if dense and false 
if sparse?  NumericDocValuesWriter _already has_ such conditions -- see the 
hole-filling loop, whose condition must be evaluated for every sparse 
hole.  Also see the FixedBitSet docsWithField, which only exists due to the 
sparsity notion.  Are these now set in stone, not to be changed because you say 
so?  But what of it anyway -- what's wrong?  I know that, if you had the 
time/inclination or could somehow tell other contributors what to do, it 
would be solved in another way (RLE), but why stop someone willing to donate 
their time to solve it a different way?  It would not prevent an RLE API from 
appearing later, assuming RLE turns out to be better, right?
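
To make those two existing sparsity provisions concrete, here is a minimal 
sketch of a dense buffer that fills holes and tracks which documents really 
have a value. The class and its methods are hypothetical stand-ins (plain 
java.util types in place of PackedLongValues and FixedBitSet, and MISSING 
assumed to be 0); it is not the actual NumericDocValuesWriter code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for a dense doc-values buffer; not Lucene code. */
final class DenseBufferSketch {
  // Placeholder stored for documents without a value (assumed to be 0 here).
  static final long MISSING = 0L;

  private final List<Long> pending = new ArrayList<>(); // stand-in for PackedLongValues
  private final BitSet docsWithField = new BitSet();    // stand-in for FixedBitSet
  private long holesFilled = 0;

  /** Buffers a value; docIDs must arrive in increasing order. */
  void addValue(int docID, long value) {
    // Fill in any holes: one placeholder per skipped document.
    for (int i = pending.size(); i < docID; ++i) {
      pending.add(MISSING);
      holesFilled++;
    }
    pending.add(value);
    // Needed because a stored 0 is indistinguishable from MISSING otherwise.
    docsWithField.set(docID);
  }

  long holesFilled() { return holesFilled; }

  int bufferedSlots() { return pending.size(); }

  boolean hasValue(int docID) { return docsWithField.get(docID); }

  long get(int docID) {
    return docID < pending.size() ? pending.get(docID) : MISSING;
  }
}
```

Adding values for only docs 2 and 999 still buffers 1000 slots, 998 of them 
placeholders, so the work is proportional to the highest docID seen rather 
than to the number of values.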

> Sparse data in doc values and segments merging 
> -----------------------------------------------
>
>                 Key: LUCENE-7253
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7253
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 5.5, 6.0
>            Reporter: Pawel Rog
>              Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still a big problem with Doc Values merges for sparse 
> fields. For an index of 1 billion documents, it doesn't seem to matter 
> whether all documents have a value for this field or only 1 document has a 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem, but there are several cases in which one can expect to have 
> many fields with sparse doc values.
> I can describe an example. During performance tests of a system with a large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only a small 
> subset of all fields. An average document contains 5-7 different numeric 
> values. As you can see, the data was very sparse in these fields. It turned 
> out that the ingestion process was CPU-bound. Most of the CPU time was spent 
> in DocValues related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during segment merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing 
> them with a smaller number of denser fields. This helped a lot but 
> complicated field naming. 
> I am not very familiar with the Doc Values source code, but I have a small 
> suggestion for how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take the example 
> of numeric Doc Values. Would it be possible to replace the Iterator which 
> "travels" through all documents with an Iterator over the collection of 
> non-empty values? Of course this would require storing an object (instead of 
> a numeric) which contains the value and document ID. Such an iterator could 
> significantly improve merge time of sparse Doc Values fields. IMHO this won't 
> cause big overhead for dense structures, but it can be a game changer for 
> sparse structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
>     dvConsumer.addNumericField(fieldInfo,
>                                new Iterable<Number>() {
>                                  @Override
>                                  public Iterator<Number> iterator() {
>                                    return new NumericIterator(maxDoc, values, docsWithField);
>                                  }
>                                });
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
>     // Fill in any holes:
>     for (int i = (int)pending.size(); i < docID; ++i) {
>       pending.add(MISSING);
>     }
> {code}
> It turns out that the variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is a PackedLongValues and it wouldn't 
> be good to replace it with a different class (some kind of list) because this 
> may break DV performance for dense fields. I hope someone can suggest 
> interesting solutions for this problem :).
> It would be great if a discussion about sparse Doc Values merge performance 
> could start here.
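
The iterator idea in the issue description above, visiting only documents 
that have a value, could look roughly like the sketch below. Everything in it 
is hypothetical: the class name, the parallel-array layout, and the 
simplified merge (which ignores deleted documents) are assumptions for 
illustration, not an existing Lucene API or the reporter's patch.

```java
/** Hypothetical sketch of iterating only the documents that have a value. */
final class SparseNumericSketch {
  // Sentinel returned once the iterator is exhausted.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docIDs;  // ascending docIDs that have a value
  private final long[] values; // values[i] belongs to docIDs[i]
  private int pos = -1;

  SparseNumericSketch(int[] docIDs, long[] values) {
    if (docIDs.length != values.length) {
      throw new IllegalArgumentException("docIDs and values must have equal length");
    }
    this.docIDs = docIDs;
    this.values = values;
  }

  /** Advances to the next document with a value, or returns NO_MORE_DOCS. */
  int nextDoc() {
    if (pos < docIDs.length) {
      pos++;
    }
    return pos < docIDs.length ? docIDs[pos] : NO_MORE_DOCS;
  }

  /** Value of the current document; only valid after nextDoc() returned a docID. */
  long value() {
    return values[pos];
  }

  /** Concatenates segments, shifting each segment's docIDs by its docBase. */
  static SparseNumericSketch merge(SparseNumericSketch[] segments, int[] docBases) {
    int total = 0;
    for (SparseNumericSketch s : segments) {
      total += s.docIDs.length;
    }
    int[] mergedDocs = new int[total];
    long[] mergedValues = new long[total];
    int out = 0;
    for (int seg = 0; seg < segments.length; seg++) {
      SparseNumericSketch s = segments[seg];
      // Loop length is the number of values, not the segment's maxDoc.
      for (int i = 0; i < s.docIDs.length; i++) {
        mergedDocs[out] = docBases[seg] + s.docIDs[i];
        mergedValues[out] = s.values[i];
        out++;
      }
    }
    return new SparseNumericSketch(mergedDocs, mergedValues);
  }
}
```

In this layout a merge of two 1000-document segments holding three values in 
total touches three entries. The trade-off is the extra docID stored per 
value, which is the overhead for dense fields that the description mentions.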



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
