Re: nested documents performance anomaly

2019-04-17 Thread Jeff Wartes

This is more of a solr-user conversation, but one other possibility is that for 
the “1M docs test” you sent 1M insert requests, and for the “1000 parent doc 
test” you sent 1000 insert requests.
Batching multiple documents into a single insert request will yield *much* 
better throughput, and the nested-doc approach essentially forces you to do 
that as a side effect of how the insert request is structured.

So basically Dale’s theory, but applied to the HTTP level instead of the 
segment level.
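
A minimal sketch of the request-count difference, assuming the documents are
sent as JSON arrays (Solr's JSON update handler accepts an array of documents
in one request). The field names and the batch size of 1000 are made up for
illustration, and nothing is actually sent over HTTP here:

```python
import json
from typing import Iterable, Iterator, List


def batched(docs: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size documents."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def update_bodies(docs: Iterable[dict], batch_size: int) -> Iterator[str]:
    """Serialize each batch as one JSON array, i.e. one request body."""
    for batch in batched(docs, batch_size):
        yield json.dumps(batch)


# Hypothetical flat documents; the field names are invented for this sketch.
docs = [{"id": str(i), "title_s": f"doc {i}"} for i in range(10_000)]

one_per_request = sum(1 for _ in update_bodies(docs, batch_size=1))
batched_requests = sum(1 for _ in update_bodies(docs, batch_size=1000))
print(one_per_request, batched_requests)  # 10000 10
```

The same documents get indexed either way; only the number of HTTP round
trips changes.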

Some other random tips for indexing speed:

  *   For hard commits, set openSearcher=false
  *   For soft commits, set the commit interval as large as you can stand.
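
As a rough illustration of those two tips, here is a sketch that builds a
payload for Solr's Config API "set-property" command. The property names are
my assumption of the commonly used ones and the intervals are arbitrary, so
check both against the documentation for your Solr version:

```python
import json


def commit_settings(hard_commit_ms: int, soft_commit_ms: int) -> dict:
    """Build a Config API payload; property names are assumed, not verified."""
    return {
        "set-property": {
            # Hard commits flush to disk but do not open a new searcher.
            "updateHandler.autoCommit.maxTime": hard_commit_ms,
            "updateHandler.autoCommit.openSearcher": False,
            # Soft commits control visibility; a longer interval is cheaper.
            "updateHandler.autoSoftCommit.maxTime": soft_commit_ms,
        }
    }


# Hypothetical intervals: hard commit every 60 s, soft commit every 5 min.
payload = commit_settings(hard_commit_ms=60_000, soft_commit_ms=300_000)
print(json.dumps(payload, indent=2))
```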

-Jeff Wartes

From: Dale Richardson 
Reply-To: "dev@lucene.apache.org" 
Date: Sunday, April 14, 2019 at 3:58 AM
To: "dev@lucene.apache.org" 
Subject: Re: nested documents performance anomaly

Hi Roi,
My understanding of how the nested relationship is implemented in Lucene is 
that the child document references are physically stored in the same index 
segment as the parent document reference.  For normal queries, which index 
segment a document reference is stored in is completely transparent to the 
query result, but the block join operator used for parent-child joins takes 
advantage of this low-level detail to provide super-fast joins between 
parent and child documents.  A trade-off for this technique is that the 
relevant index segment needs to be re-written when any part of the parent-child 
relationship changes.
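
A toy model of the idea, not Lucene's actual code: it assumes each block is
stored contiguously with the children first and the parent last, so a join
from a parent to its children is just a walk backwards to the previous
parent. The `is_parent` flag and the helper names are invented for this
sketch:

```python
from typing import List


def index_block(index: List[dict], parent: dict, children: List[dict]) -> None:
    """Append children first and the parent last, as one contiguous block."""
    for child in children:
        index.append({**child, "is_parent": False})
    index.append({**parent, "is_parent": True})


def children_of(index: List[dict], parent_pos: int) -> List[dict]:
    """Walk back from the parent until the previous parent (or index start)."""
    pos = parent_pos - 1
    found = []
    while pos >= 0 and not index[pos]["is_parent"]:
        found.append(index[pos])
        pos -= 1
    return list(reversed(found))


index: List[dict] = []
index_block(index, {"id": "p1"}, [{"id": "p1-c1"}, {"id": "p1-c2"}])
index_block(index, {"id": "p2"}, [{"id": "p2-c1"}])

p2_pos = next(i for i, d in enumerate(index) if d["id"] == "p2")
print([d["id"] for d in children_of(index, p2_pos)])  # ['p2-c1']
```

No lookup by key is needed, which is why the join is cheap, and also why a
block cannot be edited piecemeal without breaking the layout.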

I suspect that if you are writing all the child documents for a parent, you 
are helpfully batching up all updates to a single index segment into a single 
update, with the resulting increase in speed.
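
To make that concrete, here is a sketch of what one nested update could look
like, assuming Solr's JSON `_childDocuments_` convention for child documents.
The ids, the `type_s` field and the counts are hypothetical:

```python
import json


def nested_doc(parent_id: str, n_children: int) -> dict:
    """One parent carrying all of its children in the same document."""
    return {
        "id": parent_id,
        "type_s": "parent",
        "_childDocuments_": [
            {"id": f"{parent_id}-c{i}", "type_s": "child"}
            for i in range(n_children)
        ],
    }


# Three hypothetical parents with 1000 children each, in one request body.
parents = [nested_doc(f"p{i}", 1000) for i in range(3)]
body = json.dumps(parents)
total = sum(1 + len(p["_childDocuments_"]) for p in parents)
print(len(parents), total)  # 3 3003
```

Each parent necessarily travels with all of its children, so every update
already carries about a thousand documents.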

The constraints that apply in return for this speed boost are that you must have 
all the child documents ready to write in one go, and the index updates are 
likely done in a single transaction for each parent (i.e. all or none).  I 
suspect (but have not tested this) that indexing/storing 1000 child 
documents to 1000 parent documents one document at a time would actually be 
slower than just indexing 1 million documents 1 document at a time.

I hope this increases your understanding of the situation.

Regards,
Dale.

From: Roi Wexler 
Sent: Sunday, 14 April 2019 6:59 AM
To: dev@lucene.apache.org
Subject: nested documents performance anomaly


Hi,
We're in the process of testing Solr for its indexing speed, which is very 
important to our application.
We've witnessed strange behavior that we wish to understand before using it.
When we indexed 1M docs it took about 63 seconds, but when we indexed the same 
documents nested as 1000 parents with 1000 child documents 
each, it took only 27 seconds.

We know that Lucene doesn't support nested documents, since it has a flat object 
model, and we do see that it in fact indexes each of the child documents as 
a separate document.

We have tests that show we get the same results whether we index all documents 
flat (without children) or index them as 1000 parents with 1000 nested 
documents each.

Are we missing something here?
Why does it behave like that?
What kind of constraints do child documents have, or what is the price we pay 
to get this better indexing speed?
We're trying to establish whether this is a valid way to get better 
indexing speed.

Any help would be appreciated.



