[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-05-16 Thread Huy Le (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841902#comment-16841902
 ] 

Huy Le commented on LUCENE-8041:


I don't think that we can avoid paying the additional O(N) memory cost by using 
Stream#sorted trick because under the hood, the stream implementation also 
creates an Array, sort it then copy it into the LinkedHashMap in the terminal 
stage (see SortedOps.SizedRefSortingSink).  Apart from DirectFields, it is hard 
to implement your approach on other cases. The patch I provided has advantage 
that it is simple, clear and safe. 

As we are talking about number of fields, it is unlikely that there is a index 
with million fields perhaps ten of thousands is more realistic.  

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Priority: Major
> Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-05-16 Thread Huy Le (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841075#comment-16841075
 ] 

Huy Le edited comment on LUCENE-8041 at 5/16/19 7:57 AM:
-

I created a patch that use LinkedHashMap to maintain field name to Fields 
mapping.  

[^LUCENE-8041-LinkedHashMap.patch]


was (Author: huyyuh):
I created a patch that use LinkedHashMap to maintain field name to Fields 
mapping.

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Priority: Major
> Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-05-16 Thread Huy Le (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huy Le updated LUCENE-8041:
---
Attachment: LUCENE-8041-LinkedHashMap.patch

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Priority: Major
> Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-05-16 Thread Huy Le (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841075#comment-16841075
 ] 

Huy Le commented on LUCENE-8041:


I created a patch that use LinkedHashMap to maintain field name to Fields 
mapping.

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Priority: Major
> Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8800) FieldsReader#terms poor performance on a index with many field names sharing common prefix

2019-05-15 Thread Huy Le (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huy Le updated LUCENE-8800:
---
Summary: FieldsReader#terms poor performance on a index with many field 
names sharing common prefix  (was: FieldsReader#terms poor performance on a 
index with many fields)

> FieldsReader#terms poor performance on a index with many field names sharing 
> common prefix
> --
>
> Key: LUCENE-8800
> URL: https://issues.apache.org/jira/browse/LUCENE-8800
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Huy Le
>Priority: Major
> Attachments: Screen Shot 2019-05-15 at 5.08.26 pm.png
>
>
> We have experienced poor performance on an index with many fields, their 
> names share common prefix. Sampling stack using jprofiler showed a hotspot on 
> method FieldsReader#terms.
> !Screen Shot 2019-05-15 at 5.08.26 pm.png!
> Looking at source code I have seen that TreeMap is used to map between field 
> name to  FieldsProducer which means a lookup incurs O(logN) comparisons. 
> {code:java}
> private static class FieldsReader extends FieldsProducer {
> ...
> private final Map fields = new TreeMap<>();
> ...
> @Override
> public Terms terms(String field) throws IOException {
>   FieldsProducer fieldsProducer = fields.get(field);
>   return fieldsProducer == null ? null : fieldsProducer.terms(field);
> }
> {code}
> The problem becomes much worse when field names are long and share common 
> prefix because each comparison has to iterate over an entire string.
> In our case, the index has around 6000 fields in form of customfield_*.  I 
> wonder if we can change the TreeMap to HashMap or LinkedHashMap in case we 
> want to preserve the sorted order to improve the situation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8800) FieldsReader#terms poor performance on a index with many fields

2019-05-15 Thread Huy Le (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huy Le updated LUCENE-8800:
---
Attachment: Screen Shot 2019-05-15 at 5.08.26 pm.png

> FieldsReader#terms poor performance on a index with many fields
> ---
>
> Key: LUCENE-8800
> URL: https://issues.apache.org/jira/browse/LUCENE-8800
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Huy Le
>Priority: Major
> Attachments: Screen Shot 2019-05-15 at 5.08.26 pm.png
>
>
> We have experienced poor performance on an index with many fields, their 
> names share common prefix. Sampling stack using jprofiler showed a hotspot on 
> method FieldsReader#terms.
> !Screen Shot 2019-05-15 at 5.08.26 pm.png!
> Looking at source code I have seen that TreeMap is used to map between field 
> name to  FieldsProducer which means a lookup incurs O(logN) comparisons. 
> {code:java}
> private static class FieldsReader extends FieldsProducer {
> ...
> private final Map fields = new TreeMap<>();
> ...
> @Override
> public Terms terms(String field) throws IOException {
>   FieldsProducer fieldsProducer = fields.get(field);
>   return fieldsProducer == null ? null : fieldsProducer.terms(field);
> }
> {code}
> The problem becomes much worse when field names are long and share common 
> prefix because each comparison has to iterate over an entire string.
> In our case, the index has around 6000 fields in form of customfield_*.  I 
> wonder if we can change the TreeMap to HashMap or LinkedHashMap in case we 
> want to preserve the sorted order to improve the situation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8800) FieldsReader#terms poor performance on a index with many fields

2019-05-15 Thread Huy Le (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huy Le updated LUCENE-8800:
---
Attachment: (was: Screen Shot 2019-05-15 at 5.08.26 pm.png)

> FieldsReader#terms poor performance on a index with many fields
> ---
>
> Key: LUCENE-8800
> URL: https://issues.apache.org/jira/browse/LUCENE-8800
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Huy Le
>Priority: Major
>
> We have experienced poor performance on an index with many fields, their 
> names share common prefix. Sampling stack using jprofiler showed a hotspot on 
> method FieldsReader#terms.
> !Screen Shot 2019-05-15 at 5.08.26 pm.png!
> Looking at source code I have seen that TreeMap is used to map between field 
> name to  FieldsProducer which means a lookup incurs O(logN) comparisons. 
> {code:java}
> private static class FieldsReader extends FieldsProducer {
> ...
> private final Map fields = new TreeMap<>();
> ...
> @Override
> public Terms terms(String field) throws IOException {
>   FieldsProducer fieldsProducer = fields.get(field);
>   return fieldsProducer == null ? null : fieldsProducer.terms(field);
> }
> {code}
> The problem becomes much worse when field names are long and share common 
> prefix because each comparison has to iterate over an entire string.
> In our case, the index has around 6000 fields in form of customfield_*.  I 
> wonder if we can change the TreeMap to HashMap or LinkedHashMap in case we 
> want to preserve the sorted order to improve the situation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8800) FieldsReader#terms poor performance on a index with many fields

2019-05-15 Thread Huy Le (JIRA)
Huy Le created LUCENE-8800:
--

 Summary: FieldsReader#terms poor performance on a index with many 
fields
 Key: LUCENE-8800
 URL: https://issues.apache.org/jira/browse/LUCENE-8800
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Affects Versions: 8.0
Reporter: Huy Le
 Attachments: Screen Shot 2019-05-15 at 5.08.26 pm.png

We have experienced poor performance on an index with many fields, their names 
share common prefix. Sampling stack using jprofiler showed a hotspot on method 
FieldsReader#terms.

!Screen Shot 2019-05-15 at 5.08.26 pm.png!

Looking at source code I have seen that TreeMap is used to map between field 
name to  FieldsProducer which means a lookup incurs O(logN) comparisons. 

{code:java}
private static class FieldsReader extends FieldsProducer {
...
private final Map fields = new TreeMap<>();
...
@Override
public Terms terms(String field) throws IOException {
  FieldsProducer fieldsProducer = fields.get(field);
  return fieldsProducer == null ? null : fieldsProducer.terms(field);
}
{code}

The problem becomes much worse when field names are long and share common 
prefix because each comparison has to iterate over an entire string.
In our case, the index has around 6000 fields in form of customfield_*.  I 
wonder if we can change the TreeMap to HashMap or LinkedHashMap in case we want 
to preserve the sorted order to improve the situation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org