from:"Otis Gospodnetic"

Re: [VOTE] Solr to become a top-level Apache project (TLP)

2020-05-15 Thread Otis Gospodnetic

+1

Otis
--
http://sematext.com

> On May 15, 2020, at 15:13, kwatters  wrote:
> 
> -1
> 
> 
> 
> 
> --
> Sent from: 
> https://lucene.472066.n3.nabble.com/Lucene-Java-Developer-f564358.html
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-13434) OpenTracing support for Solr

2019-05-26 Thread Otis Gospodnetic (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848549#comment-16848549
 ] 

Otis Gospodnetic commented on SOLR-13434:
-

[~caomanhdat] Note that OpenTracing has merged with OpenCensus to form 
OpenTelemetry.

> OpenTracing support for Solr
> 
>
> Key: SOLR-13434
> URL: https://issues.apache.org/jira/browse/SOLR-13434
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Shalin Shekhar Mangar
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: SOLR-13434.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [OpenTracing|https://opentracing.io/] is a vendor neutral API and 
> infrastructure for distributed tracing. Many OSS tracers just as Jaeger, 
> OpenZipkin, Apache SkyWalking as well as commercial tools support OpenTracing 
> APIs. Ideally, we can implement it once and have integrations for popular 
> tracers like we have with metrics and prometheus.
> I'm aware of SOLR-9641 but HTrace has since retired from incubator for lack 
> of activity so this is a fresh attempt at solving this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-12765) Possibly incorrect format in JMX cache stats

2018-09-13 Thread Otis Gospodnetic (JIRA)



 [ 
https://issues.apache.org/jira/browse/SOLR-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-12765:

Component/s: metrics

> Possibly incorrect format in JMX cache stats
> 
>
> Key: SOLR-12765
> URL: https://issues.apache.org/jira/browse/SOLR-12765
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 7.4
>Reporter: Bojan Smid
>Priority: Major
>
> I posted a question on ML 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3CCAGniRXR4Ps%3D03X0uiByCn5ecUT2VY4TLV4iNcxCde3dxBnmC-w%40mail.gmail.com%3E|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3CCAGniRXR4Ps%3D03X0uiByCn5ecUT2VY4TLV4iNcxCde3dxBnmC-w%40mail.gmail.com%3E),]
>  , but didn't get feedback. Since it looks like a possible bug, I am opening 
> a ticket.
>  
>   It seems the format of cache mbeans changed with 7.4.0.  And from what I 
> see similar change wasn't made for other mbeans, which may mean it was 
> accidental and may be a bug.
>  
>   In Solr 7.3.* format was (each attribute on its own, numeric type):
>  
> mbean:
> solr:dom1=core,dom2=gettingstarted,dom3=shard1,dom4=replica_n1,category=CACHE,scope=searcher,name=filterCache
>  
> attributes:
>   lookups java.lang.Long = 0
>   hits java.lang.Long = 0
>   cumulative_evictions java.lang.Long = 0
>   size java.lang.Long = 0
>   hitratio java.lang.Float = 0.0
>   evictions java.lang.Long = 0
>   cumulative_lookups java.lang.Long = 0
>   cumulative_hitratio java.lang.Float = 0.0
>   warmupTime java.lang.Long = 0
>   inserts java.lang.Long = 0
>   cumulative_inserts java.lang.Long = 0
>   cumulative_hits java.lang.Long = 0
>  
>   With 7.4.0 there is a single attribute "Value" (java.lang.Object):
>  
> mbean:
> solr:dom1=core,dom2=gettingstarted,dom3=shard1,dom4=replica_n1,category=CACHE,scope=searcher,name=filterCache
>  
> attributes:
>   Value java.lang.Object = \{lookups=0, evictions=0, 
> cumulative_inserts=0, cumulative_hits=0, hits=0, cumulative_evictions=0, 
> size=0, hitratio=0.0, cumulative_lookups=0, cumulative_hitratio=0.0, 
> warmupTime=0, inserts=0}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-12765) Possibly incorrect format in JMX cache stats

2018-09-12 Thread Otis Gospodnetic (JIRA)



[ 
https://issues.apache.org/jira/browse/SOLR-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612047#comment-16612047
 ] 

Otis Gospodnetic commented on SOLR-12765:
-

[~ab]is this a bug?  If so, we could try to get you the patch/PR.

> Possibly incorrect format in JMX cache stats
> 
>
> Key: SOLR-12765
> URL: https://issues.apache.org/jira/browse/SOLR-12765
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.4
>Reporter: Bojan Smid
>Priority: Major
>
> I posted a question on ML 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3CCAGniRXR4Ps%3D03X0uiByCn5ecUT2VY4TLV4iNcxCde3dxBnmC-w%40mail.gmail.com%3E|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3CCAGniRXR4Ps%3D03X0uiByCn5ecUT2VY4TLV4iNcxCde3dxBnmC-w%40mail.gmail.com%3E),]
>  , but didn't get feedback. Since it looks like a possible bug, I am opening 
> a ticket.
>  
>   It seems the format of cache mbeans changed with 7.4.0.  And from what I 
> see similar change wasn't made for other mbeans, which may mean it was 
> accidental and may be a bug.
>  
>   In Solr 7.3.* format was (each attribute on its own, numeric type):
>  
> mbean:
> solr:dom1=core,dom2=gettingstarted,dom3=shard1,dom4=replica_n1,category=CACHE,scope=searcher,name=filterCache
>  
> attributes:
>   lookups java.lang.Long = 0
>   hits java.lang.Long = 0
>   cumulative_evictions java.lang.Long = 0
>   size java.lang.Long = 0
>   hitratio java.lang.Float = 0.0
>   evictions java.lang.Long = 0
>   cumulative_lookups java.lang.Long = 0
>   cumulative_hitratio java.lang.Float = 0.0
>   warmupTime java.lang.Long = 0
>   inserts java.lang.Long = 0
>   cumulative_inserts java.lang.Long = 0
>   cumulative_hits java.lang.Long = 0
>  
>   With 7.4.0 there is a single attribute "Value" (java.lang.Object):
>  
> mbean:
> solr:dom1=core,dom2=gettingstarted,dom3=shard1,dom4=replica_n1,category=CACHE,scope=searcher,name=filterCache
>  
> attributes:
>   Value java.lang.Object = \{lookups=0, evictions=0, 
> cumulative_inserts=0, cumulative_hits=0, hits=0, cumulative_evictions=0, 
> size=0, hitratio=0.0, cumulative_lookups=0, cumulative_hitratio=0.0, 
> warmupTime=0, inserts=0}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-8274) Add per-request MDC logging based on user-provided value.

2018-05-01 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460158#comment-16460158
 ] 

Otis Gospodnetic commented on SOLR-8274:


Perhaps a more modern way to approach this is to instrument Solr.  OpenTracing 
comes to mind.  See 
[https://sematext.com/blog/opentracing-distributed-tracing-emerging-industry-standard/]
 for a quick overview.  See also [https://github.com/opentracing-contrib] 

> Add per-request MDC logging based on user-provided value.
> -
>
> Key: SOLR-8274
> URL: https://issues.apache.org/jira/browse/SOLR-8274
> Project: Solr
>  Issue Type: Improvement
>  Components: logging
>Reporter: Jason Gerlowski
>Priority: Minor
> Attachments: SOLR-8274.patch
>
>
> *Problem 1* Currently, there's no way (AFAIK) to find all log messages 
> associated with a particular request.
> *Problem 2* There's also no easy way for multi-tenant Solr setups to find all 
> log messages associated with a particular customer/tenant.
> Both of these problems would be more manageable if Solr could be configured 
> to record an MDC tag based on a header, or some other user provided value.
> This would allow admins to group together logs about a single request.  If 
> the same header value is repeated multiple times this functionality could 
> also be used to group together arbitrary requests, such as those that come 
> from a particular user, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-11779) Basic long-term collection of aggregated metrics

2018-03-19 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405324#comment-16405324
 ] 

Otis Gospodnetic commented on SOLR-11779:
-

IMHO don't do it.  Investing in APIs and building tools around Solr that 
consume Solr metrics, events, etc. is a much better investment than keeping 
things self-contained.  A platform and ecosystem it enables win over a tool 
that tries to do everything.

> Basic long-term collection of aggregated metrics
> 
>
> Key: SOLR-11779
> URL: https://issues.apache.org/jira/browse/SOLR-11779
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 7.3, master (8.0)
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Major
>
> Tracking the key metrics over time is very helpful in understanding the 
> cluster and user behavior.
> Currently even basic metrics tracking requires setting up an external system 
> and either polling {{/admin/metrics}} or using {{SolrMetricReporter}}-s. The 
> advantage of this setup is that these external tools usually provide a lot of 
> sophisticated functionality. The downside is that they don't ship out of the 
> box with Solr and require additional admin effort to set up.
> Solr could collect some of the key metrics and keep their historical values 
> in a round-robin database (eg. using RRD4j) to keep the size of the historic 
> data constant (eg. ~64kB per metric), but at the same providing out of the 
> box useful insights into the basic system behavior over time. This data could 
> be persisted to the {{.system}} collection as blobs, and it could be also 
> presented in the Admin UI as graphs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Ahmet Arslan as Lucene/Solr committer

2017-12-17 Thread Otis Gospodnetic

Welcome Ahmet! :)

Otis
--
http://sematext.com

> On Dec 17, 2017, at 14:35, Joel Bernstein  wrote:
> 
> Welcome Ahmet!
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
>> On Sun, Dec 17, 2017 at 11:15 AM, Erick Erickson  
>> wrote:
>> Welcome Ahmet!
>> 
>> 
>> On Sun, Dec 17, 2017 at 7:30 AM, David Smiley  
>> wrote:
>> > Welcome Ahmet!
>> >
>> > On Sun, Dec 17, 2017 at 9:28 AM Yonik Seeley  wrote:
>> >>
>> >> Congrats Ahmet!
>> >>
>> >> -Yonik
>> >>
>> >>
>> >> On Sun, Dec 17, 2017 at 5:15 AM, Adrien Grand  wrote:
>> >> > Hi all,
>> >> >
>> >> > Please join me in welcoming Ahmet Arslan as the latest Lucene/Solr
>> >> > committer.
>> >> > Ahmet, it's tradition for you to introduce yourself with a brief bio.
>> >> >
>> >> > Congratulations and Welcome!
>> >> >
>> >> > Adrien
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>> > --
>> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
>> > http://www.solrenterprisesearchserver.com
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> 
>

Re: Congratulations to the new Lucene/Solr PMC Chair, Adrien Grand

2017-10-20 Thread Otis Gospodnetic

Bravo!

Otis
--
http://sematext.com

> On Oct 20, 2017, at 13:04, Varun Thacker  wrote:
> 
> Congratulations Adrien!
> 
>> On Fri, Oct 20, 2017 at 9:45 AM, Tomas Fernandez Lobbe  
>> wrote:
>> Congratulations Adrien!
>> 
>>> On Oct 19, 2017, at 10:51 AM, Martin Gainty  wrote:
>>> 
>>> Félicitations Adrien!
>>> 
>>> Martin 
>>> __ 
>>> 
>>> From: ansh...@apple.com  on behalf of Anshum Gupta 
>>> 
>>> Sent: Thursday, October 19, 2017 11:52 AM
>>> To: dev@lucene.apache.org
>>> Subject: Re: Congratulations to the new Lucene/Solr PMC Chair, Adrien Grand
>>>  
>>> Congratulations Adrien!
>>> 
>>> -Anshum
>>> 
>>> 
>>> 
 On Oct 19, 2017, at 12:19 AM, Tommaso Teofili  
 wrote:
 
 Once a year the Lucene PMC rotates the PMC chair and Apache Vice President 
 position.
 This year we have nominated and elected Adrien Grand as the chair and 
 today the board just approved it, so now it's official.
 
 Congratulations Adrien!
 Regards,
 Tommaso
>> 
>

Re: Welcome Hrishikesh Gadre as Lucene/Solr committer

2017-09-30 Thread Otis Gospodnetic

Welcome, Hrishikesh!

Otis
--
http://sematext.com

> On Sep 29, 2017, at 13:23, Yonik Seeley  wrote:
> 
> Hi All,
> 
> Please join me in welcoming Hrishikesh Gadre as the latest Lucene/Solr
> committer.
> Hrishikesh, it's tradition for you to introduce yourself with a brief bio.
> 
> Congrats and Welcome!
> -Yonik
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-11323) Expose cache maxSize and autowarm settings in JMX

2017-09-20 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174234#comment-16174234
 ] 

Otis Gospodnetic commented on SOLR-11323:
-

[~ab] this is that 1-line change we briefly chatted about in Vegas.  It would 
be great if you could this in in the next Solr 7.x minor release. Thanks.

> Expose cache maxSize and autowarm settings in JMX
> -
>
> Key: SOLR-11323
> URL: https://issues.apache.org/jira/browse/SOLR-11323
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: JMX, metrics
>Affects Versions: 7.0, 7.1
>Reporter: Bojan Smid
>
> Before Solr 7.*, cache maxSize and autowarm settings were exposed in JMX 
> along with cache metrics. There was a textual attribute "description" which 
> could be parsed to extract maxSize and autowarm settings. This was very 
> useful for various monitoring tools since maxSize and autowarm could then be 
> displayed on monitoring charts (one could for example compare current size of 
> some cache to its maxSize without digging through configs to find this 
> setting).
> Ideally maxSize and autowarm count/% would be exposed as two separate 
> attributes, but having a single description field (which can be parsed) would 
> also be better than nothing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-11323) Expose cache maxSize and autowarm settings in JMX

2017-09-05 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-11323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-11323:

Component/s: metrics
 JMX

> Expose cache maxSize and autowarm settings in JMX
> -
>
> Key: SOLR-11323
> URL: https://issues.apache.org/jira/browse/SOLR-11323
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: JMX, metrics
>Affects Versions: 7.0, 7.1
>Reporter: Bojan Smid
>
> Before Solr 7.*, cache maxSize and autowarm settings were exposed in JMX 
> along with cache metrics. There was a textual attribute "description" which 
> could be parsed to extract maxSize and autowarm settings. This was very 
> useful for various monitoring tools since maxSize and autowarm could then be 
> displayed on monitoring charts (one could for example compare current size of 
> some cache to its maxSize without digging through configs to find this 
> setting).
> Ideally maxSize and autowarm count/% would be exposed as two separate 
> attributes, but having a single description field (which can be parsed) would 
> also be better than nothing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Created] (SOLR-10573) Hide ZooKeeper

2017-04-26 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created SOLR-10573:
---

 Summary: Hide ZooKeeper
 Key: SOLR-10573
 URL: https://issues.apache.org/jira/browse/SOLR-10573
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Otis Gospodnetic


It may make sense to either embed ZK in Solr and allow running Solr instances 
with just ZK and no data or something else that hides ZK from Solr users...

Based on what the Solr poll that revealed lowish SolrCloud adoption and 
comments in 
http://search-lucene.com/m/Solr/eHNlm8wPIKJ3v51?subj=Poll+Master+Slave+or+SolrCloud
 that showed that people still find SolrCloud complex, at least partly because 
of the external ZK recommendation.

See also: 
http://search-lucene.com/m/Lucene/l6pAi11rBma0gNoI1?subj=SolrCloud+master+mode+planned+




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Created] (LUCENE-7806) Explore delta of delta encoding

2017-04-25 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created LUCENE-7806:


 Summary: Explore delta of delta encoding
 Key: LUCENE-7806
 URL: https://issues.apache.org/jira/browse/LUCENE-7806
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Otis Gospodnetic


>From 
>http://search-lucene.com/m/Lucene/l6pAi1YEfXhuGGIl1?subj=Re+Delta+of+delta+encoding

{quote}
delta of delta encoding is one of the Facebook Gorilla tricks that allows it to 
compress 16 bytes into 1.37 bytes on average -- see section 4.1 that describes 
it -- http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

This seems to be aimed at both time fields and numerical values.

https://github.com/burmanm/gorilla-tsc is a recent Java impl
{quote}

CC [~jpountz]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-10548) hyper-log-log based numBuckets for faceting

2017-04-22 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980235#comment-15980235
 ] 

Otis Gospodnetic commented on SOLR-10548:
-

A new paper published in January introduced a new cardinality estimation 
algorithm called LogLog-Beta/β:

https://arxiv.org/abs/1612.02284

"The new algorithm uses only one formula and needs no additional bias
corrections for the entire range of cardinalities, therefore, it is more
efficient and simpler to implement. Our simulations show that the accuracy
provided by the new algorithm is as good as or better than the accuracy
provided by either of HyperLogLog or HyperLogLog++."
Some comments about its accuracy (graphs included) can be found in this PR: 
https://github.com/antirez/redis/pull/3677

> hyper-log-log based numBuckets for faceting
> ---
>
> Key: SOLR-10548
> URL: https://issues.apache.org/jira/browse/SOLR-10548
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Yonik Seeley
>
> numBuckets currently uses an estimate (same as the unique function detailed 
> at http://yonik.com/solr-count-distinct/ ).  We should either change 
> implementations or introduce a way to optionally select a hyper-log-log based 
> approach for a better estimate with high field cardinalities.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-10418) metrics should return JVM system properties

2017-04-06 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10418:

Component/s: metrics

> metrics should return JVM system properties
> ---
>
> Key: SOLR-10418
> URL: https://issues.apache.org/jira/browse/SOLR-10418
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Noble Paul
>Assignee: Andrzej Bialecki 
>
> We need to stop using the custom solution used in rules and start using 
> metrics for everything



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-10359) User Interactions Logger Component

2017-03-27 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944247#comment-15944247
 ] 

Otis Gospodnetic commented on SOLR-10359:
-

Solr *could* be used to process and store this data, but would it be better to 
think more about creating a "spec" for this sort of data and pluggable outputs, 
so that people can choose to push their data elsewhere, whether their own 
custom tooling or 3rd party services?

> User Interactions Logger Component
> --
>
> Key: SOLR-10359
> URL: https://issues.apache.org/jira/browse/SOLR-10359
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alessandro Benedetti
>  Labels: CTR, evaluation
>
> *Introduction*
> Being able to evaluate the quality of your search engine is becoming more and 
> more important day by day.
> This issue is to put a milestone to integrate online evaluation metrics with 
> Solr.
> *Scope*
> Scope of this issue is to provide a set of components able to :
> 1) Collect Search Results impressions ( results shown per query)
> 2) Collect Users interactions ( user interactions on the search results per 
> query e.g. clicks, bookmarking,ect )
> 3) Calculate evaluation metrics on demand, such as Click Through Rate, DCG ...
> *Technical Design*
> A SearchComponent can be designed :
> *UsersEventsLoggerComponent*
> A property (such as storeDir) will define where the data collected will be 
> stored.
> Different data structures can be explored, to keep it simple, a first 
> implementation can be a Lucene Index.
> *Data Model*
> The user event can be modelled in the following way :
>  - the user query the event is related to
>  - the ID of the search result involved in the interaction
>  - the position in the ranking of the search result involved 
> in the interaction
>  - time when the interaction happened
>  - 0 for impressions, a value between 1-5 to identify the 
> type of user event, the semantic will depend on the domain and use cases
>  - this can identify a variant, in A/B testing
> *Impressions Logging*
> When the SearchComponent  is assigned to a request handler, everytime it 
> processes a request and return to the user a result set for a query, the 
> component will collect the impressions ( results returned) and index them in 
> the auxiliary lucene index.
> This will happen in parallel as soon as you return the results to avoid 
> affecting the query time.
> Of course an impact on CPU load and memory is expected, will be interesting 
> to minimise it.
> * User Events Logging *
> An UpdateHandler will be exposed to accept POST requests and collect user 
> events.
> Everytime a request is sent, the user event will be indexed in the underline 
> auxiliary Lucene Index.
> * Stats Calculation *
> A RequestHandler will be exposed to be able to calculate stats and 
> aggregations for the metrics :
> /evaluation?metric=ctr=query=testA,testB
> This request could calculate the CTR for our testA and testB to compare.
> Showing stats in total and per query ( to highlight the queries with 
> lower/higher CTR).
> The calculations will happen separating the  for an easy 
> comparison.
> Will be important to keep it as simple as possible for a first version, to 
> then extend it as much as we like



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-10247) Support non-numeric metrics and a "compact" format of /admin/metrics

2017-03-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926832#comment-15926832
 ] 

Otis Gospodnetic commented on SOLR-10247:
-

Short version - /cat provides table-like output - columns, with optional 
header, more or less verbose.  Handy for piping into sort and friends that 
humans like to use, but also handy for agents because its output is 
simpler/cheaper to parse than JSON.

> Support non-numeric metrics and a "compact" format of /admin/metrics
> 
>
> Key: SOLR-10247
> URL: https://issues.apache.org/jira/browse/SOLR-10247
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 6.5, master (7.0)
>
> Attachments: compactFormat.png, currentFormat.png, SOLR-10247.patch, 
> SOLR-10247.patch
>
>
> Metrics API currently supports only numeric values. However, it's useful also 
> to report non-numeric values such as eg. version, disk type, component state, 
> some system properties, etc.
> Codahale {{Gauge}} metric type can be used for this purpose, and 
> convenience methods can be added to {{SolrMetricManager}} to make it easier 
> to use.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-10247) Support non-numeric metrics and a "compact" format of /admin/metrics

2017-03-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926791#comment-15926791
 ] 

Otis Gospodnetic commented on SOLR-10247:
-

bq. "compact" format of /admin/metrics
Something like /cat in ES or something different? /cat in ES is handy...

> Support non-numeric metrics and a "compact" format of /admin/metrics
> 
>
> Key: SOLR-10247
> URL: https://issues.apache.org/jira/browse/SOLR-10247
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 6.5, master (7.0)
>
> Attachments: compactFormat.png, currentFormat.png, SOLR-10247.patch, 
> SOLR-10247.patch
>
>
> Metrics API currently supports only numeric values. However, it's useful also 
> to report non-numeric values such as eg. version, disk type, component state, 
> some system properties, etc.
> Codahale {{Gauge}} metric type can be used for this purpose, and 
> convenience methods can be added to {{SolrMetricManager}} to make it easier 
> to use.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-10262) Collect request latency metrics for histograms

2017-03-10 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905566#comment-15905566
 ] 

Otis Gospodnetic commented on SOLR-10262:
-

bq. This would have to be configured early on in solr.xml or even via system 
properties, which is a bit ugly.
Not sure what exactly you mean by this, but I don't think it should be the new 
default because of 
http://search-lucene.com/m/Lucene/l6pAi15LobI6m5Ny1?subj=Solr+JMX+changes+and+backwards+in+compatibility
 .  I am hoping it can be added to whatever is already there.  Then people and 
tools that monitor Solr can decide which data they want to collect.  The old 
stuff could be marked/announced as deprecated if we really don't want/need that 
data, and removed in one of the future releases.

> Collect request latency metrics for histograms
> --
>
> Key: SOLR-10262
> URL: https://issues.apache.org/jira/browse/SOLR-10262
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Otis Gospodnetic
>Assignee: Andrzej Bialecki 
>
> Since [~ab] is on a role with metrics...
> There is no way to accurately compute request latency percentiles from 
> metrics exposed by Solr today. We should consider making that possible. c.f. 
> https://github.com/HdrHistogram/HdrHistogram



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Created] (SOLR-10262) Collect request latency metrics for histograms

2017-03-09 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created SOLR-10262:
---

 Summary: Collect request latency metrics for histograms
 Key: SOLR-10262
 URL: https://issues.apache.org/jira/browse/SOLR-10262
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
  Components: metrics
Reporter: Otis Gospodnetic


Since [~ab] is on a role with metrics...
There is no way to accurately compute request latency percentiles from metrics 
exposed by Solr today. We should consider making that possible. c.f. 
https://github.com/HdrHistogram/HdrHistogram




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-10226) JMX metric avgTimePerRequest broken

2017-03-03 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10226:

Description: 
JMX Metric avgTimePerRequest (of 
org.apache.solr.handler.component.SearchHandler) doesn't appear to behave 
correctly anymore. It was a cumulative value in pre-6.4 versions. Since 
totalTime metric was removed (which was a base for monitoring calculations), 
avgTimePerRequest seems like possible alternative to calculate "time spent in 
requests since last measurement", but it behaves strangely after 6.4.

I did a simple test on gettingstarted collection (just unpacked the Solr 6.4.1 
version and started it with "bin/solr start -e cloud -noprompt"). The query I 
used was:
http://localhost:8983/solr/gettingstarted/select?indent=on=*:*=json
I run it 30 times in a row (with approx 1 sec between executions).

At the same time I was looking (with jconsole) at bean 
solr/gettingstarted_shard2_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler

Here is how metric was changing over time (first number is "requests" metric, 
second number is "avgTimePerRequest"):
10   6.6033
12   5.9557
13   0.9015---> 13th req would need negative duration if this was cumulative
15   6.7315
16   7.4873
17   0.8458---> same case with 17th request
23   6.1076

At the same time bean 
solr/gettingstarted_shard1_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler
  also showed strange values:
65.13482
810.5694
90.504
10  0.344
12  8.8121
18  3.3531

CC [~ab]

  was:
JMX Metric avgTimePerRequest (of 
org.apache.solr.handler.component.SearchHandler) doesn't appear to behave 
correctly anymore. It was a cumulative value in pre-6.4 versions. Since 
totalTime metric was removed (which was a base for monitoring calculations), 
avgTimePerRequest seems like possible alternative to calculate "time spent in 
requests since last measurement", but it behaves strangely after 6.4.

I did a simple test on gettingstarted collection (just unpacked the Solr 6.4.1 
version and started it with "bin/solr start -e cloud -noprompt"). The query I 
used was:
http://localhost:8983/solr/gettingstarted/select?indent=on=*:*=json
I run it 30 times in a row (with approx 1 sec between executions).

At the same time I was looking (with jconsole) at bean 
solr/gettingstarted_shard2_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler

Here is how metric was changing over time (first number is "requests" metric, 
second number is "avgTimePerRequest"):
10   6.6033
12   5.9557
13   0.9015---> 13th req would need negative duration if this was cumulative
15   6.7315
16   7.4873
17   0.8458---> same case with 17th request
23   6.1076

At the same time bean 
solr/gettingstarted_shard1_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler
  also showed strange values:
65.13482
810.5694
90.504
10  0.344
12  8.8121
18  3.3531


> JMX metric avgTimePerRequest broken
> ---
>
> Key: SOLR-10226
> URL: https://issues.apache.org/jira/browse/SOLR-10226
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 6.4.1
>Reporter: Bojan Smid
>
> JMX Metric avgTimePerRequest (of 
> org.apache.solr.handler.component.SearchHandler) doesn't appear to behave 
> correctly anymore. It was a cumulative value in pre-6.4 versions. Since 
> totalTime metric was removed (which was a base for monitoring calculations), 
> avgTimePerRequest seems like possible alternative to calculate "time spent in 
> requests since last measurement", but it behaves strangely after 6.4.
> I did a simple test on gettingstarted collection (just unpacked the Solr 
> 6.4.1 version and started it with "bin/solr start -e cloud -noprompt"). The 
> query I used was:
> http://localhost:8983/solr/gettingstarted/select?indent=on=*:*=json
> I run it 30 times in a row (with approx 1 sec between executions).
> At the same time I was looking (with jconsole) at bean 
> solr/gettingstarted_shard2_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler
> Here is how metric was changing over time (first number is "requests" metric, 
> second number is "avgTimePerRequest"):
> 10   6.6033
> 12   5.9557
> 13   0.9015---> 13th req would need negative duration if this was 
> cumulative
> 15   6.7315
> 16   7.4873
> 17   0.8458---> same case with 17th request
> 23   6.1076
> At the same time bean 
> solr/gettingstarted_shard1_replica

[jira] [Issue Comment Deleted] (SOLR-9898) Documentation for metrics collection and /admin/metrics

2017-03-02 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9898:
---
Comment: was deleted

(was: While I love all the new metrics you are adding, I think metrics should 
be treated like code/features in terms of how backwards 
compatibility/deprecation is handled.  Otherwise, on upgrade, people's 
monitoring breaks and monitoring is kind of important... :)

Note: Looks like recent metrics changes broke/changed previously-existing 
MBeans...)

> Documentation for metrics collection and /admin/metrics
> ---
>
> Key: SOLR-9898
> URL: https://issues.apache.org/jira/browse/SOLR-9898
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Affects Versions: 6.4, master (7.0)
>Reporter: Andrzej Bialecki 
>Assignee: Cassandra Targett
>
> Draft documentation follows.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-9898) Documentation for metrics collection and /admin/metrics

2017-03-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892436#comment-15892436
 ] 

Otis Gospodnetic commented on SOLR-9898:


While I love all the new metrics you are adding, I think metrics should be 
treated like code/features in terms of how backwards compatibility/deprecation 
is handled.  Otherwise, on upgrade, people's monitoring breaks and 
monitoring is kind of important... :)

Note: Looks like recent metrics changes broke/changed previously-existing 
MBeans...

> Documentation for metrics collection and /admin/metrics
> ---
>
> Key: SOLR-9898
> URL: https://issues.apache.org/jira/browse/SOLR-9898
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Affects Versions: 6.4, master (7.0)
>Reporter: Andrzej Bialecki 
>Assignee: Cassandra Targett
>
> Draft documentation follows.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-10214) clean up BlockCache metrics

2017-02-28 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10214:

Component/s: metrics

> clean up BlockCache metrics
> ---
>
> Key: SOLR-10214
> URL: https://issues.apache.org/jira/browse/SOLR-10214
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Yonik Seeley
> Attachments: SOLR-10214.patch
>
>
> Many (most) of the block cache metrics are unused (I assume just inherited 
> from Blur) and unmaintained (i.e. most will be 0).  Currently only the size 
> and number of evictions is tracked.
> We should remove unused stats and start tracking
> - number of lookups (or number of misses)
> - number of hits
> - number of inserts
> - number of store failures



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-10029) Fix search link on http://lucene.apache.org/

2017-02-07 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-10029:

Summary: Fix search link on http://lucene.apache.org/  (was: Fix Search 
link in http://lucene.apache.org/ & http://lucene.apache.org/)

> Fix search link on http://lucene.apache.org/
> 
>
> Key: SOLR-10029
> URL: https://issues.apache.org/jira/browse/SOLR-10029
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: website
>Reporter: Tien Nguyen Manh
> Attachments: SOLR-10029.patch
>
>
> The current link to http://search-lucene.com is
> http://search-lucene.com/lucene?q=apache=sl
> http://search-lucene.com/solr?q=apache
> The project names  should be upcase
> http://search-lucene.com/Lucene?q=apache=sl
> http://search-lucene.com/Solr?q=apache



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-10091) Support for CDCR using an external queueing service

2017-02-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-10091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852310#comment-15852310
 ] 

Otis Gospodnetic commented on SOLR-10091:
-

Would this create a dependency on (specific version of) Kafka?  You may want to 
run that by dev@

> Support for CDCR using an external queueing service
> ---
>
> Key: SOLR-10091
> URL: https://issues.apache.org/jira/browse/SOLR-10091
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: CDCR
>Affects Versions: 6.x
>Reporter: Oliver Bates
>Priority: Minor
>  Labels: features
>
> The idea is to contribute part of the work presented here:
> https://www.youtube.com/watch?v=83vbY9f3nXA
> Specifically these components:
> - update processor that writes updates to an external queueing service 
> (abstracted by an interface)
> - a Kafka implementation of this interface (that goes into /contrib?) so 
> anyone using kafka can use this "out of the box"
> - a consumer application
> For the consumer application, the idea is an app that's queue-agnostic and 
> then the queue-specific consumer bit is loaded at runtime. In this case, 
> there's a "default" kafka consumer in there as well.
> I'm not exactly sure what the best structure would be for these pieces (the 
> kafka implementations and the general consumer app code), so I'll simply post 
> class definitions here and let the community decide where they should go.
> The core work is finished. I just need to clean it up a bit and convert the 
> tests to fit this repo (right now they're using an external framework).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Resolved] (SOLR-4500) How can we integrate LDAP authentiocation with the Solr instance

2017-01-26 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved SOLR-4500.

Resolution: Invalid

Questions => mailing list

> How can we integrate LDAP authentiocation with the Solr instance
> 
>
> Key: SOLR-4500
> URL: https://issues.apache.org/jira/browse/SOLR-4500
> Project: Solr
>  Issue Type: Task
>Affects Versions: 4.1
>Reporter: Srividhya
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-9880) Add Ganglia and Graphite metrics reporters

2016-12-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768143#comment-15768143
 ] 

Otis Gospodnetic commented on SOLR-9880:


May I recommend a different approach?  With the current approach you'll always 
have somebody come in and ask for additional reporter and typically that won't 
be very high on Solr devs' interest, plus it will require more work, additional 
dependencies, etc.  Moreover, if you do this then you have to think about 
things like destination not being available, about possible on-disk buffering 
so data is not lost, about ensuring the buffered data is purged if there is too 
much of it, and so on.  Solr doesn't want to be in data shipper business.  My 
suggestion, based on working with monitoring and logging for the last N years - 
just log metrics to a file.  There are a number of modern tools that know how 
to tail files, parse their content, ship it somewhere, have buffering, have 
multiple outputs, and so on.  Just make sure data is nicely structured to make 
parsing easy, and done in a way that when you add more metrics you can do it in 
a backwards-compatible way.

> Add Ganglia and Graphite metrics reporters
> --
>
> Key: SOLR-9880
> URL: https://issues.apache.org/jira/browse/SOLR-9880
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Fix For: master (7.0), 6.4
>
>
> Originally SOLR-4735 provided implementations for these reporters (wrappers 
> for Dropwizard components to use {{SolrMetricReporter}} API).
> However, this functionality has been split into its own issue due to the 
> additional transitive dependencies that these reporters bring:
> * Ganglia:
> ** metrics-ganglia, ASL, 3kB
> ** gmetric4j (Ganglia RPC implementation), BSD, 29kB
> * Graphite
> ** metrics-graphite, ASL, 10kB
> ** amqp-client (RabbitMQ Java client, marked optional in pom?), ASL/MIT/GPL2, 
> 190kB
> IMHO these are not very large dependencies, and given the useful 
> functionality they provide it's worth adding them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-9805) Use metrics-jvm library to instrument jvm internals

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9805:
---
Component/s: metrics

> Use metrics-jvm library to instrument jvm internals
> ---
>
> Key: SOLR-9805
> URL: https://issues.apache.org/jira/browse/SOLR-9805
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR-9805.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> See http://metrics.dropwizard.io/3.1.0/manual/jvm/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-8785) Use Metrics library for core metrics

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-8785:
---
Component/s: metrics

> Use Metrics library for core metrics
> 
>
> Key: SOLR-8785
> URL: https://issues.apache.org/jira/browse/SOLR-8785
> Project: Solr
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 4.1
>Reporter: Jeff Wartes
>Assignee: Shalin Shekhar Mangar
>  Labels: patch, patch-available
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR-8785-increment.patch, SOLR-8785.patch, 
> SOLR-8785.patch, SOLR-8785.patch, SOLR_8775_rates_per_minute_fix.patch
>
>
> The Metrics library (https://dropwizard.github.io/metrics/3.1.0/) is a 
> well-known way to track metrics about applications. 
> In SOLR-1972, latency percentile tracking was added. The comment list is 
> long, so here’s my synopsis:
> 1. An attempt was made to use the Metrics library
> 2. That attempt failed due to a memory leak in Metrics v2.1.1
> 3. Large parts of Metrics were then copied wholesale into the 
> org.apache.solr.util.stats package space and that was used instead.
> Copy/pasting Metrics code into Solr may have been the correct solution at the 
> time, but I submit that it isn’t correct any more. 
> The leak in Metrics was fixed even before SOLR-1972 was released, and by 
> copy/pasting a subset of the functionality, we miss access to other important 
> things that the Metrics library provides, particularly the concept of a 
> Reporter. (https://dropwizard.github.io/metrics/3.1.0/manual/core/#reporters)
> Further, Metrics v3.0.2 is already packaged with Solr anyway, because it’s 
> used in two contrib modules. (map-reduce and morphines-core)
> I’m proposing that:
> 1. Metrics as bundled with Solr be upgraded to the current v3.1.2
> 2. Most of the org.apache.solr.util.stats package space be deleted outright, 
> or gutted and replaced with simple calls to Metrics. Due to the copy/paste 
> origin, the concepts should mostly map 1:1.
> I’d further recommend a usage pattern like:
> SharedMetricRegistries.getOrCreate(System.getProperty(“solr.metrics.registry”,
>  “solr-registry”))
> There are all kinds of areas in Solr that could benefit from metrics tracking 
> and reporting. This pattern allows diverse areas of code to track metrics 
> within a single, named registry. This well-known-name then becomes a handle 
> you can use to easily attach a Reporter and ship all of those metrics off-box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-9812) Implement a /admin/metrics API

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9812:
---
Component/s: metrics

> Implement a /admin/metrics API
> --
>
> Key: SOLR-9812
> URL: https://issues.apache.org/jira/browse/SOLR-9812
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR-9812.patch, SOLR-9812.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We added a bare bones metrics API in SOLR-9788 but due to limitations with 
> the metrics servlet supplied by the metrics library, it can show statistics 
> from only one metric registry. SOLR-4735 has added a hierarchy of metric 
> registries and the /admin/metrics API should support showing all of them as 
> well as be able to filter metrics from a given registry name.
> In this issue we will implement the improved /admin/metrics API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-9788) Use instrumented jetty classes

2016-12-17 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-9788:
---
Component/s: metrics

> Use instrumented jetty classes
> --
>
> Key: SOLR-9788
> URL: https://issues.apache.org/jira/browse/SOLR-9788
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
> Fix For: master (7.0), 6.4
>
> Attachments: SOLR_9788.patch, SOLR_9788.patch, SOLR_9788.patch
>
>
> Dropwizard metrics library integrated in SOLR-8785 provides a set of 
> instrumented equivalents of Jetty classes. This allows us to collect 
> statistics on  Jetty's connector, thread pool and handlers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-9599) DocValues performance regression with new iterator API

2016-12-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752588#comment-15752588
 ] 

Otis Gospodnetic commented on SOLR-9599:


[~ysee...@gmail.com] all sub-tasks seem to be done/resolved should this 
then be resolved, too?

> DocValues performance regression with new iterator API
> --
>
> Key: SOLR-9599
> URL: https://issues.apache.org/jira/browse/SOLR-9599
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>Reporter: Yonik Seeley
> Fix For: master (7.0)
>
>
> I did a quick performance comparison of faceting indexed fields (i.e. 
> docvalues are not stored) using method=dv before and after the new docvalues 
> iterator went in (LUCENE-7407).
> 5M document index, 21 segments, single valued string fields w/ no missing 
> values.
> || field cardinality || new_time / old_time ||
> |10|2.01|
> |1000|2.02|
> |1|1.85|
> |10|1.56|
> |100|1.31|
> So unfortunately, often twice as slow.
> See followup messages for tests using real docvalues as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Resolved] (LUCENE-7253) Make sparse doc values and segments merging more efficient

2016-12-15 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-7253.
--
Resolution: Duplicate
  Assignee: Michael McCandless

LUCENE-7457 and many others actually took care of the issue reported here.

> Make sparse doc values and segments merging more efficient 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>Assignee: Michael McCandless
>  Labels: performance
> Fix For: master (7.0)
>
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still big problem with Doc Values merges for sparse 
> fields. When we imagine 1 billion documents index it seems it doesn't matter 
> if all documents have value for this field or there is only 1 document with 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem but there are several cases in which one can expect having many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only small 
> subset of all fields. Average document contains 5-7 different numeric values. 
> As you can see data was very sparse in these fields. It turned out that 
> ingestion process was CPU-bound. Most of CPU time was spent in DocValues 
> related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during merging segments.
> Adrien Grand suggested to reduce the number of sparse fields and replace them 
> with smaller number of denser fields. This helped a lot but complicated 
> fields naming. 
> I am not much familiar with Doc Values source code but I have small 
> suggestion how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take an example 
> of numeric Doc Values. Would it be possible to replace Iterator which 
> "travels" through all documents with Iterator over collection of non empty 
> values? Of course this would require storing object (instead of numeric) 
> which contains value and document ID. Such an iterator could significantly 
> improve merge time of sparse Doc Values fields. IMHO this won't cause big 
> overhead for dense structures but it can be game changer for sparse 
> structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>new Iterable() {
>  @Override
>  public Iterator iterator() {
>return new NumericIterator(maxDoc, values, 
> docsWithField);
>  }
>});
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is PackedLongValues and it wouldn't be 
> good to change it with different class (some kind of list) because this may 
> break DV performance for dense fields. I hope someone can suggest interesting 
> solutions for this problem :).
> It would be great if discussion about sparse Doc Values merge performance can 
> start here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-4587) Implement Saved Searches a la ElasticSearch Percolator

2016-12-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752039#comment-15752039
 ] 

Otis Gospodnetic commented on SOLR-4587:


http://search-lucene.com/m/Solr/eHNlnz4JxwIMSo1?subj=Deep+dive+on+the+topic+streaming+expression
 for anyone who wants to follow.

> Implement Saved Searches a la ElasticSearch Percolator
> --
>
> Key: SOLR-4587
> URL: https://issues.apache.org/jira/browse/SOLR-4587
> Project: Solr
>  Issue Type: New Feature
>  Components: SearchComponents - other, SolrCloud
>    Reporter: Otis Gospodnetic
> Fix For: 6.0
>
>
> Use Lucene MemoryIndex for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-11-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685728#comment-15685728
 ] 

Otis Gospodnetic commented on LUCENE-6966:
--

Uh, silence. :( I have not looked into the implementation and have only skimmed 
comments here in the past.  My general feeling though is that until/unless this 
gets committed most people won't bother looking (I think we saw similar 
behaviour with Solr CDCR which was WIP in JIRA for a while and was labeled as 
such for a long time but now that it's in I hear more and more people using 
it http://search-lucene.com/?q=cdcr ) and once it's in it may get worked on 
by more interested parties.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, 
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community

[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-11-15 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669305#comment-15669305
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

I had a quick look at [~yo...@apache.org]'s SOLR-9599 and then at [~jpountz]'s 
patch in LUCENE-7462 that makes the search-time work less expensive.  Last 
comment from Yonik reporting faceting regression in Solr was from October 18.  
Adrien't patch was committed on October 24.  Maybe things are working better 
for Solr now?

If not, in interest of moving forward, what do people think about Yonik's 
suggestion:
bq. Perhaps we should have both a random access API as well as an iterator API?
?

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
> Fix For: master (7.0)
>
> Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> what you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecate existing random-access APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7474) Improve doc values writers

2016-10-10 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564060#comment-15564060
 ] 

Otis Gospodnetic commented on LUCENE-7474:
--

I was wondering how one could compare Lucene indexing (and searching) 
performance before and after this change.  Is there a way to add a sparse 
dataset for the nightly benchmark and use it for both trunk and 6.x branch, so 
one can see the performance difference of Lucene 6.x with sparse data vs. 
Lucene 7.x with sparse data?

> Improve doc values writers
> --
>
> Key: LUCENE-7474
> URL: https://issues.apache.org/jira/browse/LUCENE-7474
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: master (7.0)
>
> Attachments: LUCENE-7474.patch
>
>
> One of the goals of the new iterator-based API is to better handle sparse 
> data. However, the current doc values writers still use a dense 
> representation, and some of them perform naive linear scans in the nextDoc 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7474) Improve doc values writers

2016-10-05 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548058#comment-15548058
 ] 

Otis Gospodnetic commented on LUCENE-7474:
--

yhooo! :)
Do the nightly builds have any tests that will exercise these new writers, the 
new 7.0 Codec, etc., so one can see how much speed this change gains?

> Improve doc values writers
> --
>
> Key: LUCENE-7474
> URL: https://issues.apache.org/jira/browse/LUCENE-7474
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: master (7.0)
>
> Attachments: LUCENE-7474.patch
>
>
> One of the goals of the new iterator-based API is to better handle sparse 
> data. However, the current doc values writers still use a dense 
> representation, and some of them perform naive linear scans in the nextDoc 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-2883) Add QParser boolean hint for filter queries

2016-09-23 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516375#comment-15516375
 ] 

Otis Gospodnetic commented on SOLR-2883:


We hit this in Solr-Redis: 
https://github.com/sematext/solr-redis/issues/38#issuecomment-249184074

> Add QParser boolean hint for filter queries
> ---
>
> Key: SOLR-2883
> URL: https://issues.apache.org/jira/browse/SOLR-2883
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Reporter: David Smiley
>
> It would be useful if there was a QParser hint of some kind that indicated 
> that the score isn't needed. This would be set by Solr in QueryComponent when 
> processing the fq param, and some field types could check for this and return 
> more efficient Query implementations from FieldType.getFieldQuery(). For 
> example, a geospatial field could return a ConstantScoreQuery(Filter) 
> implementation when only filtering is needed, or return a query that returns 
> a geospatial distance for a document's score. I think there are probably 
> other opportunities for this flag to have its use but I'm not sure.
> As an example solution, a local param of needScore=false could be added.  I 
> should be functionally equivalent to fq={!needScore=false}.
> Here is a modified portion of QueryComponent at line 135 to illustrate what 
> the change would be. I haven't tested it but it compiles.
> {code:java}
> for (String fq : fqs) {
>   if (fq != null && fq.trim().length()!=0) {
> QParser fqp = QParser.getParser(fq, null, req);
> SolrParams localParams = fqp.getLocalParams();
> SolrParams defaultLocalParams = new 
> MapSolrParams(Collections.singletonMap("needScore","false"));
> SolrParams newLocalParams = new 
> DefaultSolrParams(localParams,defaultLocalParams);
> fqp.setLocalParams(newLocalParams);
> filters.add(fqp.getQuery());
>   }
> }
> {code}
> It would probably be best to define the "needScore" constant somewhere better 
> but this is it in a nutshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7253) Make sparse doc values and segments merging more efficient

2016-09-21 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-7253:
-
Fix Version/s: master (7.0)

> Make sparse doc values and segments merging more efficient 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
> Fix For: master (7.0)
>
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still big problem with Doc Values merges for sparse 
> fields. When we imagine 1 billion documents index it seems it doesn't matter 
> if all documents have value for this field or there is only 1 document with 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem but there are several cases in which one can expect having many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only small 
> subset of all fields. Average document contains 5-7 different numeric values. 
> As you can see data was very sparse in these fields. It turned out that 
> ingestion process was CPU-bound. Most of CPU time was spent in DocValues 
> related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during merging segments.
> Adrien Grand suggested to reduce the number of sparse fields and replace them 
> with smaller number of denser fields. This helped a lot but complicated 
> fields naming. 
> I am not much familiar with Doc Values source code but I have small 
> suggestion how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take an example 
> of numeric Doc Values. Would it be possible to replace Iterator which 
> "travels" through all documents with Iterator over collection of non empty 
> values? Of course this would require storing object (instead of numeric) 
> which contains value and document ID. Such an iterator could significantly 
> improve merge time of sparse Doc Values fields. IMHO this won't cause big 
> overhead for dense structures but it can be game changer for sparse 
> structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>new Iterable() {
>  @Override
>  public Iterator iterator() {
>return new NumericIterator(maxDoc, values, 
> docsWithField);
>  }
>});
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is PackedLongValues and it wouldn't be 
> good to change it with different class (some kind of list) because this may 
> break DV performance for dense fields. I hope someone can suggest interesting 
> solutions for this problem :).
> It would be great if discussion about sparse Doc Values merge performance can 
> start here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-09-20 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507892#comment-15507892
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

bq. there is a lot of fun improvements we can make here, in follow-on issues, 
so that e.g. LUCENE-7253 (merging of sparse doc values fields) is fixed.

So LUCENE-7253 is where the new Codec work for trunk will go?
Did you maybe create the other issues you mentioned?  Asking because I'm 
curious what you have in mind and so I can link+watch.

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
> Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> what you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecate existing random-access APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-08-31 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454278#comment-15454278
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

Once these changes are made do you think one will be able to just replace the 
Lucene jar in e.g. ES 5.x?

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> what you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecate existing random-access APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

2016-08-05 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410479#comment-15410479
 ] 

Otis Gospodnetic commented on LUCENE-7407:
--

Can I label this with #AWESOME!!! ? Could Adrien's LUCENE-6928 piggyback on 
this?

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> what you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecate existing random-access APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7407) Explore switching doc values to an iterator API

2016-08-05 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-7407:
-
Labels: docValues  (was: )

> Explore switching doc values to an iterator API
> ---
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>  Labels: docValues
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
> what you actually use", like postings, which is a compelling
> reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecate existing random-access APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7452) json facet api returning inconsistent counts in cloud set up

2016-07-13 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375418#comment-15375418
 ] 

Otis Gospodnetic commented on SOLR-7452:


[~yo...@apache.org] We've gotten inquiries about this bug/patch/fix at 
Sematext, but if you're working on this then maybe it's better for us not to 
meddle, so like a few others above, I'm curious about the status of this.

> json facet api returning inconsistent counts in cloud set up
> 
>
> Key: SOLR-7452
> URL: https://issues.apache.org/jira/browse/SOLR-7452
> Project: Solr
>  Issue Type: Bug
>  Components: Facet Module
>Affects Versions: 5.1
>Reporter: Vamsi Krishna D
>  Labels: count, facet, sort
> Fix For: 5.2
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> While using the newly added feature of json term facet api 
> (http://yonik.com/json-facet-api/#TermsFacet) I am encountering inconsistent 
> returns of counts of faceted value ( Note I am running on a cloud mode of 
> solr). For example consider that i have txns_id(unique field or key), 
> consumer_number and amount. Now for a 10 million such records , lets say i 
> query for 
> q=*:*=0&
>  json.facet={
>biskatoo:{
>type : terms,
>field : consumer_number,
>limit : 20,
>   sort : {y:desc},
>   numBuckets : true,
>   facet:{
>y : "sum(amount)"
>}
>}
>  }
> the results are as follows ( some are omitted ):
> "facets":{
> "count":6641277,
> "biskatoo":{
>   "numBuckets":3112708,
>   "buckets":[{
>   "val":"surya",
>   "count":4,
>   "y":2.264506},
>   {
>   "val":"raghu",
>   "COUNT":3,   // capitalised for recognition 
>   "y":1.8},
> {
>   "val":"malli",
>   "count":4,
>   "y":1.78}]}}}
> but if i restrict the query to 
> q=consumer_number:raghu=0&
>  json.facet={
>biskatoo:{
>type : terms,
>field : consumer_number,
>limit : 20,
>   sort : {y:desc},
>   numBuckets : true,
>   facet:{
>y : "sum(amount)"
>}
>}
>  }
> i get :
>   "facets":{
> "count":4,
> "biskatoo":{
>   "numBuckets":1,
>   "buckets":[{
>   "val":"raghu",
>   "COUNT":4,
>   "y":2429708.24}]}}}
> One can see the count results are inconsistent ( and I found many occasions 
> of inconsistencies).
> I have tried the patch https://issues.apache.org/jira/browse/SOLR-7412 but 
> still the issue seems not resolved



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7341) xjoin - join data from external sources

2016-07-05 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362942#comment-15362942
 ] 

Otis Gospodnetic commented on SOLR-7341:


[~adamgamble] - vote for it :)

> xjoin - join data from external sources
> ---
>
> Key: SOLR-7341
> URL: https://issues.apache.org/jira/browse/SOLR-7341
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Tom Winch
>Priority: Minor
> Fix For: 4.10.3, 5.3.2, 6.0
>
> Attachments: SOLR-7341.patch-4.10.3, SOLR-7341.patch-4_10, 
> SOLR-7341.patch-5.3.2, SOLR-7341.patch-5_3, SOLR-7341.patch-master, 
> SOLR-7341.patch-trunk, SOLR-7341.patch-trunk
>
>
> h2. XJoin
> The "xjoin" SOLR contrib allows external results to be joined with SOLR 
> results in a query and the SOLR result set to be filtered by the results of 
> an external query. Values from the external results are made available in the 
> SOLR results and may also be used to boost the scores of corresponding 
> documents during the search. The contrib consists of the Java classes 
> XJoinSearchComponent, XJoinValueSourceParser and XJoinQParserPlugin (and 
> associated classes), which must be configured in solrconfig.xml, and the 
> interfaces XJoinResultsFactory and XJoinResults, which are implemented by the 
> user to provide the link between SOLR and the external results source (but 
> see below for details of how to use the in-built SimpleXJoinResultsFactory 
> implementation). External results and SOLR documents are matched via a single 
> configurable attribute (the "join field").
> To include the XJoin contrib classes, add the following config to 
> solrconfig.xml:
> {code:xml}
> 
>   ..
>
>regex=".*\.jar" />
>regex="solr-xjoin-\d.*\.jar" />
>   ..
> 
> {code}
> Note that any JARs containing implementations of the XJoinResultsFactory must 
> also be included.
> h2. Java classes and interfaces
> h3. XJoinResultsFactory
> The user implementation of this interface is responsible for connecting to an 
> external source to perform a query (or otherwise collect results). Parameters 
> with prefix ".external." are passed from the SOLR query URL 
> to pararameterise the search. The interface has the following methods:
> * void init(NamedList args) - this is called during SOLR initialisation, and 
> passed parameters from the search component configuration (see below)
> * XJoinResults getResults(SolrParams params) - this is called during a SOLR 
> search to generate external results, and is passed parameters from the SOLR 
> query URL (as above)
> For example, the implementation might perform queries of an external source 
> based on the 'q' SOLR query URL parameter (in full,  name>.external.q).
> h3. XJoinResults
> A user implementation of this interface is returned by the getResults() 
> method of the XJoinResultsFactory implementation. It has methods:
> * Object getResult(String joinId) - this should return a particular result 
> given the value of the join attribute
> * Iterable getJoinIds() - this should return an ordered (ascending) 
> list of the join attribute values for all results of the external search
> h3. XJoinSearchComponent
> This is the central Java class of the contrib. It is a SOLR search component, 
> configured in solrconfig.xml and included in one or more SOLR request 
> handlers. There is one XJoin search component per external source, and each 
> has two main responsibilities:
> * Before the SOLR search, it connects to the external source and retrieves 
> results, storing them in the SOLR request context
> * After the SOLR search, it matches SOLR document in the results set and 
> external results via the join field, adding attributes from the external 
> results to documents in the SOLR results set
> It takes the following initialisation parameters:
> * factoryClass - this specifies the user-supplied class implementing 
> XJoinResultsFactory, used to generate external results
> * joinField - this specifies the attribute on which to join between SOLR 
> documents and external results
> * external - this parameter set is passed to configure the 
> XJoinResultsFactory implementation
> For example, in solrconfig.xml:
> {code:xml}
>  class="org.apache.solr.search.xjoin.XJoinSearchComponent">
>   test.TestXJoinResultsFactory
>   id
>   
> 1,2,3
>   
> 
> {code}
> Here, the search component instantiates a new TextXJoinResultsFactory during 
> initialisation, and passes it the "v

[jira] [Comment Edited] (LUCENE-2605) queryparser parses on whitespace

2016-06-14 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330012#comment-15330012
 ] 

Otis Gospodnetic edited comment on LUCENE-2605 at 6/14/16 8:31 PM:
---

[~steve_rowe] you are about to become everyone's hero and a household name! :)
Is this going to be in the upcoming 6.1?



was (Author: otis):
[~steve_rowe] you are about to become everyone's here and a household name! :)
Is this going to be in the upcoming 6.1?


> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Robert Muir
>Assignee: Steve Rowe
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
>
> The queryparser parses input on whitespace, and sends each whitespace 
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. vietnamese)
> Its also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> querytime, but
> in many cases they can't. Instead, preferably the queryparser would parse 
> around only real 'operators'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2605) queryparser parses on whitespace

2016-06-14 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330012#comment-15330012
 ] 

Otis Gospodnetic commented on LUCENE-2605:
--

[~steve_rowe] you are about to become everyone's here and a household name! :)
Is this going to be in the upcoming 6.1?


> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Robert Muir
>Assignee: Steve Rowe
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
>
> The queryparser parses input on whitespace, and sends each whitespace 
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. vietnamese)
> Its also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> querytime, but
> in many cases they can't. Instead, preferably the queryparser would parse 
> around only real 'operators'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7253) Make sparse doc values and segments merging more efficient

2016-05-07 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-7253:
-
Summary: Make sparse doc values and segments merging more efficient   (was: 
Sparse data in doc values and segments merging )

> Make sparse doc values and segments merging more efficient 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still big problem with Doc Values merges for sparse 
> fields. When we imagine 1 billion documents index it seems it doesn't matter 
> if all documents have value for this field or there is only 1 document with 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem but there are several cases in which one can expect having many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only small 
> subset of all fields. Average document contains 5-7 different numeric values. 
> As you can see data was very sparse in these fields. It turned out that 
> ingestion process was CPU-bound. Most of CPU time was spent in DocValues 
> related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during merging segments.
> Adrien Grand suggested to reduce the number of sparse fields and replace them 
> with smaller number of denser fields. This helped a lot but complicated 
> fields naming. 
> I am not much familiar with Doc Values source code but I have small 
> suggestion how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take an example 
> of numeric Doc Values. Would it be possible to replace Iterator which 
> "travels" through all documents with Iterator over collection of non empty 
> values? Of course this would require storing object (instead of numeric) 
> which contains value and document ID. Such an iterator could significantly 
> improve merge time of sparse Doc Values fields. IMHO this won't cause big 
> overhead for dense structures but it can be game changer for sparse 
> structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>new Iterable() {
>  @Override
>  public Iterator iterator() {
>return new NumericIterator(maxDoc, values, 
> docsWithField);
>  }
>});
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that variable called pending is used only internally in 
> NumericDocValuesWriter. I know pending is PackedLongValues and it wouldn't be 
> good to change it with different class (some kind of list) because this may 
> break DV performance for dense fields. I hope someone can suggest interesting 
> solutions for this problem :).
> It would be great if discussion about sparse Doc Values merge performance can 
> start here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7253) Sparse data in doc values and segments merging

2016-05-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268972#comment-15268972
 ] 

Otis Gospodnetic commented on LUCENE-7253:
--

My take on this:
# sparse fields are indeed not an abuse case
# my understanding of what Robert is saying is that he agrees with 1), but that 
current implementation is not geared for 1) and if existing DV code just 
modified slightly to improve performance then it would not be the right 
implementation
# Robert didn't actually mention -1 explicitly until David brought that up, 
although we all know that Robert could always throw in his -1 in the end, after 
a contributor has already spent hours or days making changes, just to have them 
rejected (but this is a general Lucene project problem that, I think, 
nobody has actually tried solving directly because it'd be painful)
# Robert actually proposed "The correct solution is to have a more next/advance 
type api geared at forward iteration rather than one that mimics an array. Then 
nulls can be handled in typical ways in various situations (eg rle). It should 
be possible esp that scoring is in order.", so my take is that if a contributor 
did exactly what Robert wants then this could potentially be accepted
# I assume the "correct approach" involves more changes and more coding and 
time.  I assume it would be useful to make a simpler and maybe not acceptable 
change first in order to get some numbers and see if it's even worth investing 
time in "correct approach"
# If the numbers look good then, because of a potential -1 from Robert, whoever 
takes on this challenge would have to be very clear, before any additional dev 
work, about what Robert wants, what he would -1, and what he would let in

> Sparse data in doc values and segments merging 
> ---
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.5, 6.0
>Reporter: Pawel Rog
>  Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data. 
> Unfortunately there is still big problem with Doc Values merges for sparse 
> fields. When we imagine 1 billion documents index it seems it doesn't matter 
> if all documents have value for this field or there is only 1 document with 
> value. Segment merge time is the same for both cases. In most cases this is 
> not a problem but there are several cases in which one can expect having many 
> fields with sparse doc values.
> I can describe an example. During performance tests of a system with large 
> number of sparse fields I realized that Doc Values merges are a bottleneck. I 
> had hundreds of different numeric fields. Each document contained only small 
> subset of all fields. Average document contains 5-7 different numeric values. 
> As you can see data was very sparse in these fields. It turned out that 
> ingestion process was CPU-bound. Most of CPU time was spent in DocValues 
> related methods (SingletonSortedNumericDocValues#setDocument, 
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
> DocValuesConsumer$4$1#setNext, ...) - mostly during merging segments.
> Adrien Grand suggested to reduce the number of sparse fields and replace them 
> with smaller number of denser fields. This helped a lot but complicated 
> fields naming. 
> I am not much familiar with Doc Values source code but I have small 
> suggestion how to improve Doc Values merges for sparse fields. I realized 
> that Doc Values producers and consumers use Iterators. Let's take an example 
> of numeric Doc Values. Would it be possible to replace Iterator which 
> "travels" through all documents with Iterator over collection of non empty 
> values? Of course this would require storing object (instead of numeric) 
> which contains value and document ID. Such an iterator could significantly 
> improve merge time of sparse Doc Values fields. IMHO this won't cause big 
> overhead for dense structures but it can be game changer for sparse 
> structures.
> This is what happens in NumericDocValuesWriter on flush
> {code}
> dvConsumer.addNumericField(fieldInfo,
>new Iterable() {
>  @Override
>  public Iterator iterator() {
>return new NumericIterator(maxDoc, values, 
> docsWithField);
>  }
>});
> {code}
> Before this happens during addValue, this loop is executed to fill holes.
> {code}
&

Re: Welcome Scott Blum as a Lucene/Solr committer!

2016-04-19 Thread Otis Gospodnetic

Another welcome...from NYC subway ;)

Otis

> On Apr 19, 2016, at 05:21, Shalin Shekhar Mangar  wrote:
> 
> I'm pleased to announce that Scott Blum has accepted the Lucene PMC's 
> invitation to become a committer.
> 
> Scott, it's tradition that you introduce yourself with a brief bio.
> 
> Your handle "dragonsinth" has already added to the “lucene" LDAP group, so 
> you now have commit privileges. Please test this by adding yourself to the 
> committers section of the Who We Are page on the website: 
>  (use the ASF CMS bookmarklet at the 
> bottom of the page here:  - more info here 
> ).
> 
> The ASF dev page also has lots of useful links: .
> 
> Congratulations and welcome!
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.

Re: Welcome Kevin Rsiden as Lucene/Solr committer

2016-03-18 Thread Otis Gospodnetic

Congratulations and welcome!

Otis

> On Mar 16, 2016, at 13:02, Joel Bernstein  wrote:
> 
> I'm pleased to announce that Kevin Risden has accepted the PMC's invitation 
> to become a committer.
> 
> Kevin, it's tradition that you introduce yourself with a brief bio.
> 
> I believe your account has been setup and karma has been granted so that you 
> can add yourself to the committers section of the Who We Are page on the 
> website:
> .
> 
> Congratulations and welcome!
> 
> 
> Joel Bernstein
>

[jira] [Commented] (SOLR-6568) Join Discovery Contrib

2016-03-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177046#comment-15177046
 ] 

Otis Gospodnetic commented on SOLR-6568:


Hey [~joel.bernstein], has this been superseded by other joins?

> Join Discovery Contrib
> --
>
> Key: SOLR-6568
> URL: https://issues.apache.org/jira/browse/SOLR-6568
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Minor
> Fix For: 5.0
>
>
> This contribution was commissioned by the *NCBI* (National Center for 
> Biotechnology Information). 
> The Join Discovery Contrib is a set of Solr plugins that support large scale 
> joins and "join facets" between Solr cores. 
> There are two different Join implementations included in this contribution. 
> Both implementations are designed to work with integer join keys. It is very 
> common in large BioInformatic and Genomic databases to use integer primary 
> and foreign keys. Integer keys allow Bioinformatic and Genomic search engines 
> and discovery tools to perform complex operations on large data sets very 
> efficiently. 
> The Join Discovery Contrib provides features that will be applicable to 
> anyone working with the freely available databases from the NCBI and likely a 
> large number of other BioInformatic and Genomic databases. These features are 
> not specific though to Bioinformatics and Genomics, they can be used in any 
> datasets where integer keys are used to define the primary and foreign keys.
> What is included in this contrib:
> 1) A new JoinComponent. This component is used instead of the standard 
> QueryComponent. It facilitates very large scale relational joins between two 
> Solr indexes (cores). The join algorithm used in this component is known as a 
> *parallel partitioned merge join*. This is an algorithm which partitions the 
> results from both sides of the join and then sorts and merges the partitions 
> in parallel. 
>  Below are some of it's features:
> * Sub-second performance on very large joins. The parallel join algorithm is 
> capable of sub-second performance on joins with tens of millions of records 
> on both sides of the join.
> * The JoinComponent returns "tuples" with fields from both sides of the join. 
> The initial release returns the primary keys from both sides of the join and 
> the join key. 
> * The tuples also include, and are ranked by, a combined score from both 
> sides of the join.
> * Special purpose memory-mapped on-disk indexes to support \*:\* joins. This 
> makes it possible to join an entire index with a sub-set of another index 
> with sub-second performance. 
> * Support for very fast one-to-one, one-to-many and many-to-many joins. Fast 
> many-to-many joins make it possible to join between indexes on multi-value 
> fields. 
> 2) A new JoinFacetComponent. This component provides facets for both indexes 
> involved in the join. 
> 3) The BitSetJoinQParserPlugin. A very fast parallel filter join based on 
> bitsets that supports infinite levels of nesting. It can be used as a filter 
> query in combination with the JoinComponent or with the standard query 
> component. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-8228) Facet Telemetry

2015-11-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020494#comment-15020494
 ] 

Otis Gospodnetic commented on SOLR-8228:


{quote}
Although not necessarily part of this initial issue, we should think about how 
to get information about certain requests that does not involve modifying the 
actual request or response. For example, "log telemetry data for the next N 
requests that match this pattern". Something like that would more naturally 
point to method 1 for returning the data (i.e. separate from the response).
{quote}

Yes. Think operations and monitoring, and tools that need to collect this data, 
but are obviously not Solr clients issuing queries and collecting this info 
from responses.  So, logs, JMX, stats API, that sort of stuff.

> Facet Telemetry
> ---
>
> Key: SOLR-8228
> URL: https://issues.apache.org/jira/browse/SOLR-8228
> Project: Solr
>  Issue Type: New Feature
>  Components: Facet Module
>Reporter: Yonik Seeley
> Fix For: Trunk
>
>
> As the JSON Facet API becomes more complex and has more optimizations, it 
> would be nice to get a better view of what is going on in faceting... what 
> methods/algorithms are being used and what is taking up the most time or 
> memory.
>   - the strategy/method used to facet the field
>   - number of unique values in facet field
>   - memory usage of facet field itself
>   - memory usage for request (count arrays, etc)
>   - timing of various parts of facet request (finding top N, executing 
> sub-facets, etc)
> This will also help with unit tests, making sure we have proper coverage of 
> various optimizations.
> Some of this information collection may make sense to happen all the time, 
> while other information may be calculated only if requested.
> When adding facet info to a response, it could be done one of two ways:
>  1. in the existing debug block in the response, along with other debug info, 
> structured like 
>  2. directly in the facet response (i.e. in something like "\_debug\_" that 
> is a sibling of "buckets")
> We need to also consider how to merge distributed debug info (and add more 
> info about the distributed phase as well).  Given this, (2) may be simpler 
> (adding directly to facet response) as we already have a framework for 
> merging.
> Although not necessarily part of this initial issue, we should think about 
> how to get information about certain requests that does not involve modifying 
> the actual request or response.  For example, "log telemetry data for the 
> next N requests that match this pattern".  Something like that would more 
> naturally point to method 1 for returning the data (i.e. separate from the 
> response).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Dennis Gove as Lucene/Solr committer

2015-11-06 Thread Otis Gospodnetic

Welcome and great work so far!

Otis

 

> On Nov 6, 2015, at 10:19, Joel Bernstein  wrote:
> 
> I'm pleased to announce that Dennis Gove has accepted the PMC's
> invitation to become a committer.
> 
> Dennis, it's tradition that you introduce yourself with a brief bio.
> 
> Your account is not entirely ready yet. We will let you know when it is 
> created
> and karma has been granted so that you can add yourself to the committers
> section of the Who We Are page on the website:
> .
> 
> Congratulations and welcome!
> 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/

[jira] [Commented] (SOLR-8095) Allow disabling HDFS Locality Metrics

2015-09-24 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907335#comment-14907335
 ] 

Otis Gospodnetic commented on SOLR-8095:


But why does having these metrics bother anyone?  Never heard of turning 
metrics on/off.  If it's just sitting there in JMX, it shouldn't bother any 
one, unless they are very expensive to compute or?

> Allow disabling HDFS Locality Metrics
> -
>
> Key: SOLR-8095
> URL: https://issues.apache.org/jira/browse/SOLR-8095
> Project: Solr
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Mike Drob
>  Labels: metrics
> Fix For: Trunk
>
> Attachments: SOLR-8095.patch
>
>
> We added metrics, but not a way to configure/turn them off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5379) Query-time multi-word synonym expansion

2015-09-01 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726326#comment-14726326
 ] 

Otis Gospodnetic commented on SOLR-5379:


There is a patch for 4.10.3, but it was not committed, so this is still not 
available in Solr AFAIK.  Would be great to get this into 5.x.

> Query-time multi-word synonym expansion
> ---
>
> Key: SOLR-5379
> URL: https://issues.apache.org/jira/browse/SOLR-5379
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Reporter: Tien Nguyen Manh
>  Labels: multi-word, queryparser, synonym
> Fix For: 4.9, Trunk
>
> Attachments: conf-test-files-4_8_1.patch, quoted-4_8_1.patch, 
> quoted.patch, solr-5379-version-4.10.3.patch, synonym-expander-4_8_1.patch, 
> synonym-expander.patch
>
>
> While dealing with synonym at query time, solr failed to work with multi-word 
> synonyms due to some reasons:
> - First the lucene queryparser tokenizes user query by space so it split 
> multi-word term into two terms before feeding to synonym filter, so synonym 
> filter can't recognized multi-word term to do expansion
> - Second, if synonym filter expand into multiple terms which contains 
> multi-word synonym, The SolrQueryParseBase currently use MultiPhraseQuery to 
> handle synonyms. But MultiPhraseQuery don't work with term have different 
> number of words.
> For the first one, we can extend quoted all multi-word synonym in user query 
> so that lucene queryparser don't split it. There are a jira task related to 
> this one https://issues.apache.org/jira/browse/LUCENE-2605.
> For the second, we can replace MultiPhraseQuery by an appropriate BoleanQuery 
> SHOULD which contains multiple PhraseQuery in case tokens stream have 
> multi-word synonym.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Issue Comment Deleted] (SOLR-7143) MoreLikeThis Query Parser does not handle multiple field names

2015-07-02 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7143:
---
Comment: was deleted

(was: Not sure how this ended in my private e-mail.

We were to suggest them to upgrade to 5.x so that, amongst other fixes and
improvements, they can use this new MLT QueryParser to solve problem with
non-functioning MLT handler in cloud mode (
https://issues.apache.org/jira/browse/SOLR-788), but now it seems that even
5 would have to be patched (if those patches work).
On Thu, Jul 2, 2015 at 11:59 PM Otis Gospodnetic (JIRA) j...@apache.org

)

 MoreLikeThis Query Parser does not handle multiple field names
 --

 Key: SOLR-7143
 URL: https://issues.apache.org/jira/browse/SOLR-7143
 Project: Solr
  Issue Type: Bug
  Components: query parsers
Affects Versions: 5.0
Reporter: Jens Wille
Assignee: Anshum Gupta
 Attachments: SOLR-7143.patch, SOLR-7143.patch, SOLR-7143.patch, 
 SOLR-7143.patch, SOLR-7143.patch


 The newly introduced MoreLikeThis Query Parser (SOLR-6248) does not return 
 any results when supplied with multiple fields in the {{qf}} parameter.
 To reproduce within the techproducts example, compare:
 {code}
 curl 
 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name%7DMA147LL/A'
 curl 
 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=features%7DMA147LL/A'
 curl 
 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name,features%7DMA147LL/A'
 {code}
 The first two queries return 8 and 5 results, respectively. The third query 
 doesn't return any results (not even the matched document).
 In contrast, the MoreLikeThis Handler works as expected (accounting for the 
 default {{mintf}} and {{mindf}} values in SimpleMLTQParser):
 {code}
 curl 
 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/Amlt.fl=namemlt.mintf=1mlt.mindf=1'
 curl 
 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/Amlt.fl=featuresmlt.mintf=1mlt.mindf=1'
 curl 
 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/Amlt.fl=name,featuresmlt.mintf=1mlt.mindf=1'
 {code}
 After adding the following line to 
 {{example/techproducts/solr/techproducts/conf/solrconfig.xml}}:
 {code:language=XML}
 requestHandler name=/mlt class=solr.MoreLikeThisHandler /
 {code}
 The first two queries return 7 and 4 results, respectively (excluding the 
 matched document). The third query returns 7 results, as one would expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7143) MoreLikeThis Query Parser does not handle multiple field names

2015-07-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612595#comment-14612595
 ] 

Otis Gospodnetic commented on SOLR-7143:


[~anshumg] - which Solr version is this going in? Fix Version is empty.  Thanks.

 MoreLikeThis Query Parser does not handle multiple field names
 --

 Key: SOLR-7143
 URL: https://issues.apache.org/jira/browse/SOLR-7143
 Project: Solr
  Issue Type: Bug
  Components: query parsers
Affects Versions: 5.0
Reporter: Jens Wille
Assignee: Anshum Gupta
 Attachments: SOLR-7143.patch, SOLR-7143.patch, SOLR-7143.patch, 
 SOLR-7143.patch, SOLR-7143.patch


 The newly introduced MoreLikeThis Query Parser (SOLR-6248) does not return 
 any results when supplied with multiple fields in the {{qf}} parameter.
 To reproduce within the techproducts example, compare:
 {code}
 curl 
 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name%7DMA147LL/A'
 curl 
 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=features%7DMA147LL/A'
 curl 
 'http://localhost:8983/solr/techproducts/select?q=%7B!mlt+qf=name,features%7DMA147LL/A'
 {code}
 The first two queries return 8 and 5 results, respectively. The third query 
 doesn't return any results (not even the matched document).
 In contrast, the MoreLikeThis Handler works as expected (accounting for the 
 default {{mintf}} and {{mindf}} values in SimpleMLTQParser):
 {code}
 curl 
 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/Amlt.fl=namemlt.mintf=1mlt.mindf=1'
 curl 
 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/Amlt.fl=featuresmlt.mintf=1mlt.mindf=1'
 curl 
 'http://localhost:8983/solr/techproducts/mlt?q=id:MA147LL/Amlt.fl=name,featuresmlt.mintf=1mlt.mindf=1'
 {code}
 After adding the following line to 
 {{example/techproducts/solr/techproducts/conf/solrconfig.xml}}:
 {code:language=XML}
 requestHandler name=/mlt class=solr.MoreLikeThisHandler /
 {code}
 The first two queries return 7 and 4 results, respectively (excluding the 
 matched document). The third query returns 7 results, as one would expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (SOLR-7571) Return metrics with update requests to allow clients to self-throttle

2015-06-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570150#comment-14570150
 ] 

Otis Gospodnetic edited comment on SOLR-7571 at 6/3/15 2:25 AM:


[~erickerickson] - Not sure yet, would have to see which numbers you'd end up 
having, but I'm guessing you could construct MBean names in such a way that the 
name of the leader would be a part of its name (think about parsing, allowed 
vs. forbidden chars!) and the metric you want to return would be the MBean 
value.

Now that I think about this, you should also probably think about how the size 
of Solr response would be affected by more info being added to every response 
and how much that would affect the client that has to process this.  Providing 
this via JMX, which does not stand between client and server on every request, 
and is checked independently of search requests, my in some ways be better.


was (Author: otis):
[~erickerickson] - Not sure yet, would have to see which numbers you'd end up 
having, but I'm guessing you could construct MBean names in such a way that the 
name of the leader would be a part of its name (think about parsing, allowed 
vs. forbidden chars!) and the metric you want to return would be the MBean 
value.

 Return metrics with update requests to allow clients to self-throttle
 -

 Key: SOLR-7571
 URL: https://issues.apache.org/jira/browse/SOLR-7571
 Project: Solr
  Issue Type: Improvement
Affects Versions: 4.10.3
Reporter: Erick Erickson
Assignee: Erick Erickson

 I've assigned this to myself to keep track of it, anyone who wants please 
 feel free to take this.
 I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client 
 (and post.jar for json files for that matter) firehose updates (150 separate 
 threads in total) at Solr. Eventually, replicas (not leaders) go into 
 recovery and the state cascades and eventually the entire cluster becomes 
 unusable. SOLR-5850 delays the behavior, but it still occurs. There are no 
 errors in the follower's logs this is leader-initiated-recovery because of a 
 timeout.
 I think the root problem is that the client is just sending too many requests 
 to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to 
 distribute update requests to all the followers) (this was observed in Solr 
 4.10.3+).  I see thread counts of 500+ when this happens.
 So assuming that this is the root cause, the obvious cure is don't index 
 that fast. This is unsatisfactory since that fast is variable, the only 
 recourse is to set that threshold low enough that the Solr cluster isn't 
 being driven as fast is it can be.
 We should provide some mechanism for having the client throttle itself. The 
 number of outstanding update threads is one possibility. The client could 
 then slow down sending updates to Solr. 
 I'm not sure there's a good way to deal with this on the server. Once the 
 timeout is encountered, you don't know whether the doc has actually been 
 indexed on the follower (actually, in this case it _is_ indexed, it just take 
 a while). Ideally we'd just manage it all magically, but an alternative to 
 let clients dynamically throttle themselves seems do-able.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7571) Return metrics with update requests to allow clients to self-throttle

2015-06-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570150#comment-14570150
 ] 

Otis Gospodnetic commented on SOLR-7571:


[~erickerickson] - Not sure yet, would have to see which numbers you'd end up 
having, but I'm guessing you could construct MBean names in such a way that the 
name of the leader would be a part of its name (think about parsing, allowed 
vs. forbidden chars!) and the metric you want to return would be the MBean 
value.

 Return metrics with update requests to allow clients to self-throttle
 -

 Key: SOLR-7571
 URL: https://issues.apache.org/jira/browse/SOLR-7571
 Project: Solr
  Issue Type: Improvement
Affects Versions: 4.10.3
Reporter: Erick Erickson
Assignee: Erick Erickson

 I've assigned this to myself to keep track of it, anyone who wants please 
 feel free to take this.
 I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client 
 (and post.jar for json files for that matter) firehose updates (150 separate 
 threads in total) at Solr. Eventually, replicas (not leaders) go into 
 recovery and the state cascades and eventually the entire cluster becomes 
 unusable. SOLR-5850 delays the behavior, but it still occurs. There are no 
 errors in the follower's logs this is leader-initiated-recovery because of a 
 timeout.
 I think the root problem is that the client is just sending too many requests 
 to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to 
 distribute update requests to all the followers) (this was observed in Solr 
 4.10.3+).  I see thread counts of 500+ when this happens.
 So assuming that this is the root cause, the obvious cure is don't index 
 that fast. This is unsatisfactory since that fast is variable, the only 
 recourse is to set that threshold low enough that the Solr cluster isn't 
 being driven as fast is it can be.
 We should provide some mechanism for having the client throttle itself. The 
 number of outstanding update threads is one possibility. The client could 
 then slow down sending updates to Solr. 
 I'm not sure there's a good way to deal with this on the server. Once the 
 timeout is encountered, you don't know whether the doc has actually been 
 indexed on the follower (actually, in this case it _is_ indexed, it just take 
 a while). Ideally we'd just manage it all magically, but an alternative to 
 let clients dynamically throttle themselves seems do-able.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7571) Return metrics with update requests to allow clients to self-throttle

2015-06-02 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569459#comment-14569459
]

Otis Gospodnetic commented on SOLR-7571:

If you are going to be keeping track of any (new) metrics around this, in
addition to possibly returning them to clients, please expose them via JMX, so
monitoring tools can expose what is going on with the Solr cluster, too. This
can then trigger alert events and alerts events can trigger actions, such as
reducing the indexing rate.

Return metrics with update requests to allow clients to self-throttle
-

Key: SOLR-7571
URL: https://issues.apache.org/jira/browse/SOLR-7571
Project: Solr
Issue Type: Improvement
Affects Versions: 4.10.3
Reporter: Erick Erickson
Assignee: Erick Erickson

I've assigned this to myself to keep track of it, anyone who wants please
feel free to take this.
I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client
(and post.jar for json files for that matter) firehose updates (150 separate
threads in total) at Solr. Eventually, replicas (not leaders) go into
recovery and the state cascades and eventually the entire cluster becomes
unusable. SOLR-5850 delays the behavior, but it still occurs. There are no
errors in the follower's logs this is leader-initiated-recovery because of a
timeout.
I think the root problem is that the client is just sending too many requests
to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to
distribute update requests to all the followers) (this was observed in Solr
4.10.3+). I see thread counts of 500+ when this happens.
So assuming that this is the root cause, the obvious cure is don't index
that fast. This is unsatisfactory since that fast is variable, the only
recourse is to set that threshold low enough that the Solr cluster isn't
being driven as fast is it can be.
We should provide some mechanism for having the client throttle itself. The
number of outstanding update threads is one possibility. The client could
then slow down sending updates to Solr.
I'm not sure there's a good way to deal with this on the server. Once the
timeout is encountered, you don't know whether the doc has actually been
indexed on the follower (actually, in this case it _is_ indexed, it just take
a while). Ideally we'd just manage it all magically, but an alternative to
let clients dynamically throttle themselves seems do-able.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-7458) Expose HDFS Block Locality Metrics

2015-04-23 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7458:
---
Labels: metrics  (was: )

 Expose HDFS Block Locality Metrics
 --

 Key: SOLR-7458
 URL: https://issues.apache.org/jira/browse/SOLR-7458
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud
Reporter: Mike Drob
  Labels: metrics
 Attachments: SOLR-7458.patch


 We should publish block locality metrics when using HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-7457) Make DirectoryFactory publishing MBeanInfo extensible

2015-04-23 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7457:
---
Labels: metrics  (was: )

 Make DirectoryFactory publishing MBeanInfo extensible
 -

 Key: SOLR-7457
 URL: https://issues.apache.org/jira/browse/SOLR-7457
 Project: Solr
  Issue Type: Improvement
  Components: JMX
Affects Versions: 5.0
Reporter: Mike Drob
  Labels: metrics
 Fix For: Trunk, 5.2

 Attachments: SOLR-7457.patch


 In SOLR-6766, we added JMX to the HdfsDirectoryFactory. However, the 
 implementation is pretty brittle and difficult to extend.
 It is conceivable that any implementation of DirectoryFactory might have 
 MInfoBeans that it would like to expose, so we should explicitly accommodate 
 that instead of relying on a side effect of the SolrResourceLoader's 
 behaviour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7344) Use two thread pools, one for internal requests and one for external, to avoid distributed deadlock and decrease the number of threads that need to be created.

2015-04-21 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505658#comment-14505658
 ] 

Otis Gospodnetic commented on SOLR-7344:


[~hgadre] - didn't check the patch, but does that means that we will now be 
able to see request metrics (counts, latencies) for internal vs. external 
requests separately?  That would be awesome because current metrics don't make 
this distinction.

 Use two thread pools, one for internal requests and one for external, to 
 avoid distributed deadlock and decrease the number of threads that need to be 
 created.
 ---

 Key: SOLR-7344
 URL: https://issues.apache.org/jira/browse/SOLR-7344
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud
Reporter: Mark Miller
 Attachments: SOLR-7344.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-2867) Problem Wtih solr Score Display

2015-04-08 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-2867:
---
Labels:   (was: patch)

 Problem Wtih solr Score Display
 ---

 Key: SOLR-2867
 URL: https://issues.apache.org/jira/browse/SOLR-2867
 Project: Solr
  Issue Type: Bug
  Components: SearchComponents - other
Affects Versions: 3.1
 Environment: Linux and Mysql
Reporter: Pragyanjeet Rout
   Original Estimate: 24h
  Remaining Estimate: 24h

 We are firing a solr query and checking its relevancy score.
 But problem with relevancy score is that for some results the value for score 
 is been truncated.
 Example:-I have a query as below
 http://localhost:8983/solr/mywork/select/?q=( contractLength:12 speedScore:[4 
 TO 7] dataScore:[2 TO *])fq=( ( connectionType:Cable 
 connectionType:Naked)AND ( monthlyCost:[* TO *])AND ( speedScore:[4 TO 
 *])AND ( dataScore:[2 TO 
 *]))version=2.2start=0rows=500indent=onsort=score desc, planType asc, 
 monthlyCost1 asc, monthlyCost2  asc
 The below mentioned is my xml returned from solr :-
 doc
 float name=score3.6897283/float
 int name=contractLength12/int
 int name=dataScore3/int
 str name=prodidABC/str
 float name=monthlyCost120.9/float
 int name=speedScore7/int
 /doc
 doc
 float name=score3.689728/float
 int name=contractLength12/int
 int name=dataScore2/int
 str name=prodidDEF/str
 float name=monthlyCost49.95/float
 int name=speedScore6/int
 /doc
 I have used the debugQuery=true in query and I saw solr is calculating the 
 correct score(PSB) but somehow is it truncating the lastdigit i.e 3 from 
 the second result.
 Because of this my ranking order gets disturbed and I get wrong results while 
 displaying 
 str name=ABC
 3.6897283 = (MATCH) sum of:3.1476827 = (MATCH) weight(contractLength:€#0;#12; 
 in 51), product of:0.92363054 = queryWeight(contractLength:€#0;#12;), product 
 of:3.4079456 = idf(docFreq=8, maxDocs=100)  0.27102268 = queryNorm 3.4079456 
 = (MATCH) fieldWeight(contractLength:€#0;#12; in 51), product of:1.0 = 
 tf(termFreq(contractLength:€#0;#12;)=1) 3.4079456 = idf(docFreq=8, 
 maxDocs=100)
   1.0 = fieldNorm(field=contractLength, doc=51)  0.27102268 = (MATCH) 
 ConstantScore(speedScore:[€#0;#4; TO *]), product of:
 1.0 = boost  0.27102268 = queryNorm  0.27102268 = (MATCH) 
 ConstantScore(dataScore:[€#0;#2; TO *]), product of: 1.0 = boost   0.27102268 
 = queryNorm
 /str
 str name=DEF
 3.6897283 = (MATCH) sum of: 3.1476827 = (MATCH) 
 weight(contractLength:€#0;#12; in 97), product of: 0.92363054 = 
 queryWeight(contractLength:€#0;#12;), product of: 3.4079456 = idf(docFreq=8, 
 maxDocs=100)  0.27102268 = queryNorm 3.4079456 = (MATCH) 
 fieldWeight(contractLength:€#0;#12; in 97), product of: 1.0 = 
 tf(termFreq(contractLength:€#0;#12;)=1) 3.4079456 = idf(docFreq=8, 
 maxDocs=100)  1.0 = fieldNorm(field=contractLength, doc=97)  0.27102268 = 
 (MATCH) ConstantScore(speedScore:[€#0;#4; TO *]), product of: 1.0 = boost
 0.27102268 = queryNorm  0.27102268 = (MATCH) 
 ConstantScore(dataScore:[€#0;#2; TO *]), product of:1.0 = boost
 0.27102268 = queryNorm
 /str
 Please educate me for the above behaviour from solr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Issue Comment Deleted] (SOLR-7319) Workaround the Four Month Bug causing GC pause problems

2015-03-27 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7319:
---
Comment: was deleted

(was: bq. tools like jstat will no longer function

Sounds problematic, no?)

 Workaround the Four Month Bug causing GC pause problems
 -

 Key: SOLR-7319
 URL: https://issues.apache.org/jira/browse/SOLR-7319
 Project: Solr
  Issue Type: Bug
  Components: scripts and tools
Affects Versions: 5.0
Reporter: Shawn Heisey
Assignee: Shawn Heisey
 Fix For: 5.1

 Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch


 A twitter engineer found a bug in the JVM that contributes to GC pause 
 problems:
 http://www.evanjones.ca/jvm-mmap-pause.html
 Problem summary (in case the blog post disappears):  The JVM calculates 
 statistics on things like garbage collection and writes them to a file in the 
 temp directory using MMAP.  If there is a lot of other MMAP write activity, 
 which is precisely how Lucene accomplishes indexing and merging, it can 
 result in a GC pause because the mmap write to the temp file is delayed.
 We should implement the workaround in the solr start scripts (disable 
 creation of the mmap statistics tempfile) and document the impact in 
 CHANGES.txt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7319) Workaround the Four Month Bug causing GC pause problems

2015-03-27 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14384472#comment-14384472
 ] 

Otis Gospodnetic commented on SOLR-7319:


bq. tools like jstat will no longer function

Sounds problematic, no?

 Workaround the Four Month Bug causing GC pause problems
 -

 Key: SOLR-7319
 URL: https://issues.apache.org/jira/browse/SOLR-7319
 Project: Solr
  Issue Type: Bug
  Components: scripts and tools
Affects Versions: 5.0
Reporter: Shawn Heisey
Assignee: Shawn Heisey
 Fix For: 5.1

 Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch


 A twitter engineer found a bug in the JVM that contributes to GC pause 
 problems:
 http://www.evanjones.ca/jvm-mmap-pause.html
 Problem summary (in case the blog post disappears):  The JVM calculates 
 statistics on things like garbage collection and writes them to a file in the 
 temp directory using MMAP.  If there is a lot of other MMAP write activity, 
 which is precisely how Lucene accomplishes indexing and merging, it can 
 result in a GC pause because the mmap write to the temp file is delayed.
 We should implement the workaround in the solr start scripts (disable 
 creation of the mmap statistics tempfile) and document the impact in 
 CHANGES.txt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7296) Reconcile facetting implementations

2015-03-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381276#comment-14381276
 ] 

Otis Gospodnetic commented on SOLR-7296:


IIRC it requires a sidecar index, which is probably its main negative.

 Reconcile facetting implementations
 ---

 Key: SOLR-7296
 URL: https://issues.apache.org/jira/browse/SOLR-7296
 Project: Solr
  Issue Type: Task
  Components: faceting
Reporter: Steve Molloy

 SOLR-7214 introduced a new way of controlling faceting, the unmbrella 
 SOLR-6348 brings a lot of improvements in facet functionality, namely around 
 pivots. Both make a lot of sense from a user perspective, but currently have 
 completely different implementations. With the analytics components, this 
 makes 3 implementation of the same logic, which is bound to behave 
 differently as time goes by. We should reconcile all implementations to ease 
 maintenance and offer consistent behaviour no matter how parameters are 
 passed to the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Ramkumar Aiyengar as Lucene/Solr committer

2015-03-03 Thread Otis Gospodnetic

Congratulations!

Otis

 

 On Mar 1, 2015, at 23:39, Shalin Shekhar Mangar shalinman...@gmail.com 
 wrote:
 
 I'm pleased to announce that Ramkumar Aiyengar has accepted the PMC's 
 invitation to become a committer.
 
 Ramkumar, it's tradition that you introduce yourself with a brief bio.
 
 Your handle andyetitmoves has already added to the “lucene LDAP group, so 
 you now have commit privileges. Please test this by adding yourself to the 
 committers section of the Who We Are page on the website: 
 http://lucene.apache.org/whoweare.html (use the ASF CMS bookmarklet at the 
 bottom of the page here: https://cms.apache.org/#bookmark - more info here 
 http://www.apache.org/dev/cms.html).
 
 The ASF dev page also has lots of useful links: http://www.apache.org/dev/.
 
 Congratulations and welcome!
 
 -- 
 Regards,
 Shalin Shekhar Mangar.

[jira] [Commented] (SOLR-7121) Solr nodes should go down based on configurable thresholds and not rely on resource exhaustion

2015-03-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344566#comment-14344566
 ] 

Otis Gospodnetic commented on SOLR-7121:


bq. I will surely add these metrics to JMX but can we handle that in a 
follow-up ticket to this one?

Sure, if that works for you.

 Solr nodes should go down based on configurable thresholds and not rely on 
 resource exhaustion
 --

 Key: SOLR-7121
 URL: https://issues.apache.org/jira/browse/SOLR-7121
 Project: Solr
  Issue Type: New Feature
Reporter: Sachin Goyal
 Attachments: SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch, 
 SOLR-7121.patch, SOLR-7121.patch


 Currently, there is no way to control when a Solr node goes down.
 If the server is having high GC pauses or too many threads or is just getting 
 too many queries due to some bad load-balancer, the cores in the machine keep 
 on serving unless they exhaust the machine's resources and everything comes 
 to a stall.
 Such a slow-dying core can affect other cores as well by taking huge time to 
 serve their distributed queries.
 There should be a way to specify some threshold values beyond which the 
 targeted core can its ill-health and proactively go down to recover.
 When the load improves, the core should come up automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-01 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342655#comment-14342655
]

Otis Gospodnetic commented on SOLR-7082:

Thanks Joel. Re 1) -- but conceptually and functionally speaking, would you
say this is more or less the same as ES aggregations?

Streaming Aggregation for SolrCloud
---

Key: SOLR-7082
URL: https://issues.apache.org/jira/browse/SOLR-7082
Project: Solr
Issue Type: New Feature
Components: SolrCloud
Reporter: Joel Bernstein
Fix For: Trunk, 5.1

Attachments: SOLR-7082.patch, SOLR-7082.patch

This issue provides a general purpose streaming aggregation framework for
SolrCloud. An overview of how it works can be found at this link:
http://heliosearch.org/streaming-aggregation-for-solrcloud/
This functionality allows SolrCloud users to perform operations that we're
typically done using map/reduce or a parallel computing platform.
Here is a brief explanation of how the framework works:
There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io*
Key classes:
*Tuple*: Abstracts a document in a search result as a Map of key/value pairs.
*TupleStream*: is the base class for all of the streams. Abstracts search
results as a stream of Tuples.
*SolrStream*: connects to a single Solr instance. You call the read() method
to iterate over the Tuples.
*CloudSolrStream*: connects to a SolrCloud collection and merges the results
based on the sort param. The merge takes place in CloudSolrStream itself.
*Decorator Streams*: wrap other streams to gather *Metrics* on streams and
*transform* the streams. Some examples are the MetricStream, RollupStream,
GroupByStream, UniqueStream, MergeJoinStream, HashJoinStream, MergeStream,
FilterStream.
*Going parallel with the ParallelStream and Worker Collections*
The io package also contains the *ParallelStream*, which wraps a TupleStream
and sends it to N worker nodes. The workers are chosen from a SolrCloud
collection. These Worker Collections don't have to hold any data, they can
just be used to execute TupleStreams.
*The StreamHandler*
The Worker nodes have a new RequestHandler called the *StreamHandler*. The
ParallelStream serializes a TupleStream, before it is opened, and sends it to
the StreamHandler on the Worker Nodes.
The StreamHandler on each Worker node deserializes the TupleStream, opens the
stream, iterates the tuples and streams them back to the ParallelStream. The
ParallelStream performs the final merge of Metrics and can be wrapped by
other Streams to handled the final merged TupleStream.
*Sorting and Partitioning search results (Shuffling)*
Each Worker node is shuffled 1/N of the document results. There is a
partitionKeys parameter that can be included with each TupleStream to
ensure that Tuples with the same partitionKeys are shuffled to the same
Worker. The actual partitioning is done with a filter query using the
HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in
the filter cache which provides extremely high performance hash partitioning.
Many of the stream transformations rely on the sort order of the TupleStreams
(GroupByStream, MergeJoinStream, UniqueStream, FilterStream etc..). To
accommodate this the search results can be sorted by specific keys. The
/export handler can be used to sort entire result sets efficiently.
By specifying the sort order of the results and the partition keys, documents
will be sorted and partitioned inside of the search engine. So when the
tuples hit the network they are already sorted, partitioned and headed
directly to correct worker node.
*Extending The Framework*
To extend the framework you create new TupleStream Decorators, that gather
custom metrics or perform custom stream transformations.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7159) Add httpclient connection stats to JMX report

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337716#comment-14337716
 ] 

Otis Gospodnetic commented on SOLR-7159:


[~vamsee] - yes, please split those into multiple attributes - I think this is 
more common/standard AND, importantly, easier for monitoring tools to work 
with. Think of MBeans in JMX as the API.  For example, what if in the next Solr 
version somebody decides to add another value to that {...} string?  Various 
tools out there that parsed this might break.

 Add httpclient connection stats to JMX report
 -

 Key: SOLR-7159
 URL: https://issues.apache.org/jira/browse/SOLR-7159
 Project: Solr
  Issue Type: Improvement
Affects Versions: 4.10.3
Reporter: Vamsee Yarlagadda
Priority: Minor
 Attachments: SOLR-7159.patch, SOLR-7159v2.patch, Screen Shot 
 2015-02-25 at 2.05.34 PM.png, Screen Shot 2015-02-25 at 2.05.45 PM.png, 
 jmx-layout.png


 Currently, we are logging the stats of httpclient as part of debug level.
 bq. 2015-01-20 13:47:48,640 DEBUG 
 org.apache.http.impl.conn.PoolingClientConnectionManager: Connection request: 
 [route: {}-http://plh04.wil.csc.local:8983][total kept alive: 254; route 
 allocated: 100 of 100; total allocated: 462 of 1]
 Instead, it would be good to expose these metrics via JMX for easy checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7136) Add an AutoPhrasing TokenFilter

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337734#comment-14337734
 ] 

Otis Gospodnetic commented on SOLR-7136:


[~tedsullivan] do you know / have you tested and compared this with SOLR-5379 
to the point where you can say that the functionality provided in this issue is 
a *superset* of SOLR-5379?  Or is that not the case?  Or maybe you didn't test 
and compare enough to be able to say?  Thanks.

 Add an AutoPhrasing TokenFilter
 ---

 Key: SOLR-7136
 URL: https://issues.apache.org/jira/browse/SOLR-7136
 Project: Solr
  Issue Type: New Feature
Reporter: Ted Sullivan
 Attachments: SOLR-7136.patch, SOLR-7136.patch


 Adds an 'autophrasing' token filter which is designed to enable noun phrases 
 that represent a single entity to be tokenized in a singular fashion. Adds 
 support for ManagedResources and Query parser auto-phrasing support given 
 LUCENE-2605.
 The rationale for this Token Filter and its use in solving the long standing 
 multi-term synonym problem in Lucene Solr has been documented online. 
 http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
 https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Varun Thacker as Lucene/Solr committer

2015-02-25 Thread Otis Gospodnetic

Congratulations and thank you for all your work!

Otis


On Tue, Feb 24, 2015 at 5:52 AM, Varun Thacker varunthacker1...@gmail.com
wrote:

 Thank you everyone for the kind welcome!  Right from the days I was a
 student contributing as part of GSoC, It's been an honour to be part of
 this community. I'm looking forward to committing to the project.

 I am currently working for Lucidworks. I like to base a lot of my work
 around the issues users face with Lucene/Solr.

 Before Lucidworks I was part of a startup called Unbxd. I worked on
 building their search platform tailored for eCommerce.

 Looking forward to seeing everyone at future conferences.

 Hi All,

 Please join me in welcoming Varun Thacker as the latest committer on
 Lucene and Solr.

 Varun, tradition is for you to provide a brief bio about yourself.

 Welcome aboard!

 -Grant

[jira] [Commented] (SOLR-7121) Solr nodes should go down based on configurable thresholds and not rely on resource exhaustion

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337766#comment-14337766
 ] 

Otis Gospodnetic commented on SOLR-7121:


[~sachingoyal] - would it make sense to expose all metrics you rely on via JMX? 
 That way monitoring tools would be able to extract this data and graph it 
which, in addition to logs, would help people do post-mortem, understand what 
happened, which metric(s) went up or down, what it/their historical values 
were, maybe set alerts based on that, etc.

 Solr nodes should go down based on configurable thresholds and not rely on 
 resource exhaustion
 --

 Key: SOLR-7121
 URL: https://issues.apache.org/jira/browse/SOLR-7121
 Project: Solr
  Issue Type: New Feature
Reporter: Sachin Goyal
 Attachments: SOLR-7121.patch, SOLR-7121.patch, SOLR-7121.patch


 Currently, there is no way to control when a Solr node goes down.
 If the server is having high GC pauses or too many threads or is just getting 
 too many queries due to some bad load-balancer, the cores in the machine keep 
 on serving unless they exhaust the machine's resources and everything comes 
 to a stall.
 Such a slow-dying core can affect other cores as well by taking huge time to 
 serve their distributed queries.
 There should be a way to specify some threshold values beyond which the 
 targeted core can its ill-health and proactively go down to recover.
 When the load improves, the core should come up automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6832) Queries be served locally rather than being forwarded to another replica

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337776#comment-14337776
 ] 

Otis Gospodnetic commented on SOLR-6832:


bq. The performance gain increases if coresPerMachine is  1 and a single JVM 
has cores from 'k' shards.

Ever managed to measure how much this feature helps in various scenarios?

bq. For a distributed query, the request is always sent to all the shards even 
if the originating SolrCore (handling the original distributed query) is a 
replica of one of the shards.  If the original Solr-Core can check itself 
before sending http requests for any shard, we can probably save some network 
hopping and gain some performance.

This sounds as like it saves only a N local calls out of M, where M  N, N is 
the number of local replicas that could be queried locally, and M is the total 
number of primary shards in the cluster that are to be queries.  Is this 
correct?

So say there are 20 shards spread evenly over 20 nodes (i.e., 1 shard per node) 
and a query request comes in, the node that got the request will query send 19 
requests to the remaining 19 nodes and thus save just one network trip by 
querying a local shard?  I must be missing something...

 Queries be served locally rather than being forwarded to another replica
 

 Key: SOLR-6832
 URL: https://issues.apache.org/jira/browse/SOLR-6832
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.10.2
Reporter: Sachin Goyal
Assignee: Timothy Potter
 Fix For: Trunk, 5.1

 Attachments: SOLR-6832.patch, SOLR-6832.patch, SOLR-6832.patch, 
 SOLR-6832.patch


 Currently, I see that code flow for a query in SolrCloud is as follows:
 For distributed query:
 SolrCore - SearchHandler.handleRequestBody() - HttpShardHandler.submit()
 For non-distributed query:
 SolrCore - SearchHandler.handleRequestBody() - QueryComponent.process()
 \\
 \\
 \\
 For a distributed query, the request is always sent to all the shards even if 
 the originating SolrCore (handling the original distributed query) is a 
 replica of one of the shards.
 If the original Solr-Core can check itself before sending http requests for 
 any shard, we can probably save some network hopping and gain some 
 performance.
 \\
 \\
 We can change SearchHandler.handleRequestBody() or HttpShardHandler.submit() 
 to fix this behavior (most likely the former and not the latter).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7110) Optimize JavaBinCodec to minimize string Object creation

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337788#comment-14337788
 ] 

Otis Gospodnetic commented on SOLR-7110:


Possibly better ways to test:
* use something like SPM or VisualVM or anything that gives you visualization 
of:
** various memory pools (size + utilization) in the heap
** GC activity (frequency, avg time, max time, size, etc.)
** CPU usage
* enable GC logging, grep for FullGC, or run jstat

 all of this over time - not just a few minutes, but longer runs before 
patch vs. after patch.  Then you can really see what difference this makes.

 Optimize JavaBinCodec to minimize string Object creation
 

 Key: SOLR-7110
 URL: https://issues.apache.org/jira/browse/SOLR-7110
 Project: Solr
  Issue Type: Improvement
Reporter: Noble Paul
Assignee: Noble Paul
Priority: Minor
 Attachments: SOLR-7110.patch, SOLR-7110.patch


 In JavabinCodec we already optimize on strings creation , if they are 
 repeated in the same payload. if we use a cache it is possible to avoid 
 string creation across objects as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7147) Introduce new TrackingShardHandlerFactory for monitoring what requests are sent to shards during tests

2015-02-25 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337723#comment-14337723
 ] 

Otis Gospodnetic commented on SOLR-7147:


Is this TrackingShardHandlerFactory really useful only for tests?  Wouldn't 
this be a useful debugging tool in general?

 Introduce new TrackingShardHandlerFactory for monitoring what requests are 
 sent to shards during tests
 --

 Key: SOLR-7147
 URL: https://issues.apache.org/jira/browse/SOLR-7147
 Project: Solr
  Issue Type: Improvement
Reporter: Hoss Man
 Attachments: SOLR-7147.patch, SOLR-7147.patch, SOLR-7147.patch, 
 SOLR-7147.patch, SOLR-7147.patch, SOLR-7147.patch


 this is an idea shalin proposed as part of the testing for SOLR-7128...
 bq. I created a TrackingShardHandlerFactory which can record shard requests 
 sent from any node. There are a few helper methods to get requests by shard 
 and by purpose.
 ...
 bq. I will likely move the TrackingShardHandlerFactory into its own issue 
 because it is helpful for other distributed tests as well. I also need to 
 decouple it from the MiniSolrCloudCluster abstraction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-7090) Cross collection join

2015-02-25 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-7090:
---
Issue Type: New Feature  (was: Bug)

 Cross collection join
 -

 Key: SOLR-7090
 URL: https://issues.apache.org/jira/browse/SOLR-7090
 Project: Solr
  Issue Type: New Feature
Reporter: Ishan Chattopadhyaya
 Fix For: 5.1

 Attachments: SOLR-7090.patch


 Although SOLR-4905 supports joins across collections in Cloud mode, there are 
 limitations, (i) the secondary collection must be replicated at each node 
 where the primary collection has a replica, (ii) the secondary collection 
 must be singly sharded.
 This issue explores ideas/possibilities of cross collection joins, even 
 across nodes. This will be helpful for users who wish to maintain boosts or 
 signals in a secondary, more frequently updated collection, and perform query 
 time join of these boosts/signals with results from the primary collection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Resolved] (LUCENE-6288) FieldCacheRangeFilter missing from MIGRATE.html

2015-02-25 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-6288.
--
Resolution: Invalid

Please ask on the mailing list.

 FieldCacheRangeFilter missing from MIGRATE.html
 ---

 Key: LUCENE-6288
 URL: https://issues.apache.org/jira/browse/LUCENE-6288
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/other
Affects Versions: 5.0
Reporter: Torsten Krah

 Hi,
 i am searching the {{FieldCacheRangeFilter}} - its not mentioned in the 
 {{https://lucene.apache.org/core/5_0_0/MIGRATE.html}} document and not 
 mentioned in the Changelog  - where to find this one and if it is gone, is it 
 possible to mention this in the migration guide please and how to cope with 
 it in 5.x?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-02-23 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333490#comment-14333490
]

Otis Gospodnetic commented on SOLR-7082:

This looks really nice, Joel. 2 questions:
* this looks a lot like ES aggregations. Have you maybe made any comparisons
in terms of speed or memory footprint? (ES aggregations love heap)
* is this all going to land in Solr or will some of it remain in Heliosearch?

Streaming Aggregation for SolrCloud
---

Key: SOLR-7082
URL: https://issues.apache.org/jira/browse/SOLR-7082
Project: Solr
Issue Type: New Feature
Components: SolrCloud
Reporter: Joel Bernstein
Fix For: Trunk, 5.1

Attachments: SOLR-7082.patch, SOLR-7082.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5379) Query-time multi-word synonym expansion

2015-02-03 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304161#comment-14304161
]

Otis Gospodnetic commented on SOLR-5379:

bq. I am sure there is, but there are no working patches for 4.10 or 5.x thus
far.

Right. What I was trying to ask is whether any of the active Solr committers
wants to commit this. If there is no will to commit, I'd rather keep things
simple on our end ignore this issue. But there is a will to commit, I'd love
to see this in Solr, as would 30+ other watchers, I imagine.

Query-time multi-word synonym expansion
---

Key: SOLR-5379
URL: https://issues.apache.org/jira/browse/SOLR-5379
Project: Solr
Issue Type: Improvement
Components: query parsers
Reporter: Tien Nguyen Manh
Labels: multi-word, queryparser, synonym
Fix For: 4.9, Trunk

Attachments: conf-test-files-4_8_1.patch, quoted-4_8_1.patch,
quoted.patch, synonym-expander-4_8_1.patch, synonym-expander.patch

While dealing with synonym at query time, solr failed to work with multi-word
synonyms due to some reasons:
- First the lucene queryparser tokenizes user query by space so it split
multi-word term into two terms before feeding to synonym filter, so synonym
filter can't recognized multi-word term to do expansion
- Second, if synonym filter expand into multiple terms which contains
multi-word synonym, The SolrQueryParseBase currently use MultiPhraseQuery to
handle synonyms. But MultiPhraseQuery don't work with term have different
number of words.
For the first one, we can extend quoted all multi-word synonym in user query
so that lucene queryparser don't split it. There are a jira task related to
this one https://issues.apache.org/jira/browse/LUCENE-2605.
For the second, we can replace MultiPhraseQuery by an appropriate BoleanQuery
SHOULD which contains multiple PhraseQuery in case tokens stream have
multi-word synonym.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5379) Query-time multi-word synonym expansion

2015-02-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301433#comment-14301433
 ] 

Otis Gospodnetic commented on SOLR-5379:


Is there any interest in committing this to 4.x or 5.x?  We have a client at 
Sematext who needs query-time synonym support for their Solr 4.x setup.  So we 
can make sure this patch works for 4.x.  If any of the Solr developers wants to 
commit this to 5.x, please leave a comment here.

 Query-time multi-word synonym expansion
 ---

 Key: SOLR-5379
 URL: https://issues.apache.org/jira/browse/SOLR-5379
 Project: Solr
  Issue Type: Improvement
  Components: query parsers
Reporter: Tien Nguyen Manh
  Labels: multi-word, queryparser, synonym
 Fix For: 4.9, Trunk

 Attachments: conf-test-files-4_8_1.patch, quoted-4_8_1.patch, 
 quoted.patch, synonym-expander-4_8_1.patch, synonym-expander.patch


 While dealing with synonym at query time, solr failed to work with multi-word 
 synonyms due to some reasons:
 - First the lucene queryparser tokenizes user query by space so it split 
 multi-word term into two terms before feeding to synonym filter, so synonym 
 filter can't recognized multi-word term to do expansion
 - Second, if synonym filter expand into multiple terms which contains 
 multi-word synonym, The SolrQueryParseBase currently use MultiPhraseQuery to 
 handle synonyms. But MultiPhraseQuery don't work with term have different 
 number of words.
 For the first one, we can extend quoted all multi-word synonym in user query 
 so that lucene queryparser don't split it. There are a jira task related to 
 this one https://issues.apache.org/jira/browse/LUCENE-2605.
 For the second, we can replace MultiPhraseQuery by an appropriate BoleanQuery 
 SHOULD which contains multiple PhraseQuery in case tokens stream have 
 multi-word synonym.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6161) Applying deletes is sometimes dog slow

2015-01-12 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273897#comment-14273897
 ] 

Otis Gospodnetic commented on LUCENE-6161:
--

I'd assume that while merges are now faster, they are using more of the 
computing resources (than before) needed for the rest of what Lucene is doing, 
hence no improvement in overall indexing time.

 Applying deletes is sometimes dog slow
 --

 Key: LUCENE-6161
 URL: https://issues.apache.org/jira/browse/LUCENE-6161
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 5.0, Trunk

 Attachments: LUCENE-6161.patch, LUCENE-6161.patch, LUCENE-6161.patch


 I hit this while testing various use cases for LUCENE-6119 (adding 
 auto-throttle to ConcurrentMergeScheduler).
 When I tested always call updateDocument (each add buffers a delete term), 
 with many indexing threads, opening an NRT reader once per second (forcing 
 all deleted terms to be applied), I see that 
 BufferedUpdatesStream.applyDeletes sometimes seems to take a lng time, 
 e.g.:
 {noformat}
 BD 0 [2015-01-04 09:31:12.597; Lucene Merge Thread #69]: applyDeletes took 
 339 msec for 10 segments, 117 deleted docs, 607333 visited terms
 BD 0 [2015-01-04 09:31:18.148; Thread-4]: applyDeletes took 5533 msec for 62 
 segments, 10989 deleted docs, 8517225 visited terms
 BD 0 [2015-01-04 09:31:21.463; Lucene Merge Thread #71]: applyDeletes took 
 1065 msec for 10 segments, 470 deleted docs, 1825649 visited terms
 BD 0 [2015-01-04 09:31:26.301; Thread-5]: applyDeletes took 4835 msec for 61 
 segments, 14676 deleted docs, 9649860 visited terms
 BD 0 [2015-01-04 09:31:35.572; Thread-11]: applyDeletes took 6073 msec for 72 
 segments, 13835 deleted docs, 11865319 visited terms
 BD 0 [2015-01-04 09:31:37.604; Lucene Merge Thread #75]: applyDeletes took 
 251 msec for 10 segments, 58 deleted docs, 240721 visited terms
 BD 0 [2015-01-04 09:31:44.641; Thread-11]: applyDeletes took 5956 msec for 64 
 segments, 15109 deleted docs, 10599034 visited terms
 BD 0 [2015-01-04 09:31:47.814; Lucene Merge Thread #77]: applyDeletes took 
 396 msec for 10 segments, 137 deleted docs, 719914 visit
 {noformat}
 What this means is even though I want an NRT reader every second, often I 
 don't get one for up to ~7 or more seconds.
 This is on an SSD, machine has 48 GB RAM, heap size is only 2 GB.  12 
 indexing threads.
 As hideously complex as this code is, I think there are some inefficiencies, 
 but fixing them could be hard / make code even hairier ...
 Also, this code is mega-locked: holds IW's lock, holds BD's lock.  It blocks 
 things like merges kicking off or finishing...
 E.g., we pull the MergedIterator many times on the same set of sub-iterators. 
  Maybe we can create the sorted terms up front and reuse that?
 Maybe we should go term stride (one term visits all N segments) not 
 segment stride (visit each segment, iterating all deleted terms for it).  
 Just iterating the terms to be deleted takes a sizable part of the time, and 
 we now do that once for every segment in the index.
 Also, the isUnique bit in LUCENE-6005 should help here, since if we know 
 the field is unique, we can stop seekExact once we found a segment that has 
 the deleted term, we can maybe pass false for removeDuplicates to 
 MergedIterator...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2015-01-05 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated SOLR-6273:
---
Summary: Cross Data Center Replication  (was: Cross Data Center Replicaton)

 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
 Attachments: SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Resolved] (SOLR-6674) Solr webapp deployment is very slow with jmx/ in solrconfig.xml

2014-12-02 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved SOLR-6674.

Resolution: Duplicate

Dupe of SOLR-6675

 Solr webapp deployment is very slow with jmx/ in solrconfig.xml
 -

 Key: SOLR-6674
 URL: https://issues.apache.org/jira/browse/SOLR-6674
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.7
 Environment: Linux Redhat 64bit
Reporter: Forest Soup
Priority: Critical
  Labels: performance

 We have a SolrCloud with Solr version 4.7 with Tomcat 7. And our solr 
 index(cores) are big(50~100G) each core. 
 When we start up tomcat, the solr webapp deployment is very slow. From 
 tomcat's catalina log, every time it takes about 10 minutes to get deployed. 
 After we analyzing java core dump, we notice it's because the loading process 
 cannot finish until the MBean calculation for large index is done.
  
 So we tried to remove the jmx/ from solrconfig.xml, after that, the loading 
 of solr webapp only take about 1 minute. So we can sure the MBean calculation 
 for large index is the root cause.
 Could you please point me if there is any async way to do statistic 
 monitoring without jmx/ in solrconfig.xml, or let it do calculation after 
 the deployment? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6675) Solr webapp deployment is very slow with jmx/ in solrconfig.xml

2014-12-02 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232555#comment-14232555
 ] 

Otis Gospodnetic commented on SOLR-6675:


I've never heard or seen this before.  Have you tried latest Solr 4.10.x?
Which JVM is this on?


 Solr webapp deployment is very slow with jmx/ in solrconfig.xml
 -

 Key: SOLR-6675
 URL: https://issues.apache.org/jira/browse/SOLR-6675
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.7
 Environment: Linux Redhat 64bit
Reporter: Forest Soup
Priority: Critical
  Labels: performance
 Attachments: callstack.png


 We have a SolrCloud with Solr version 4.7 with Tomcat 7. And our solr 
 index(cores) are big(50~100G) each core. 
 When we start up tomcat, the solr webapp deployment is very slow. From 
 tomcat's catalina log, every time it takes about 10 minutes to get deployed. 
 After we analyzing java core dump, we notice it's because the loading process 
 cannot finish until the MBean calculation for large index is done.
  
 So we tried to remove the jmx/ from solrconfig.xml, after that, the loading 
 of solr webapp only take about 1 minute. So we can sure the MBean calculation 
 for large index is the root cause.
 Could you please point me if there is any async way to do statistic 
 monitoring without jmx/ in solrconfig.xml, or let it do calculation after 
 the deployment? Thanks!
 The callstack.png file in the attachment is the call stack of the long 
 blocking thread which is doing statistics calculation.
 The catalina log of tomcat:
 INFO: Starting Servlet Engine: Apache Tomcat/7.0.54
 Oct 13, 2014 2:00:29 AM org.apache.catalina.startup.HostConfig deployWAR
 INFO: Deploying web application archive 
 /opt/ibm/solrsearch/tomcat/webapps/solr.war
 Oct 13, 2014 2:10:23 AM org.apache.catalina.startup.HostConfig deployWAR
 INFO: Deployment of web application archive 
 /opt/ibm/solrsearch/tomcat/webapps/solr.war has finished in 594,325 ms 
  Time taken for solr app Deployment is about 10 minutes 
 ---
 Oct 13, 2014 2:10:23 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deploying web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/manager
 Oct 13, 2014 2:10:26 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deployment of web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/manager has finished in 2,035 ms
 Oct 13, 2014 2:10:26 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deploying web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/examples
 Oct 13, 2014 2:10:27 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deployment of web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/examples has finished in 1,789 ms
 Oct 13, 2014 2:10:27 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deploying web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/docs
 Oct 13, 2014 2:10:28 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deployment of web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/docs has finished in 1,037 ms
 Oct 13, 2014 2:10:28 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deploying web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/ROOT
 Oct 13, 2014 2:10:29 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deployment of web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/ROOT has finished in 948 ms
 Oct 13, 2014 2:10:29 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deploying web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/host-manager
 Oct 13, 2014 2:10:30 AM org.apache.catalina.startup.HostConfig deployDirectory
 INFO: Deployment of web application directory 
 /opt/ibm/solrsearch/tomcat/webapps/host-manager has finished in 951 ms
 Oct 13, 2014 2:10:31 AM org.apache.coyote.AbstractProtocol start
 INFO: Starting ProtocolHandler [http-bio-8080]
 Oct 13, 2014 2:10:31 AM org.apache.coyote.AbstractProtocol start
 INFO: Starting ProtocolHandler [ajp-bio-8009]
 Oct 13, 2014 2:10:31 AM org.apache.catalina.startup.Catalina start
 INFO: Server startup in 601506 ms



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Created] (SOLR-6788) Expose Solr version via MBean

2014-11-24 Thread Otis Gospodnetic (JIRA)

Otis Gospodnetic created SOLR-6788:
--

 Summary: Expose Solr version via MBean
 Key: SOLR-6788
 URL: https://issues.apache.org/jira/browse/SOLR-6788
 Project: Solr
  Issue Type: Improvement
Reporter: Otis Gospodnetic
 Fix For: 4.10.3


Solr should expose its version via an MBean so tools know which version of Solr 
they are talking to.  When MBean structure changes tools depend on this 
information to know which MBeans to look for, how to parse/interpret their 
values, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6766) Switch o.a.s.store.blockcache.Metrics to use JMX

2014-11-24 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223754#comment-14223754
 ] 

Otis Gospodnetic commented on SOLR-6766:


Is this aimed at 4.10.3?

 Switch o.a.s.store.blockcache.Metrics to use JMX
 

 Key: SOLR-6766
 URL: https://issues.apache.org/jira/browse/SOLR-6766
 Project: Solr
  Issue Type: Bug
Reporter: Mike Drob
  Labels: metrics
 Attachments: SOLR-6766.patch, SOLR-6766.patch


 The Metrics class currently reports to hadoop metrics, but it would be better 
 to report to JMX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6053) Serbian Analyzer

2014-11-24 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224016#comment-14224016
 ] 

Otis Gospodnetic commented on LUCENE-6053:
--

Hm, calling this Serbian is a bit limiting - languages from all ex-Yugoslavian 
countries use the *exact-same* diacritic characters (the 
abcčćddžđefghijklljmnnjoprsštuvzž ones, not the Cyrillic ones).  [~nikola] - 
do you think you could reorganize things a bit so isolate Cyrillic part and 
thus make the rest reusable?


 Serbian Analyzer
 

 Key: LUCENE-6053
 URL: https://issues.apache.org/jira/browse/LUCENE-6053
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Nikola Smolenski
 Fix For: 5.0, Trunk

 Attachments: LUCENE-Serbian-1.patch


 This is analyzer for Serbian language, so far consisting only of a 
 normalizer. Serbian language uses both Cyrillic and Latin alphabet, so the 
 normalizer works with both alphabets.
 In the future, I'll see to add stopwords, stemmer and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6058) Solr needs a new website

2014-11-13 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211840#comment-14211840
 ] 

Otis Gospodnetic commented on SOLR-6058:


[~sar...@syr.edu] - the search box at the top goes to 
http://search-lucene.com/lucene?q=foosearchProvider=sl , but it should go to 
http://search-lucene.com/solr?q=monkey (note /lucene = /solr and removal of 
searchProvider=sl which is not needed) .  Do you think you could include this 
little change?

 Solr needs a new website
 

 Key: SOLR-6058
 URL: https://issues.apache.org/jira/browse/SOLR-6058
 Project: Solr
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: HTML.rar, SOLR-6058, SOLR-6058.location-fix.patchfile, 
 SOLR-6058.offset-fix.patch, Solr_Icons.pdf, Solr_Logo_on_black.pdf, 
 Solr_Logo_on_black.png, Solr_Logo_on_orange.pdf, Solr_Logo_on_orange.png, 
 Solr_Logo_on_white.pdf, Solr_Logo_on_white.png, Solr_Styleguide.pdf


 Solr needs a new website:  better organization of content, less verbose, more 
 pleasing graphics, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6717) SolrCloud indexing performance when sending updates to incorrect core is terrible

2014-11-07 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202392#comment-14202392
 ] 

Otis Gospodnetic commented on SOLR-6717:


Here's the full thread: http://search-lucene.com/m/QTPaWzeof

 SolrCloud indexing performance when sending updates to incorrect core is 
 terrible
 -

 Key: SOLR-6717
 URL: https://issues.apache.org/jira/browse/SOLR-6717
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.10.2
Reporter: Shawn Heisey
 Fix For: 5.0, Trunk


 A user on the mailing list was sending document updates to a random node/core 
 in his SolrCloud.  Performance was not scaling anywhere close to what was 
 expected.  Basically, indexing performance was not scaling when adding shards 
 and servers.
 As soon as the user implemented a smart router that was aware of the cloud 
 structure and could send to the proper shard leader, performance scaled 
 exactly as expected.  It's not Java code, so CloudSolrServer was not an 
 option.
 There will always be some overhead involved when sending update requests to 
 the wrong shard replica, but hopefully something can be done about the 
 performance hit.
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201411.mbox/%3CCALswpfDQT4+_eZ6416gMyVHkuhdTYtxXxwxQabR6xeTZ8Lx=t...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-4587) Implement Saved Searches a la ElasticSearch Percolator

2014-11-07 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202585#comment-14202585
 ] 

Otis Gospodnetic commented on SOLR-4587:


bq.  I believe that much of the job in luwak also comes from the realization 
that the number of documents must be reduced prior to looping

That's correct.  In our work with Luwak this is the key.  You can have 1M 
queries, but if you *really* need to run incoming documents against all 1M 
queries expect to have VERY low throughput and VERY HIGH match latencies.  We 
are working with 1-2M queries and reducing those to a few thousand queries with 
Luwak's Presearcher, and still have latencies of a few hundred milliseconds.

 Implement Saved Searches a la ElasticSearch Percolator
 --

 Key: SOLR-4587
 URL: https://issues.apache.org/jira/browse/SOLR-4587
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other, SolrCloud
Reporter: Otis Gospodnetic
 Fix For: Trunk


 Use Lucene MemoryIndex for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6699) To enable SPDY in a SolrCloud setup

2014-11-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195670#comment-14195670
 ] 

Otis Gospodnetic commented on SOLR-6699:


I looked at only about 50-100 top lines in this patch and saw a number of 
versions of various libraries being downgraded, which seems strange, no?

 To enable SPDY in a SolrCloud setup
 ---

 Key: SOLR-6699
 URL: https://issues.apache.org/jira/browse/SOLR-6699
 Project: Solr
  Issue Type: Improvement
Reporter: Harsh Prasad
 Attachments: SOLR-6699.patch


 Solr has lot of inter node communication happening during distributed 
 searching or indexing. Benefits of SPDY is as follows: 
 -Multiple requests can be sent in parallel (multiplexing) and responses can 
 be received out of order.
 -Headers are compressed and optimized.
 This implementation will be using clear-text spdy and not the usual TLS layer 
 spdy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-6699) To enable SPDY in a SolrCloud setup

2014-11-03 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195672#comment-14195672
 ] 

Otis Gospodnetic commented on SOLR-6699:


Is this supposed to bring performance or scalability benefits?  If so, do you 
have any numbers you can share?

 To enable SPDY in a SolrCloud setup
 ---

 Key: SOLR-6699
 URL: https://issues.apache.org/jira/browse/SOLR-6699
 Project: Solr
  Issue Type: Improvement
Reporter: Harsh Prasad
 Attachments: SOLR-6699.patch


 Solr has lot of inter node communication happening during distributed 
 searching or indexing. Benefits of SPDY is as follows: 
 -Multiple requests can be sent in parallel (multiplexing) and responses can 
 be received out of order.
 -Headers are compressed and optimized.
 This implementation will be using clear-text spdy and not the usual TLS layer 
 spdy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Gregory Chanan as Lucene/Solr committer

2014-09-19 Thread Otis Gospodnetic

Congratulations!

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Fri, Sep 19, 2014 at 6:33 PM, Steve Rowe sar...@gmail.com wrote:

 I'm pleased to announce that Gregory Chanan has accepted the PMC's
 invitation to become a committer.

 Gregory, it's tradition that you introduce yourself with a brief bio.

 Mark Miller, the Lucene PMC chair, has already added your gchanan
 account to the “lucene LDAP group, so you now have commit privileges.
 Please test this by adding yourself to the committers section of the Who We
 Are page on the website: http://lucene.apache.org/whoweare.html (use
 the ASF CMS bookmarklet at the bottom of the page here: 
 https://cms.apache.org/#bookmark - more info here 
 http://www.apache.org/dev/cms.html).

 Since you’re a committer on the Apache HBase project, you probably already
 know about it, but I'll include a link to the ASF dev page anyway - lots of
 useful links: http://www.apache.org/dev/.

 Congratulations and welcome!

 Steve


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

Re: [VOTE] Move trunk to Java 8

2014-09-12 Thread Otis Gospodnetic

+8
 

 On Sep 12, 2014, at 11:41 AM, Ryan Ernst r...@iernst.net wrote:
 
 It has been 6 months since Java 8 was released.  It has proven to be
 both stable (no issues like with the initial release of java 7) and
 faster.  And there are a ton of features that would make our lives as
 developers easier (and that can improve the quality of Lucene 5 when
 it is eventually released).
 
 We should stay ahead of the curve, and move trunk to Java 8.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Welcome Tomás Fernández Löbbe as Lucene/Solr committer!

2014-07-31 Thread Otis Gospodnetic

Nice Tomás, welcome!  And probably see you in DC.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/



On Thu, Jul 31, 2014 at 5:50 PM, Yonik Seeley yo...@heliosearch.com wrote:

 I'm pleased to announce that Tomás has accepted the PMC's invitation
 to become a Lucene/Solr committer.

 Tomás, it's tradition to introduce yourself with a little bio.

 Congrats and Welcome!

 -Yonik
 http://heliosearch.org - native code faceting, facet functions,
 sub-facets, off-heap data

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1633 matches

Mail list logo