Re: Potential Slow searching for unified highlighting on Solr 8.8.0/8.8.1

2021-03-04 Thread Ere Maijala

Hi,

Solr uses JIRA for issue tickets. You can find it here: 
https://issues.apache.org/jira/browse/SOLR


I'd suggest filing a new bug issue in the SOLR project (note that 
several other projects also use this JIRA installation). Here's an 
example of an existing highlighter issue for reference: 
https://issues.apache.org/jira/browse/SOLR-14019.


See also some brief documentation:

https://cwiki.apache.org/confluence/display/solr/HowToContribute#HowToContribute-JIRAtips(ourissue/bugtracker)

Regards,
Ere

Flowerday, Matthew J kirjoitti 1.3.2021 klo 14.58:

Hi Ere

Pleased to be of service!

No, I have not filed a JIRA ticket. I am new to interacting with the Solr
Community and only beginning to 'find my legs'. I am not too sure what JIRA
is, I am afraid!

Regards

Matthew

Matthew Flowerday | Consultant | ULEAF
Unisys | 01908 774830| matthew.flower...@unisys.com
Address Enigma | Wavendon Business Park | Wavendon | Milton Keynes | MK17
8LX



THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
MATERIAL and is for use only by the intended recipient. If you received this
in error, please contact the sender and delete the e-mail and its
attachments from all devices.



-Original Message-
From: Ere Maijala 
Sent: 01 March 2021 12:53
To: solr-user@lucene.apache.org
Subject: Re: Potential Slow searching for unified highlighting on Solr
8.8.0/8.8.1

EXTERNAL EMAIL - Be cautious of all links and attachments.

Hi,

Whoa, thanks for the heads-up! You may just have saved me from a whole lot
of trouble. Did you file a JIRA ticket already?

Thanks,
Ere

Flowerday, Matthew J kirjoitti 1.3.2021 klo 14.00:

Hi There

I just came across a situation where a unified highlighting search
under solr 8.8.0/8.8.1 can take over 20 mins to run and eventually times

out.

I resolved it by a config change – but it can catch you out. Hence
this email.

With Solr 8.8.0 a new unified highlighting parameter, hl.fragAlignRatio,
was implemented which, if not set, defaults to 0.5.
This attempts to improve the highlighting so that highlighted text
does not appear right at the left. This works well, but if you have a
search result with numerous occurrences of the word in question within
the record, performance goes right down!

2021-02-27 06:45:03.151 INFO  (qtp762476028-20) [   x:uleaf]
o.a.s.c.S.Request [uleaf]  webapp=/solr path=/select
params={hl.snippets=2=test=on=100=id,d
escription,specification,score=20=*=10&_=161440511913
4}
hits=57008 status=0 QTime=1414320

2021-02-27 06:45:03.245 INFO  (qtp762476028-20) [   x:uleaf]
o.a.s.s.HttpSolrCall Unable to write response, client closed
connection or we are shutting down =>
org.eclipse.jetty.io.EofException

at
org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)

org.eclipse.jetty.io.EofException: null

at
org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
~[jetty-io-9.4.34.v20201102.jar:9.4.34.v20201102]

at
org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
~[jetty-io-9.4.34.v20201102.jar:9.4.34.v20201102]

at
org.eclipse.jetty.io.WriteFlusher.completeWrite(WriteFlusher.java:378)
~[jetty-io-9.4.34.v20201102.jar:9.4.34.v20201102]

when I set hl.fragAlignRatio=0.25 results came back much quicker

2021-02-27 14:59:57.189 INFO  (qtp1291367132-24) [   x:holmes]
o.a.s.c.S.Request [holmes]  webapp=/solr path=/select
params={hl.weightMatches=false=on=id,description,specification,s
core=1=0.25=100=2=test
axAnalyzedChars=100=*=unified=9&_=
1614430061690}
hits=136939 status=0 QTime=87024

And hl.fragAlignRatio=0.1

2021-02-27 15:18:45.542 INFO  (qtp1291367132-19) [   x:holmes]
o.a.s.c.S.Request [holmes]  webapp=/solr path=/select
params={hl.weightMatches=false=on=id,description,specification,s
core=1=0.1=100=2=test
xAnalyzedChars=100=*=unified=9&_=1
614430061690}
hits=136939 status=0 QTime=69033

And hl.fragAlignRatio=0.0

2021-02-27 15:20:38.194 INFO  (qtp1291367132-24) [   x:holmes]
o.a.s.c.S.Request [holmes]  webapp=/solr path=/select
params={hl.weightMatches=false=on=id,description,specification,s
core=1=0.0=100=2=test
xAnalyzedChars=100=*=unified=9&_=1
614430061690}
hits=136939 status=0 QTime=2841

I left our setting at 0.0 – this is presumably how it was in 7.7.1 (fully
left aligned). I am not too sure how many times a word has to
occur in a record for performance to go right down – but if there are too many
it can have a BIG impact.

I also noticed that setting timeAllowed did not break out of
the query until it finished. Perhaps that is because the query itself finished
quickly and what took the time was the highlighting. It might be an
idea for timeAllowed to also cover the highlighting so that the
query does not run until the Jetty timeout is hit. The machine ran
one core at 100% for about 20 mins!

Hope this helps.

Regards

Matthew

*Matthew Flowerday*| Consultant | ULEAF

Unisys | 01908 774830| matthew.flower...@unisys.com
<mailto:matthew.flower...@unisys.com>

Address Enigma | Wavendon Business Park |

Re: Potential Slow searching for unified highlighting on Solr 8.8.0/8.8.1

2021-03-01 Thread Ere Maijala

Hi,

Whoa, thanks for the heads-up! You may just have saved me from a whole 
lot of trouble. Did you file a JIRA ticket already?


Thanks,
Ere

Flowerday, Matthew J kirjoitti 1.3.2021 klo 14.00:

Hi There

I just came across a situation where a unified highlighting search under 
solr 8.8.0/8.8.1 can take over 20 mins to run and eventually times out. 
I resolved it by a config change – but it can catch you out. Hence this 
email.


With Solr 8.8.0 a new unified highlighting parameter, hl.fragAlignRatio,
was implemented which, if not set, defaults to 0.5. This attempts to
improve the highlighting so that highlighted text does not appear right
at the left. This works well, but if you have a search result with
numerous occurrences of the word in question within the record,
performance goes right down!


2021-02-27 06:45:03.151 INFO  (qtp762476028-20) [   x:uleaf] 
o.a.s.c.S.Request [uleaf]  webapp=/solr path=/select 
params={hl.snippets=2=test=on=100=id,description,specification,score=20=*=10&_=1614405119134} 
hits=57008 status=0 QTime=1414320


2021-02-27 06:45:03.245 INFO  (qtp762476028-20) [   x:uleaf] 
o.a.s.s.HttpSolrCall Unable to write response, client closed connection 
or we are shutting down => org.eclipse.jetty.io.EofException


   at 
org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)


org.eclipse.jetty.io.EofException: null

   at 
org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279) 
~[jetty-io-9.4.34.v20201102.jar:9.4.34.v20201102]


   at 
org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422) 
~[jetty-io-9.4.34.v20201102.jar:9.4.34.v20201102]


   at 
org.eclipse.jetty.io.WriteFlusher.completeWrite(WriteFlusher.java:378) 
~[jetty-io-9.4.34.v20201102.jar:9.4.34.v20201102]


when I set hl.fragAlignRatio=0.25 results came back much quicker

2021-02-27 14:59:57.189 INFO  (qtp1291367132-24) [   x:holmes] 
o.a.s.c.S.Request [holmes]  webapp=/solr path=/select 
params={hl.weightMatches=false=on=id,description,specification,score=1=0.25=100=2=test=100=*=unified=9&_=1614430061690} 
hits=136939 status=0 QTime=87024


And hl.fragAlignRatio=0.1

2021-02-27 15:18:45.542 INFO  (qtp1291367132-19) [   x:holmes] 
o.a.s.c.S.Request [holmes]  webapp=/solr path=/select 
params={hl.weightMatches=false=on=id,description,specification,score=1=0.1=100=2=test=100=*=unified=9&_=1614430061690} 
hits=136939 status=0 QTime=69033


And hl.fragAlignRatio=0.0

2021-02-27 15:20:38.194 INFO  (qtp1291367132-24) [   x:holmes] 
o.a.s.c.S.Request [holmes]  webapp=/solr path=/select 
params={hl.weightMatches=false=on=id,description,specification,score=1=0.0=100=2=test=100=*=unified=9&_=1614430061690} 
hits=136939 status=0 QTime=2841


I left our setting at 0.0 – this is presumably how it was in 7.7.1 (fully
left aligned). I am not too sure how many times a word has to
occur in a record for performance to go right down – but if there are too many it
can have a BIG impact.
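
To illustrate the comparison, here is a minimal sketch in Python (using the
requests library) of how we timed the same query with different
hl.fragAlignRatio values. The collection name and field list follow the log
lines above; the host and the remaining parameter values are illustrative
assumptions, not the exact production query.

# Sketch: compare unified highlighting QTime for different hl.fragAlignRatio
# values. Host, rows, fragsize etc. are illustrative assumptions.
import requests

def timed_search(frag_align_ratio):
    params = {
        "q": "test",
        "hl": "on",
        "hl.method": "unified",
        "hl.fl": "*",
        "hl.snippets": 2,
        "hl.fragsize": 100,
        "hl.fragAlignRatio": frag_align_ratio,
        "fl": "id,description,specification,score",
        "rows": 20,
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/holmes/select", params=params)
    resp.raise_for_status()
    # QTime is reported by Solr in the response header (milliseconds)
    return resp.json()["responseHeader"]["QTime"]

for ratio in (0.5, 0.25, 0.1, 0.0):
    print(ratio, timed_search(ratio), "ms")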


I also noticed that setting timeAllowed did not break out of the
query until it finished. Perhaps that is because the query itself finished quickly and
what took the time was the highlighting. It might be an idea for
timeAllowed to also cover the highlighting so that the query does not
run until the Jetty timeout is hit. The machine ran one core at 100% for about
20 mins!


Hope this helps.

Regards

Matthew

*Matthew Flowerday*| Consultant | ULEAF

Unisys | 01908 774830| matthew.flower...@unisys.com 
<mailto:matthew.flower...@unisys.com>


Address Enigma | Wavendon Business Park | Wavendon | Milton Keynes | 
MK17 8LX



THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY 
MATERIAL and is for use only by the intended recipient. If you received 
this in error, please contact the sender and delete the e-mail and its 
attachments from all devices.






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Replica goes into recovery mode in Solr 6.1.0

2020-07-10 Thread Ere Maijala
vishal patel kirjoitti 10.7.2020 klo 12.45:
> Thanks for your input.
> 
> Walter already said that setting soft commit max time to 100 ms is a recipe 
> for disaster
>>> I know that, but our application is already developed and has been running in the live 
>>> environment for the last 5 years. Actually, we want to show data very 
>>> quickly after the insert.
> 
> you have huge JVM heaps without an explanation for the reason
>>> We gave 55GB of RAM because our usage involves large query searches and 
>>> very frequent searching and indexing.
> Here is my memory snapshot which I have taken from GC.

Yes, I can see that a lot of memory is in use, but the question is why.
I assume caches (are they too large?), perhaps uninverted indexes.
DocValues would help with the latter. Do you use them?

> I have tried upgrading Solr from 6.1.0 to 8.5.1 but due to some issues we cannot 
> do it. I have also asked about it here
> https://lucene.472066.n3.nabble.com/Sorting-in-other-collection-in-Solr-8-5-1-td4459506.html#a4459562

You could also try upgrading to the latest version in 6.x series as a
starter.

> Why can we not find the reason for the recovery in the log? Like a memory or CPU issue, 
> frequent indexing or searching, a large query hit?
> My log at the time of recovery
> https://drive.google.com/file/d/1F8Bn7jSXspe2HRelh_vJjKy9DsTRl9h0/view

Isn't it right there on the first lines?

2020-07-09 14:42:43.943 ERROR
(updateExecutor-2-thread-21007-processing-http:11.200.212.305:8983//solr//products
x:products r:core_node1 n:11.200.212.306:8983_solr s:shard1 c:products)
[c:products s:shard1 r:core_node1 x:products]
o.a.s.u.StreamingSolrClients error
org.apache.http.NoHttpResponseException: 11.200.212.305:8983 failed to
respond

followed by a couple more error messages about the same problem and then
initiation of recovery:

2020-07-09 14:42:44.002 INFO  (qtp1239731077-771611) [c:products
s:shard1 r:core_node1 x:products] o.a.s.c.ZkController Put replica
core=products coreNodeName=core_node3 on 11.200.212.305:8983_solr into
leader-initiated recovery.

So the node in question isn't responding quickly enough to http requests
and gets put into recovery. The log for the recovering node starts too
late, so I can't say anything about what happened before 14:42:43.943
that led to recovery.

--Ere

> 
> 
> From: Ere Maijala 
> Sent: Friday, July 10, 2020 2:10 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
> 
> Walter already said that setting soft commit max time to 100 ms is a
> recipe for disaster. That alone can be the issue, but if you're not
> willing to try higher values, there's no way of being sure. And you have
> huge JVM heaps without an explanation for the reason. If those do not
> cause problems, you indicated that you also run some other software on
> the same server. Is it possible that the other processes hog CPU, disk
> or network and starve Solr?
> 
> I must add that Solr 6.1.0 is over four years old. You could be hitting
> a bug that has been fixed for years, but even if you encounter an issue
> that's still present, you will need to upgrade to get it fixed. If you
> look at the number of fixes done in subsequent 6.x versions alone in the
> changelog (https://lucene.apache.org/solr/8_5_1/changes/Changes.html)
> you'll see that there are a lot of them. You could be hitting something
> like SOLR-10420, which has been fixed for over three years.
> 
> Best,
> Ere
> 
> vishal patel kirjoitti 10.7.2020 klo 7.52:
>> I’ve been running Solr for a dozen years and I’ve never needed a heap larger 
>> than 8 GB.
>>>> What is your data size? Is it the same as ours, 1 TB? Do you search or index 
>>>> frequently? NRT model?
>>
>> My question is why the replica is going into recovery. When the replica went down, I 
>> checked the GC log but the GC pause was not more than 2 seconds.
>> Also, I cannot find any reason for the recovery in the Solr log file. I want 
>> to know why the replica goes into recovery.
>>
>> Regards,
>> Vishal Patel
>> 
>> From: Walter Underwood 
>> Sent: Friday, July 10, 2020 3:03 AM
>> To: solr-user@lucene.apache.org 
>> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
>>
>> Those are extremely large JVMs. Unless you have proven that you MUST
>> have 55 GB of heap, use a smaller heap.
>>
>> I’ve been r

Re: Replica goes into recovery mode in Solr 6.1.0

2020-07-10 Thread Ere Maijala
te or blocked?
>>>
>>> "-Dsolr.autoSoftCommit.maxTime=100” is way too short (100 ms).
>>>>> Our requirement is NRT, so we keep the time short
>>>
>>> Regards,
>>> Vishal Patel
>>> 
>>> From: Walter Underwood 
>>> Sent: Tuesday, July 7, 2020 8:15 PM
>>> To: solr-user@lucene.apache.org 
>>> Subject: Re: Replica goes into recovery mode in Solr 6.1.0
>>>
>>> This isn’t a support list, so nobody looks at issues. We do try to help.
>>>
>>> It looks like you have 1 TB of index on a system with 320 GB of RAM.
>>> I don’t know what "Shard1 Allocated memory” is, but maybe half of
>>> that RAM is used by JVMs or some other process, I guess. Are you
>>> running multiple huge JVMs?
>>>
>>> The servers will be doing a LOT of disk IO, so look at the read and
>>> write iops. I expect that the solr processes are blocked on disk reads
>>> almost all the time.
>>>
>>> "-Dsolr.autoSoftCommit.maxTime=100” is way too short (100 ms).
>>> That is probably causing your outages.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>> On Jul 7, 2020, at 5:18 AM, vishal patel  
>>>> wrote:
>>>>
>>>> Is anyone looking at my issue? Please guide me.
>>>>
>>>> Regards,
>>>> Vishal Patel
>>>>
>>>>
>>>> 
>>>> From: vishal patel 
>>>> Sent: Monday, July 6, 2020 7:11 PM
>>>> To: solr-user@lucene.apache.org 
>>>> Subject: Replica goes into recovery mode in Solr 6.1.0
>>>>
>>>> I am using Solr version 6.1.0, Java 8 and G1GC in production. We 
>>>> have 2 shards and each shard has 1 replica. We have 3 collections.
>>>> We do not use any caches; they are also disabled in solrconfig.xml. Search and 
>>>> update requests come in frequently on our live platform.
>>>>
>>>> *Our commit configuration in solrconfig.xml is below
>>>> <autoCommit>
>>>>   <maxTime>60</maxTime>
>>>>   <maxDocs>2</maxDocs>
>>>>   <openSearcher>false</openSearcher>
>>>> </autoCommit>
>>>> <autoSoftCommit>
>>>>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>> </autoSoftCommit>
>>>>
>>>> *We used Near Real Time Searching So we did below configuration in 
>>>> solr.in.cmd
>>>> set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=100
>>>>
>>>> *Our collections details are below:
>>>>
>>>> Collection    Shard1                Shard1 Replica        Shard2                Shard2 Replica
>>>>               Docs        Size(GB)  Docs        Size(GB)  Docs        Size(GB)  Docs        Size(GB)
>>>> collection1   26913364    201       26913379    202       26913380    198       26913379    198
>>>> collection2   13934360    310       13934367    310       13934368    219       13934367    219
>>>> collection3   351539689   73.5      351540040   73.5      351540136   75.2      351539722   75.2
>>>>
>>>> *My server configurations are below:
>>>>
>>>>                                           Server1             Server2
>>>> CPU                                       Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 2301 Mhz,
>>>>                                           10 Core(s), 20 Logical Processor(s) (same on both servers)
>>>> HardDisk(GB)                              3845 (3.84 TB)      3485 (3.48 TB)
>>>> Total memory(GB)                          320                 320
>>>> Shard1 Allocated memory(GB)               55
>>>> Shard2 Replica Allocated memory(GB)       55
>>>> Shard2 Allocated memory(GB)               55
>>>> Shard1 Replica Allocated memory(GB)       55
>>>> Other Applications Allocated Memory(GB)   60                  22
>>>> Other Number Of Applications              11                  7
>>>>
>>>>
>>>> Sometimes one of the replicas goes into recovery mode. Why does a replica go into 
>>>> recovery? Due to heavy search, heavy update/insert, or long GC pause 
>>>> time? If it is one of these, what should we change in the configuration?
>>>> Should we increase the number of shards to address the recovery issue?
>>>>
>>>> Regards,
>>>> Vishal Patel
>>>>
>>>
>>
> 
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Exact match

2019-12-04 Thread Ere Maijala
Hi,

Here's our example of exact match fields:

https://github.com/NatLibFi/finna-solr/blob/master/vufind/biblio/conf/schema.xml#L48

textProper_l requires a partial match from the beginning. textProper_lr
requires a full match. I'm not sure if this works for you, but at least
we have this creative use of PathHierarchyTokenizerFactory allowing the
left-anchored search.

HTH,
Ere
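
If you go with the simpler route suggested further down in this thread
(indexing the whole title as a single lowercased token), here is a minimal
sketch of adding such a field type and field through the Schema API from
Python. The type name, field names and collection URL are made up for the
example, not taken from any real configuration.

# Sketch: single-token, lowercased "exact match" field via the Schema API.
# All names below are hypothetical placeholders.
import requests

SCHEMA_URL = "http://localhost:8983/solr/mycollection/schema"

field_type = {
    "add-field-type": {
        "name": "string_exact_ci",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.KeywordTokenizerFactory"},
            "filters": [{"class": "solr.LowerCaseFilterFactory"}],
        },
    }
}
field = {
    "add-field": {
        "name": "title_exact",
        "type": "string_exact_ci",
        "indexed": True,
        "stored": False,
    }
}
copy_field = {"add-copy-field": {"source": "title", "dest": "title_exact"}}

for payload in (field_type, field, copy_field):
    requests.post(SCHEMA_URL, json=payload).raise_for_status()

After reindexing, an exact, case-insensitive whole-field match is then just a
filter like fq=title_exact:"united states of america".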

Paras Lehana kirjoitti 3.12.2019 klo 13.49:
> Hi Omer,
> 
> If you mean an exact match with the same number of words (Emir's interpretation), you can also
> add an identifier at the beginning and end of some other field like
> title_exact. This can be done in your indexing script or using Pattern
> Replace. On the query side, you can use this identifier. For example,
> indexing "united states" as "exactStart united states exactEnd" and
> querying with the same. Obviously, you can have scoring issues here, so only
> use this for debugging or retrieving docs.
> 
> Just adding to all the possible ways. *Anyway, I like the Keyword method.*
> 
> On Tue, 3 Dec 2019 at 03:59, Erick Erickson  wrote:
> 
>> There are two different interpretations of “exact match” going on here,
>> don’t be confused!
>>
>> Emir’s version is “the text has to match the _entire_ input. So a field
>> with “a b c d” will NOT match “a b” or “a b c” or “b c", but only “a b c d”.
>>
>> David’s version is “The text has to contain some sequence of words that
>> exactly matches my query”, so a field with “a b c d” _would_ match “a b”,
>> “a b c”, “a b c d”, “b c”, “c d”, etc.
>>
>> Both are entirely valid use-cases, depending on what you mean by “exact
>> match"
>>
>> Best,
>> Erick
>>
>>> On Dec 2, 2019, at 4:38 PM, Emir Arnautović <
>> emir.arnauto...@sematext.com> wrote:
>>>
>>> Hi Omer,
>>> From performance perspective, it is the best if you index title as a
>> single token: KeywordTokenizer + LowerCaseFilter
>>>
>>> If you need to query that field in some other way, you can index it
>> differently as some other field using copyField.
>>>
>>> HTH,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>>
>>>> On 2 Dec 2019, at 21:43, OTH  wrote:
>>>>
>>>> Hello,
>>>>
>>>> What would be the best way to get exact matches (if any) to a query?
>>>>
>>>> E.g.:  Let's say the document text is:  "united states of america".
>>>> Currently, any query containing one or more of the three words "united",
>>>> "states", or "america" will match the above document.  I would
>> like a
>>>> way so that the document matches if and only if the query is also
>>>> "united states of america" (case-insensitive).
>>>>
>>>> Document field type:  TextField
>>>> Index Analyzer: TokenizerChain
>>>> Index Tokenizer: StandardTokenizerFactory
>>>> Index Token Filters: StopFilterFactory, LowerCaseFilterFactory,
>>>> SnowballPorterFilterFactory
>>>> The Query Analyzer / Tokenizer / Token Filters are the same as the Index
>>>> ones above.
>>>>
>>>> FYI I'm relatively novice at Solr / Lucene / Search.
>>>>
>>>> Much appreciated
>>>> Omer
>>>
>>
>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: NRT vs TLOG bulk indexing performances

2019-10-25 Thread Ere Maijala
Shawn Heisey kirjoitti 25.10.2019 klo 14.54:
> With newer Solr versions, you can ask SolrCloud to prefer PULL replicas
> for querying, so queries will be targeted to those replicas, unless they
> all go down, in which case it will go to non-preferred replica types.  I
> do not know how to do this, I only know that it is possible.
It's controlled by the shards.preference parameter. Docs:

https://lucene.apache.org/solr/guide/8_2/distributed-requests.html#shards-preference-parameter

It also allows one to prefer certain replica locations. This could be
useful e.g. if you want to avoid the indexing server handling queries.
It can also be used to prefer local replicas to minimize network access.
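
As a sketch, the parameter can simply be added to the query, for example from
Python; the host and collection name below are placeholders:

# Sketch: prefer PULL replicas for searching; Solr falls back to other
# replica types automatically if no PULL replica is available.
import requests

params = {
    "q": "*:*",
    "rows": 10,
    "shards.preference": "replica.type:PULL",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycollection/select", params=params)
print(resp.json()["response"]["numFound"])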

--Ere

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Throughput does not increase in spite of low CPU usage

2019-09-30 Thread Ere Maijala
Just a side note: -Xmx32G is really bad for performance as it forces
Java to use non-compressed pointers. You'll actually get better results
with -Xmx31G. For more information, see e.g.
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/

Regards,
Ere

Yasufumi Mizoguchi kirjoitti 30.9.2019 klo 11.05:
> Hi, Deepak.
> Thank you for replying me.
> 
> JVM settings from solr.in.sh file are as follows. (Sorry, I could not show
> all due to our policy)
> 
> -verbose:gc
> -XX:+PrintHeapAtGC
> -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps
> -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution
> -XX:+PrintGCApplicationStoppedTime
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.port=18983
> -XX:OnOutOfMemoryError=/home/solr/solr-6.2.1/bin/oom_solr.sh
> -XX:NewSize=128m
> -XX:MaxNewSize=128m
> -XX:+UseG1GC
> -XX:+PerfDisableSharedMem
> -XX:+ParallelRefProcEnabled
> -XX:G1HeapRegionSize=8m
> -XX:MaxGCPauseMillis=250
> -XX:InitiatingHeapOccupancyPercent=75
> -XX:+UseLargePages
> -XX:+AggressiveOpts
> -Xmx32G
> -Xms32G
> -Xss256k
> 
> 
> Thanks & Regards
> Yasufumi.
> 
> 2019年9月30日(月) 16:12 Deepak Goel :
> 
>> Hello
>>
>> Can you please share the JVM heap settings in detail?
>>
>> Deepak
>>
>> On Mon, 30 Sep 2019, 11:15 Yasufumi Mizoguchi, 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying some tests to confirm if single Solr instance can perform
>> over
>>> 1000 queries per second(!).
>>>
>>> But now, although CPU usage is 40% or so and iowait is almost 0%,
>>> throughput does not increase over 60 queries per second.
>>>
>>> I think there are some bottlenecks around Kernel, JVM, or Solr settings.
>>>
>>> The values we already checked and configured are followings.
>>>
>>> * Kernel:
>>> file descriptor
>>> net.ipv4.tcp_max_syn_backlog
>>> net.ipv4.tcp_syncookies
>>> net.core.somaxconn
>>> net.core.rmem_max
>>> net.core.wmem_max
>>> net.ipv4.tcp_rmem
>>> net.ipv4.tcp_wmem
>>>
>>> * JVM:
>>> Heap [ -> 32GB]
>>> G1GC settings
>>>
>>> * Solr:
>>> (Jetty) MaxThreads [ -> 2]
>>>
>>>
>>> And the other info is as follows.
>>>
>>> CPU : 16 cores
>>> RAM : 128 GB
>>> Disk : SSD 500GB
>>> NIC : 10Gbps(maybe)
>>> OS : Ubuntu 14.04
>>> JVM : OpenJDK 1.8.0u191
>>> Solr : 6.2.1
>>> Index size : about 60GB
>>>
>>> Any insights will be appreciated.
>>>
>>> Thanks and regards,
>>> Yasufumi.
>>>
>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: 8.2.0 After changing replica types, state.json is wrong and replication no longer takes place

2019-08-23 Thread Ere Maijala
Hi,

We've had PULL replicas stop replicating a couple of times in Solr 7.x.
Restarting Solr has got it going again. No errors in logs, and I've been
unable to reproduce the issue at will. At least once it happened when I
reloaded a collection, but other times that hasn't caused any issues.

I'll make a note to check state.json next time we encounter the
situation to see if I can see what you reported.

Regards,
Ere

Markus Jelsma kirjoitti 22.8.2019 klo 16.36:
> Hello,
> 
> There is a newly created 8.2.0 all NRT type cluster for which i replaced each 
> NRT replica with a TLOG type replica. Now, the replicas no longer replicate 
> when the leader receives data. The situation is odd, because some shard 
> replicas kept replicating up until eight hours ago, another one (same 
> collection, same node) seven hours, and even another one four hours!
> 
> I inspected state.json to see what might be wrong, and compare it with 
> another fully working, but much older, 8.2.0 all TLOG collection.
> 
> The faulty one still lists, probably from when it was created:
> "nrtReplicas":"2",
> "tlogReplicas":"0"
> "pullReplicas":"0",
> "replicationFactor":"2",
> 
> The working collection only has:
> "replicationFactor":"1",
> 
> What actually could cause this new collection to start replicating when i 
> delete the data directory, but later on stop replicating at some random time, 
> which is different for each shard.
> 
> Is there something i should change in state.json, and can it just be 
> reuploaded to ZK?
> 
> Thanks,
> Markus
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Solr cloud questions

2019-08-16 Thread Ere Maijala
Does your web application, by any chance, allow deep paging or something
like that which requires returning rows at the end of a large result
set? Something like a query where you could have parameters like
rows=10&start=1000000? That can easily cause OOM with Solr when using
a sharded index. It would typically require a large number of rows to be
returned and combined from all shards just to get the few rows to return
in the correct order.

For the above example with 8 shards, Solr would have to fetch 1 000 010
rows from each shard. That's over 8 million rows! Even if it's just
identifiers, that's a lot of memory required for an operation that seems
so simple from the surface.

If this is the case, you'll need to prevent the web application from
issuing such queries. This may mean something like supporting paging
only among the first 10 000 results. Typical requirement may also be to
be able to see the last results of a query, but this can be accomplished
by allowing sorting in both ascending and descending order.
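
A sketch of what that guard might look like in the web application, assuming
the usual start/rows parameters; the 10 000 limit is just the example figure
from above:

# Sketch: cap paging depth before the query ever reaches Solr.
MAX_WINDOW = 10000  # deepest result position allowed (start + rows)

def paging_params(page, page_size):
    start = page * page_size
    if start + page_size > MAX_WINDOW:
        raise ValueError("Requested page is too deep; refine the query "
                         "or reverse the sort order instead.")
    return {"start": start, "rows": page_size}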

Regards,
Ere

Kojo kirjoitti 14.8.2019 klo 16.20:
> Shawn,
> 
> Only my web application accesses this Solr. At a first look at the HTTP server
> logs I didn't find anything different.  Sometimes I have a very big crawler
> accessing my servers; that was my first bet.
> 
> No scheduled crons running at this time too.
> 
> I think that I will reconfigure my boxes with two Solr nodes each instead
> of four and increase the heap to 16GB. This box only runs Solr and has 64GB.
> Each Solr will use 16GB and the box will still have 32GB for the OS. What
> do you think?
> 
> This is a production server, so I will plan to migrate.
> 
> Regards,
> Koji
> 
> 
> Em ter, 13 de ago de 2019 às 12:58, Shawn Heisey 
> escreveu:
> 
>> On 8/13/2019 9:28 AM, Kojo wrote:
>>> Here are the last two gc logs:
>>>
>>>
>> https://send.firefox.com/download/6cc902670aa6f7dd/#Ee568G9vUtyK5zr-nAJoMQ
>>
>> Thank you for that.
>>
>> Analyzing the 20MB gc log actually looks like a pretty healthy system.
>> That log covers 58 hours of runtime, and everything looks very good to me.
>>
>> https://www.dropbox.com/s/yu1pyve1bu9maun/gc-analysis-kojo.png?dl=0
>>
>> But the small log shows a different story.  That log only covers a
>> little more than four minutes.
>>
>> https://www.dropbox.com/s/vkxfoihh12brbnr/gc-analysis-kojo2.png?dl=0
>>
>> What happened at approximately 10:55:15 PM on the day that the smaller
>> log was produced?  Whatever happened caused Solr's heap usage to
>> skyrocket and require more than 6GB.
>>
>> Thanks,
>> Shawn
>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Solr Geospatial Polygon Indexing/Querying Issue

2019-07-25 Thread Ere Maijala
Oops, sorry! Don't know how I missed that.

Have you tested if it makes any difference if you put the sfield
parameter inside the fq like in the example
(https://lucene.apache.org/solr/guide/8_1/spatial-search.html#geofilt)?
We actually put pt and d in there too, e.g.

{!geofilt+sfield%3Dlocation_geo+pt%3D61.2%2C24.9+d%3D1}
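
For example, from Python the whole filter can be passed un-encoded and the
client takes care of escaping. The field name location_geo matches the example
above; the point, distance, host and collection are just sample values:

# Sketch: geofilt with sfield/pt/d inside the fq itself.
import requests

params = {
    "q": "*:*",
    "fq": "{!geofilt sfield=location_geo pt=61.2,24.9 d=1}",
    "fl": "id",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycollection/select", params=params)
print(resp.json()["response"]["numFound"])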

--Ere

Sanders, Marshall (CAI - Atlanta) kirjoitti 24.7.2019 klo 16.33:
> My example query has d=1 as the first parameter, so none of the results 
> should be coming back, but they are, which makes it seem like it's not doing 
> any geofiltering for some reason.
> 
> On 7/24/19, 2:06 AM, "Ere Maijala"  wrote:
> 
> I think you might be missing the d parameter in geofilt. I'm not sure if
> geofilt actually does anything useful without it.
> 
> Regards,
> Ere
> 
> Sanders, Marshall (CAI - Atlanta) kirjoitti 23.7.2019 klo 21.32:
> > We’re trying to index a polygon into solr and then filter/calculate 
> geodist on the polygon (ideally we actually want a circle, but it looks like 
> that’s not really supported officially by wkt/geojson and instead you have to 
> switch format=”legacy” which seems like something that might be removed in 
> the future so don’t want to rely on it).
> > 
> > Here’s the info from schema:
> >  multiValued="true"/>
> > 
> >  class="solr.SpatialRecursivePrefixTreeFieldType"
> >geo="true" distErrPct="0.025" maxDistErr="0.09" 
> distanceUnits="kilometers"
> > spatialContextFactory="Geo3D"/>
> > 
> > 
> > We’ve tried indexing some different data, but to keep it as simple as 
> possible we started with a triangle (will eventually add more points to 
> approximate a circle).  Here’s an example document that we’ve added just for 
> testing:
> > 
> > {
> > "latlng": ["POLYGON((33.7942704 -84.4412613, 33.7100611 -84.4028091, 
> 33.7802888 -84.3279648, 33.7942704 -84.4412613))"],
> > "ID": "284598223"
> > }
> > 
> > 
> > However, it seems like filtering/distance calculations aren’t working 
> (at least not the way we are used to doing it for points).  Here’s an example 
> query where the pt is several hundred kilometers away from the polygon, yet 
> the document still returns.  Also, it seems that regardless of origin point 
> or polygon location the calculated geodist is always 20015.115
> > 
> > Example query:
> > 
> select?d=1&fl=ID,latlng,geodist()&fq=%7B!geofilt%7D&indent=on&pt=33.9798087,-94.3286133&q=*:*&sfield=latlng&wt=json
> > 
> > Example documents coming back anyway:
> > "docs": [
> > {
> > "latlng": ["POLYGON((33.7942704 -84.4412613, 33.7100611 -84.4028091, 
> 33.7802888 -84.3279648, 33.7942704 -84.4412613))"],
> > "ID": "284598223",
> > "geodist()": 20015.115
> > },
> > {
> > "latlng": ["POLYGON((33.7942704 -84.4412613, 33.7100611 -84.4028091, 
> 33.7802888 -84.3279648, 33.7942704 -84.4412613))"],
> > "ID": "284600596",
> > "geodist()": 20015.115
>     > }
> > ]
> > 
> > 
> > Anyone who has experience in this area can you point us in the right 
> direction about what we’re doing incorrectly with either how we are indexing 
> the data and/or how we are querying against the polygons.
> > 
> > Thank you,
> > 
> > 
> > --
> > Marshall Sanders
> > Principal Software Engineer
> > Autotrader.com
> > 
> marshall.sande...@coxautoinc.com<mailto:marshall.sande...@coxautoinc.com>
> > 
> > 
> 
> -- 
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
> 
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Solr Geospatial Polygon Indexing/Querying Issue

2019-07-24 Thread Ere Maijala
I think you might be missing the d parameter in geofilt. I'm not sure if
geofilt actually does anything useful without it.

Regards,
Ere

Sanders, Marshall (CAI - Atlanta) kirjoitti 23.7.2019 klo 21.32:
> We’re trying to index a polygon into solr and then filter/calculate geodist 
> on the polygon (ideally we actually want a circle, but it looks like that’s 
> not really supported officially by wkt/geojson and instead you have to switch 
> format=”legacy” which seems like something that might be removed in the 
> future so don’t want to rely on it).
> 
> Here’s the info from schema:
>  multiValued="true"/>
> 
>  class="solr.SpatialRecursivePrefixTreeFieldType"
>geo="true" distErrPct="0.025" maxDistErr="0.09" 
> distanceUnits="kilometers"
> spatialContextFactory="Geo3D"/>
> 
> 
> We’ve tried indexing some different data, but to keep it as simple as 
> possible we started with a triangle (will eventually add more points to 
> approximate a circle).  Here’s an example document that we’ve added just for 
> testing:
> 
> {
> "latlng": ["POLYGON((33.7942704 -84.4412613, 33.7100611 -84.4028091, 
> 33.7802888 -84.3279648, 33.7942704 -84.4412613))"],
> "ID": "284598223"
> }
> 
> 
> However, it seems like filtering/distance calculations aren’t working (at 
> least not the way we are used to doing it for points).  Here’s an example 
> query where the pt is several hundred kilometers away from the polygon, yet 
> the document still returns.  Also, it seems that regardless of origin point 
> or polygon location the calculated geodist is always 20015.115
> 
> Example query:
> select?d=1&fl=ID,latlng,geodist()&fq=%7B!geofilt%7D&indent=on&pt=33.9798087,-94.3286133&q=*:*&sfield=latlng&wt=json
> 
> Example documents coming back anyway:
> "docs": [
> {
> "latlng": ["POLYGON((33.7942704 -84.4412613, 33.7100611 -84.4028091, 
> 33.7802888 -84.3279648, 33.7942704 -84.4412613))"],
> "ID": "284598223",
> "geodist()": 20015.115
> },
> {
> "latlng": ["POLYGON((33.7942704 -84.4412613, 33.7100611 -84.4028091, 
> 33.7802888 -84.3279648, 33.7942704 -84.4412613))"],
> "ID": "284600596",
> "geodist()": 20015.115
> }
> ]
> 
> 
> Anyone who has experience in this area can you point us in the right 
> direction about what we’re doing incorrectly with either how we are indexing 
> the data and/or how we are querying against the polygons.
> 
> Thank you,
> 
> 
> --
> Marshall Sanders
> Principal Software Engineer
> Autotrader.com
> marshall.sande...@coxautoinc.com<mailto:marshall.sande...@coxautoinc.com>
> 
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Ignore accent in a request

2019-02-12 Thread Ere Maijala
I'm not brave enough to try char filter with such a large table, so I
can't really comment on that. I gave up with char filter after running
into some trouble handling cyrillic letters. At least ICUFoldingFilter
is really simple to use, and with more recent Solr versions you can also
use it with MappingCharFilter if necessary by defining a filter that
leaves given characters alone (see
https://lucene.apache.org/solr/guide/7_6/filter-descriptions.html#FilterDescriptions-ICUFoldingFilter
instead of the previous link I posted for up to date documentation).
Here's the real life configuration we use:

https://github.com/NatLibFi/finna-solr/blob/master/vufind/biblio/conf/schema.xml#L6
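
As a minimal sketch (not the linked configuration), an accent-insensitive text
type could be added through the Schema API like this; the type name and Solr
URL are placeholders, and ICUFoldingFilterFactory requires the ICU
analysis-extras libraries to be available:

# Sketch: text field type that folds accents/diacritics with ICUFoldingFilter.
# ICUFoldingFilter also lowercases, so no separate LowerCaseFilter is needed.
import requests

payload = {
    "add-field-type": {
        "name": "text_folded",
        "class": "solr.TextField",
        "positionIncrementGap": "100",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.ICUFoldingFilterFactory"},
            ],
        },
    }
}
requests.post("http://localhost:8983/solr/mycollection/schema",
              json=payload).raise_for_status()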

--Ere

elisabeth benoit kirjoitti 11.2.2019 klo 11.37:
> Thanks for the hint. We've been using the char filter for full unidecode
> normalization. Is the ICUFoldingFilter supposed to be faster? Or just
> simpler to use?
> 
> Le lun. 11 févr. 2019 à 09:58, Ere Maijala  a
> écrit :
> 
>> Please note that mapping characters works well for a small set of
>> characters, but if you want full UNICODE normalization, take a look at
>> the ICUFoldingFilter:
>>
>> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ICUFoldingFilter
>>
>> --Ere
>>
>> elisabeth benoit kirjoitti 8.2.2019 klo 22.47:
>>> yes you do
>>>
>>> and use the char filter at index and query time
>>>
>>> Le ven. 8 févr. 2019 à 19:20, SAUNIER Maxence  a
>> écrit :
>>>
>>>> For the charFilter, I need to reindex all documents ?
>>>>
>>>> -Message d'origine-
>>>> De : Erick Erickson 
>>>> Envoyé : vendredi 8 février 2019 18:03
>>>> À : solr-user 
>>>> Objet : Re: Ignore accent in a request
>>>>
>>>> Elisabeth's suggestion is spot on for the accent.
>>>>
>>>> One other thing I noticed. You are using KeywordTokenizerFactory
>> combined
>>>> with EdgeNGramFilterFactory. This implies that you can't search for
>>>> individual _words_, only prefix queries, i.e.
>>>> je
>>>> je s
>>>> je su
>>>> je sui
>>>> je suis
>>>>
>>>> You can't search for "suis" for instance.
>>>>
>>>> basically this is an efficient way to search anything starting with
>>>> three-or-more letter prefixes at the expense of index size. You might be
>>>> better off just using wildcards (restrict to three letters at the prefix
>>>> though).
>>>>
>>>> This is perfectly valid, I'm mostly asking if it's your intent.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Fri, Feb 8, 2019 at 9:35 AM SAUNIER Maxence 
>> wrote:
>>>>>
>>>>> Thanks you !
>>>>>
>>>>> -Message d'origine-
>>>>> De : elisabeth benoit  Envoyé : vendredi 8
>>>>> février 2019 14:12 À : solr-user@lucene.apache.org Objet : Re: Ignore
>>>>> accent in a request
>>>>>
>>>>> Hello,
>>>>>
>>>>> We use solr 7 and use
>>>>>
>>>>> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>>>>
>>>>> with mapping-ISOLatin1Accent.txt
>>>>>
>>>>> containing lines like
>>>>>
>>>>> # À => A
>>>>> "\u00C0" => "A"
>>>>>
>>>>> # Á => A
>>>>> "\u00C1" => "A"
>>>>>
>>>>> # Â => A
>>>>> "\u00C2" => "A"
>>>>>
>>>>> # Ã => A
>>>>> "\u00C3" => "A"
>>>>>
>>>>> # Ä => A
>>>>> "\u00C4" => "A"
>>>>>
>>>>> # Å => A
>>>>> "\u00C5" => "A"
>>>>>
>>>>> # Ā Ă Ą =>
>>>>> "\u0100" => "A"
>>>>> "\u0102" => "A"
>>>>> "\u0104" => "A"
>>>>>
>>>>> # Æ => AE
>>>>> "\u00C6" => "AE"
>>>>>
>>>>> # Ç => C
>>>>> "\u00C7" => "C"
>>>>>
>>>>> # é => e
>>>>> "\u00E9" => "e"
>>>>>
&

Re: Ignore accent in a request

2019-02-11 Thread Ere Maijala
bject
>>>>
>>>> etc.
>>>>
>>>> Because mm=757 looks really wrong. From the docs:
>>>> Defines the minimum number of clauses that must match, regardless of
>>>> how many clauses there are in total.
>>>>
>>>> edismax is used much more than dismax as it's more flexible, but
>>>> that's not germane here.
>>>>
>>>> finally, try adding debug=query to the url to see exactly how the
>>>> query is parsed.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence 
>> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> How can I ignore accent in the query result ?
>>>>>
>>>>> Request :
>>>>> http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
>>>>>
>>>>> I want to get docs matching both avarié and avarie.
>>>>>
>>>>> I have add this in my schema :
>>>>>
>>>>>   {
>>>>> "name": "string",
>>>>> "positionIncrementGap": "100",
>>>>> "analyzer": {
>>>>>   "filters": [
>>>>> {
>>>>>   "class": "solr.LowerCaseFilterFactory"
>>>>> },
>>>>> {
>>>>>   "class": "solr.ASCIIFoldingFilterFactory"
>>>>> },
>>>>> {
>>>>>   "class": "solr.EdgeNGramFilterFactory",
>>>>>   "minGramSize": "3",
>>>>>   "maxGramSize": "50"
>>>>> }
>>>>>   ],
>>>>>   "tokenizer": {
>>>>> "class": "solr.KeywordTokenizerFactory"
>>>>>   }
>>>>> },
>>>>> "stored": true,
>>>>> "indexed": true,
>>>>> "sortMissingLast": true,
>>>>> "class": "solr.TextField"
>>>>>   },
>>>>>
>>>>> But it is not working.
>>>>>
>>>>> Thanks.
>>>>
>>
> 

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: **SPAM** Re: SolrCloud scaling/optimization for high request rate

2018-11-12 Thread Ere Maijala
From what I've gathered and what's been my experience docValues should 
be enabled, but if you can't think of anything else, I'd try turning 
them off to see if it makes any difference. As far as I can recall 
turning them off will increase usage of Solr's own caches and that 
caused noticeable slowdown for us, but your mileage may vary.


--Ere

Sofiya Strochyk kirjoitti 12.11.2018 klo 14.23:
Thanks for the suggestion Ere. It looks like they are actually enabled; 
in schema file the field is only marked as stored (field name="_id" 
type="string" multiValued="false" indexed="true" required="true" 
stored="true") but the admin UI shows DocValues as enabled, so I guess 
this is by default. Is the solution to add "docValues=false" in the schema?



On 12.11.18 10:43, Ere Maijala wrote:

Sofiya,

Do you have docValues enabled for the id field? Apparently that can 
make a significant difference. I'm failing to find the relevant 
references right now, but just something worth checking out.


Regards,
Ere

Sofiya Strochyk kirjoitti 6.11.2018 klo 16.38:

Hi Toke,

sorry for the late reply. The query i wrote here is edited to hide 
production details, but I can post additional info if this helps.


I have tested all of the suggested changes none of these seem to make 
a noticeable difference (usually response time and other metrics 
fluctuate over time, and the changes caused by different parameters 
are smaller than the fluctuations). What this probably means is that 
the heaviest task is retrieving IDs by query and not fields by ID. 
I've also checked QTime logged for these types of operations, and it 
is much higher for "get IDs by query" than for "get fields by IDs 
list". What could be done about this?


On 05.11.18 14:43, Toke Eskildsen wrote:

So far no answer from Sofiya. That's fair enough: My suggestions might
have seemed random. Let me try to qualify them a bit.


What we have to work with is the redacted query
q===0===24=2.2=json
and an earlier mention that sorting was complex.

My suggestions were to try

1) Only request simple sorting by score

If this improves performance substantially, we could try and see if
sorting could be made more efficient: Reducing complexity, pre-
calculating numbers etc.

2) Reduce rows to 0
3) Increase rows to 100

This measures one aspect of retrieval. If there is a big performance
difference between these two, we can further probe if the problem is
the number or size of fields - perhaps there is a ton of stored text,
perhaps there is a bunch of DocValued fields?

4) Set fl=id only

This is a variant of 2+3 to do a quick check if it is the resolving of
specific field values that is the problem. If using fl=id speeds up
substantially, the next step would be to add fields gradually until
(hopefully) there is a sharp performance decrease.

- Toke Eskildsen, Royal Danish Library




--
Sofiia Strochyk


s...@interlogic.com.ua <mailto:s...@interlogic.com.ua>
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>







--
Sofiia Strochyk


s...@interlogic.com.ua <mailto:s...@interlogic.com.ua>
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: SolrCloud scaling/optimization for high request rate

2018-11-12 Thread Ere Maijala

Sofiya,

Do you have docValues enabled for the id field? Apparently that can make 
a significant difference. I'm failing to find the relevant references 
right now, but just something worth checking out.


Regards,
Ere

Sofiya Strochyk kirjoitti 6.11.2018 klo 16.38:

Hi Toke,

sorry for the late reply. The query i wrote here is edited to hide 
production details, but I can post additional info if this helps.


I have tested all of the suggested changes none of these seem to make a 
noticeable difference (usually response time and other metrics fluctuate 
over time, and the changes caused by different parameters are smaller 
than the fluctuations). What this probably means is that the heaviest 
task is retrieving IDs by query and not fields by ID. I've also checked 
QTime logged for these types of operations, and it is much higher for 
"get IDs by query" than for "get fields by IDs list". What could be done 
about this?


On 05.11.18 14:43, Toke Eskildsen wrote:

So far no answer from Sofiya. That's fair enough: My suggestions might
have seemed random. Let me try to qualify them a bit.


What we have to work with is the redacted query
q===0===24=2.2=json
and an earlier mention that sorting was complex.

My suggestions were to try

1) Only request simple sorting by score

If this improves performance substantially, we could try and see if
sorting could be made more efficient: Reducing complexity, pre-
calculating numbers etc.

2) Reduce rows to 0
3) Increase rows to 100

This measures one aspect of retrieval. If there is a big performance
difference between these two, we can further probe if the problem is
the number or size of fields - perhaps there is a ton of stored text,
perhaps there is a bunch of DocValued fields?

4) Set fl=id only

This is a variant of 2+3 to do a quick check if it is the resolving of
specific field values that is the problem. If using fl=id speeds up
substantially, the next step would be to add fields gradually until
(hopefully) there is a sharp performance decrease.

- Toke Eskildsen, Royal Danish Library




--
Sofiia Strochyk


s...@interlogic.com.ua <mailto:s...@interlogic.com.ua>
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: TLOG replica stucks

2018-11-01 Thread Ere Maijala
Could it be related to reloading a collection? I need to do some 
testing, but it just occurred to me that reload was done at least once 
during the period the cluster had been up.


Regards,
Ere

Ere Maijala kirjoitti 30.10.2018 klo 12.03:

Hi,

We had the same happen with PULL replicas with Solr 7.5. Solr was 
showing that they all had correct index version, but the changes were 
not showing. Unfortunately the solr.log size was too small to catch any 
issues, so I've now increased and waiting for it to happen again.
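
For reference, a quick sketch of how that index version check can be done:
polling the ReplicationHandler's indexversion on each replica core and
comparing it to the leader. The core URLs below are placeholders.

# Sketch: compare indexversion/generation across replica cores using the
# ReplicationHandler (command=indexversion).
import requests

cores = [
    "http://solr1:8983/solr/mycoll_shard1_replica_p1",
    "http://solr2:8983/solr/mycoll_shard1_replica_p2",
]

for core in cores:
    resp = requests.get(core + "/replication",
                        params={"command": "indexversion", "wt": "json"})
    data = resp.json()
    print(core, "indexversion:", data.get("indexversion"),
          "generation:", data.get("generation"))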


Regards,
Ere

Vadim Ivanov kirjoitti 25.10.2018 klo 18.42:

Thanks Erick for your attention!
My comments below, but supposing that the problem resides in zookeeper
I'll collect more information  from zk logs and solr logs and be back 
soon.



bq. I've noticed that some replicas stop receiving updates from the
leader without any visible signs from the cluster status.

Hmm, yes, this isn't expected at all. What are you seeing that causes
you to say this? You'd have to be monitoring the log for update
messages to the replicas that aren't leaders or the like.  If anyone is
going to have a prayer of reproducing we'll need more info on exactly
what you're seeing and how you're measuring this.


Meanwhile, I have log level WARN... I'll decrease it to INFO and see. Tnx



Have you changed any configurations in your replicas at all? We'd need
the exact steps you performed if so.
Command to create replicas was like this (implicit sharding and custom 
CoreName ) :


mysolr07:8983/solr/admin/collections?action=ADDREPLICA
=rpk94
=rpk94_1_0
=rpk94_1_0_07
=tlog
=mysolr07:8983_solr



On a quick test I didn't see this, but if it were that easy to
reproduce I'd expect it to have shown up before.


Yesterday I've tried to reproduce...  trying to change leader with 
REBALANCELEADERS command.
It ended up with no leader at all for the shard  and I could not set 
leader at all for a long time.


    There was a problem trying to register as the 
leader:org.apache.solr.common.SolrException: Could not register as the 
leader because creating the ephemeral registration node in ZooKeeper 
failed

...
    Deleting duplicate registration: 
/collections/rpk94/leader_elect/rpk94_1_117/election/2983181187899523085-core_node73-n_22 


...
   Index fetch failed :org.apache.solr.common.SolrException: No 
registered leader was found after waiting for 4000ms , collection: 
rpk94 slice: rpk94_1_117

...

Even deleting all replicas for the shard and recreating a replica on the
same node with the same name did not help - no leader for that shard.
I had to delete the collection, wait till morning, and then it was recreated
successfully.

I suppose some weird znodes were deleted from ZK by morning.



NOTE: just looking at the cloud graph and having a node be active is
not _necessarily_ sufficient for the node to be up to date. It
_should_ be sufficient if (and only if) the node was shut down
gracefully, but a "kill -9" or similar doesn't give the replicas on
the node the opportunity to change the state. The "live_nodes" znode
in ZooKeeper must also contain the node the replica resides on.


Node was live, cluster was healthy



If you see this state again, you could try pinging the node directly,
does it respond? Your URL should look something like:
http://host:port/solr/colection_shard1_replica_t1/query?q=*:*=false 



Yes, sure I did. The ill replica responded and its number of documents differed
from the leader's.




The "distrib=false" is important as it won't forward the query to any
other replica. If what you're reporting is really happening, that node
should respond with a document count different from other nodes.

NOTE: there's a delay between the time the leader indexes a doc and
it's visible on the follower. Are you sure you're waiting for
leader_commit_interval+polling_interval+autowarm_time before
concluding that there's a problem? I'm a bit suspicious that checking
the versions is concluding that your indexes are out of sync when
really they're just catching up normally. If it's at all possible to
turn off indexing for a few minutes when this happens and everything
just gets better then it's not really a problem.


Sure, the problem was on many shards but not on all shards,
and it lasted a long time.



If we prove out that this is really happening as you think, then a
JIRA (with steps to reproduce) is _definitely_ in order.

Best,
Erick
On Wed, Oct 24, 2018 at 2:07 AM Vadim Ivanov
 wrote:


Hi All !

I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.

My collection has shards and every shard has 3 TLOG replicas on 
different

nodes.

I've noticed that some replicas stop receiving updates from the leader
without any visible signs from the cluster status.

(all replicas active and green in Admin UI CLOUD graph). But 
indexversion of

'ill' replica not increasing with the leader.

It seems to be dangerous, because that 'ill' replica could become a 
leader

after rest

Re: TLOG replica stucks

2018-10-30 Thread Ere Maijala
and
recreate ill replicas when difference with the leader indexversion  more
than one

Any suggestions?

--

Best regards, Vadim







--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: SolrCloud scaling/optimization for high request rate

2018-10-29 Thread Ere Maijala

Hi Sofiya,

You've already received a lot of ideas, but I think this wasn't yet 
mentioned: You didn't specify the number of rows your queries fetch or 
whether you're using deep paging in the queries. Both can be real 
performance killers in a sharded index because a large set of records 
have to be fetched from all shards. This consumes a relatively high 
amount of memory, and even if the servers are able to handle a certain 
number of these queries simultaneously, you'd run into garbage 
collection trouble with more queries being served. So just one more 
thing to be aware of!


Regards,
Ere

Sofiya Strochyk kirjoitti 26.10.2018 klo 18.55:

Hi everyone,

We have a SolrCloud setup with the following configuration:

  * 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
E5-1650v2, 12 cores, with SSDs)
  * One collection, 4 shards, each has only a single replica (so 4
replicas in total), using compositeId router
  * Total index size is about 150M documents/320GB, so about 40M/80GB
per node
  * Zookeeper is on a separate server
  * Documents consist of about 20 fields (most of them are both stored
and indexed), average document size is about 2kB
  * Queries are mostly 2-3 words in the q field, with 2 fq parameters,
with complex sort expression (containing IF functions)
  * We don't use faceting due to performance reasons but need to add it
in the future
  * Majority of the documents are reindexed 2 times/day, as fast as the
SOLR allows, in batches of 1000-1 docs. Some of the documents
are also deleted (by id, not by query)
  * autoCommit is set to maxTime of 1 minute with openSearcher=false and
autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits
from clients are ignored.
  * Heap size is set to 8GB.

Target query rate is up to 500 qps, maybe 300, and we need to keep 
response time at <200ms. But at the moment we only see very good search 
performance with up to 100 requests per second. Whenever it grows to 
about 200, average response time abruptly increases to 0.5-1 second. 
(Also it seems that request rate reported by SOLR in admin metrics is 2x 
higher than the real one, because for every query, every shard receives 
2 requests: one to obtain IDs and second one to get data by IDs; so 
target rate for SOLR metrics would be 1000 qps).


During high request load, CPU usage increases dramatically on the SOLR 
nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and 
about 93% on 1 server (random server each time, not the smallest one).


The documentation mentions replication to spread the load between the 
servers. We tested replicating to smaller servers (32GB RAM, Intel Core 
i7-4770). However, when we tested it, the replicas were going out of 
sync all the time (possibly during commits) and reported errors like 
"PeerSync Recovery was not successful - trying replication." Then they 
proceed with replication which takes hours and the leader handles all 
requests singlehandedly during that time. Also both leaders and replicas 
started encountering OOM errors (heap space) for unknown reason. Heap 
dump analysis shows that most of the memory is consumed by [J (array of 
long) type, my best guess would be that it is "_version_" field, but 
it's still unclear why it happens. Also, even though with replication 
request rate and CPU usage drop 2 times, it doesn't seem to affect 
mean_ms, stddev_ms or p95_ms numbers (p75_ms is much smaller on nodes 
with replication, but still not as low as under load of <100 requests/s).


Garbage collection is much more active during high load as well. Full GC 
happens almost exclusively during those times. We have tried tuning GC 
options like suggested here 
<https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector> 
and it didn't change things though.


My questions are

  * How do we increase throughput? Is replication the only solution?
  * if yes - then why doesn't it affect response times, considering that
CPU is not 100% used and index fits into memory?
  * How to deal with OOM and replicas going into recovery?
  * Is memory or CPU the main problem? (When searching on the internet,
i never see CPU as main bottleneck for SOLR, but our case might be
different)
  * Or do we need smaller shards? Could segments merging be a problem?
  * How to add faceting without search queries slowing down too much?
  * How to diagnose these problems and narrow down to the real reason in
hardware or setup?

Any help would be much appreciated.

Thanks!

--
Sofiia Strochyk


s...@interlogic.com.ua <mailto:s...@interlogic.com.ua>
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: CMS GC - Old Generation collection never finishes (due to GC Allocation Failure?)

2018-10-04 Thread Ere Maijala

Hi,

In addition to what others wrote already, there are a couple of things 
that might trigger sudden memory allocation surge that you can't really 
account for:


1. Deep paging, especially in a sharded index. Don't allow it and you'll 
be much happier.


2. Faceting without docValues especially in a large index.

These would be my top two things to check before anything else. I've 
gone from 48 GB heap and GC having massive trouble keeping up to 8 GB 
heap and no trouble at all just by getting rid of deep paging and using 
docValues with all faceted fields.
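
As an illustration of the second point, enabling docValues on a faceted field
is a schema change like the sketch below; the field name and URL are
placeholders, and the collection has to be reindexed afterwards for existing
documents to pick up the change.

# Sketch: switch a faceted field to docValues via the Schema API.
# A full reindex is required afterwards.
import requests

payload = {
    "replace-field": {
        "name": "category",
        "type": "string",
        "indexed": True,
        "stored": True,
        "docValues": True,
    }
}
requests.post("http://localhost:8983/solr/mycollection/schema",
              json=payload).raise_for_status()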


--Ere

yasoobhaider kirjoitti 3.10.2018 klo 17.01:

Hi

I'm working with a Solr cluster with master-slave architecture.

Master and slave config:
ram: 120GB
cores: 16

At any point there are between 10-20 slaves in the cluster, each serving ~2k
requests per minute. Each slave houses two collections of approx 10G
(~2.5mil docs) and 2G (10mil docs) when optimized.

I am working with Solr 6.2.1

Solr configuration:

-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:MaxTenuringThreshold=8
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=15
-XX:TargetSurvivorRatio=90
-Xmn10G
-Xms80G
-Xmx80G

Some of these settings, including the huge heap size, were arrived at
through repeated trial and error over time.

This cluster usually runs without any error.

In the usual scenario, old gen gc is triggered according to the
configuration at 50% old gen occupancy, and the collector clears out the
memory over the next minute or so. This happens every 10-15 minutes.

However, I have noticed that sometimes the GC pattern of the slaves
completely changes and old gen gc is not able to clear the memory.

After observing the gc logs closely for multiple old gen gc collections, I
noticed that the old gen gc is triggered at 50% occupancy, but if there is a
GC Allocation Failure before the collection completes (after CMS Initial
Remark but before CMS reset), the old gen collection is not able to clear
much memory. And as soon as this collection completes, another old gen gc is
triggered.

And in worst-case scenarios, this cycle of old gen GC triggering and GC
allocation failure keeps repeating, the old gen memory keeps increasing,
and it ends in a single-threaded STW GC, which is not able to do much,
so I have to restart the Solr server.

The last time this happened after the following sequence of events:

1. We optimized the bigger collection bringing it to its optimized size of
~10G.
2. For an unrelated reason, we had stopped indexing to the master. We
usually index at a low-ish throughput of ~1mil docs/day. This is relevant as
when we are indexing, the size of the collection increases, and this
affects the heap size used by the collection.
3. The slaves started behaving erratically, with old gen collection not
being able to free up the required memory and finally getting stuck in a
STW GC.

As unlikely as this sounds, this is the only thing that changed on the
cluster. There was no change in query throughput or type of queries.

I restarted the slaves multiple times but the gc behaved in the same way for
over three days. Then when we fixed the indexing and made it live, the
slaves resumed their original gc pattern and are running without any issues
for over 24 hours now.

I would really be grateful for any advice on the following:

1. What could be the reason behind CMS not being able to free up the memory?
What are some experiments I can run to solve this problem?
2. Can stopping/starting indexing be a reason for such drastic changes to GC
pattern?
3. I have read at multiple places on this mailing list that the heap size
should be much lower (2x-3x the size of collection), but the last time I
tried CMS was not able to run smoothly and GC STW would occur which was only
solved by a restart. My reasoning for this is that the type of queries and
the throughput are also factors in deciding the heap size, so it may be
that our queries are simply creating too many objects. Is my reasoning
correct or should I try with a lower heap size (if it helps achieve a stable
gc pattern)?

(4. Silly question, but what is the right way to ask question on the mailing
list? via mail or via the nabble website? I sent this question earlier as a
mail, but it was not showing up on the nabble website so I am posting it
from the website now)

-
-

Logs which show this:


Desired survivor size 568413384 bytes, new threshold 2 (max 8)
- age   1:  437184344 bytes,  

Re: Java version 11 for solr 7.5?

2018-09-27 Thread Ere Maijala

Shawn Heisey kirjoitti 26.9.2018 klo 21.16:

On 9/26/2018 9:35 AM, Jeff Courtade wrote:

My concern with using g1 is solely based on finding this.
Does anyone have any information on this?

https://wiki.apache.org/lucene-java/JavaBugs#Oracle_Java_.2F_Sun_Java_.2F_OpenJDK_Bugs 



I have never had a single problem with Solr running with the G1 
collector.  I'm only aware of one actual bug in Lucene that mentions G1 
... and it is specific to the 32-bit version of Java.  It is strongly 
recommended for other reasons to only use a 64-bit Java.


On the subject of the blog post mentioned by Zisis T... generally 
speaking, it is not a good idea to explicitly set the size of the 
various generations.  G1 will tune the sizes of each generation as it 
runs for best results.  By setting or limiting the size, that tuning 
cannot work with freedom, and you might be unhappy with the results.


Here is a wiki page that contains my own experiments with garbage 
collection tuning:


https://wiki.apache.org/solr/ShawnHeisey


There's one caveat that I know of: You might need to modify G1 tuning 
parameters if your Solr has dramatically changing usage patterns. For 
instance heavy search use during the day and heavy indexing work during 
the night. This may lead G1 to optimize too heavily for one and it 
taking a while to adjust for the other. At some point I've used e.g. the 
following settings to limit the young generation:


-XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=5

This was used with a very specific usage pattern, so it probably doesn't 
apply in most situations.
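
For reference, on Linux installations such options usually go into
solr.in.sh via GC_TUNE; a minimal sketch with purely illustrative
values, not a recommendation:

  # solr.in.sh (example values only)
  GC_TUNE="-XX:+UseG1GC \
    -XX:+UnlockExperimentalVMOptions \
    -XX:G1MaxNewSizePercent=5"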


Regards,
Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Boost matches occurring early in the field (offset)

2018-09-17 Thread Ere Maijala
The original question is interesting, and I'd also like to boost terms
with lower positions, but as for whether that's possible with the
payload stuff, the slides and the article at
https://lucidworks.com/2017/09/14/solr-payloads/ left me completely
confused. A simple complete example would be so great.


Regards,
Ere

Alexandre Rafalovitch kirjoitti 29.8.2018 klo 23.51:

TokenOffsetPayloadTokenFilter ? It is mentioned in
https://www.slideshare.net/lucidworks/payloads-in-solr-erik-hatcher-lucidworks
, but no detailed example seems to be given.

I do see this question from time to time, so a definitive feedback
would be useful for the future.

Regards,
Alex.

On 29 August 2018 at 16:18, Jan Høydahl  wrote:

I also tend to use "sentinel tokens" for exact match or to anchor a search. But 
in order to obtain decaying boost the further down in the article a match is, you'd need 
to write several such span/slop queries with varying slops, e.g. highest boost for first 
10 words, medium boost for first 50 words, low boost for first 150 words, no boost below 
that.

As I wrote in my initial mail, we can do such workarounds, or play with 
payloads etc. But my real question is whether/how it is possible to factor the 
actual term offset information from a matching term into the scoring algorithm? 
Would you need to implement your own Scorer/Weight impl?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


29. aug. 2018 kl. 15:37 skrev Doug Turnbull 
:

You can also insert a token at the beginning of the query during analysis
using a char filter. I call these sort of boundary tokens "sentinel
tokens". So a phrase search for "red shoes" becomes " red shoes".
You can add some slop to allow for permissible distance (with

You can also use the Limit Token Count Token Filter and create a copyField,
so if you want to boost on first 10 matches, just limit to 10 tokens then
use this as a boost query
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-LimitTokenCountFilter
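
A rough sketch of that approach (the collection name "mycoll", field
names and boost value are made up, untested): define a field type that
keeps only the first 10 tokens, copy the main text field into it, and
boost matches on it with a bq:

  curl -X POST -H 'Content-type:application/json' \
    http://localhost:8983/solr/mycoll/schema -d '{
      "add-field-type": {
        "name": "text_limited10",
        "class": "solr.TextField",
        "analyzer": {
          "tokenizer": { "class": "solr.StandardTokenizerFactory" },
          "filters": [
            { "class": "solr.LowerCaseFilterFactory" },
            { "class": "solr.LimitTokenCountFilterFactory",
              "maxTokenCount": "10" }
          ]
        }
      },
      "add-field": { "name": "text_first10", "type": "text_limited10",
                     "indexed": true, "stored": false },
      "add-copy-field": { "source": "text", "dest": "text_first10" }
    }'

  # Boost documents whose first 10 tokens contain the query terms
  curl 'http://localhost:8983/solr/mycoll/select?q=red+shoes&defType=edismax&qf=text&bq=text_first10:(red+shoes)%5E5'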

-Doug

On Wed, Aug 29, 2018 at 6:26 AM Mikhail Khludnev  wrote:



<https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-XMLQueryParser>




On Wed, Aug 29, 2018 at 1:19 PM Jan Høydahl  wrote:


Hi,

Is there an ootb way to boost term matches based on their position/offset
inside a field, so that the term gets a higher score if it occurs in the
beginning of the field and a lower boost or a deboost if it occurs towards
the end of a field?

I know that I could index the first part of the text in a new field and
boost on that, but that is kind of "binary".
I could also add the term offset as payload for every term and boost on
that, but this should not be necessary since offset info is already part

of

the index?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com




--
Sincerely yours
Mikhail Khludnev


--
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: 20180917-Need Apache SOLR support

2018-09-17 Thread Ere Maijala



Shawn Heisey kirjoitti 17.9.2018 klo 19.03:

7.   If I have Billions of indexes, If the "start" parameter is 10th
Million index and "end" parameter is start+100th index, for this case
any performance issue will be raised ?


Let's say that you send a request with these parameters, and the index 
has three shards:


start=10000000&rows=100

Every shard in the index is going to return a result to the coordinating 
node of ten million plus 100.  That's thirty million individual 
results.  The coordinating node will combine those results, sort them, 
and then request full documents for the 100 specific rows that were 
requested.  This takes a lot of time and a lot of memory.


What Shawn says above means that even if you give Solr a heap big enough 
to handle that, you'll run into serious performance issues even with a 
light load since these huge allocations easily lead to
stop-the-world garbage collections that kill performance. I've tried it 
and it was bad.


If you are thinking of a user interface that allows jumping to an 
arbitrary result page, you'll have to limit it to some sensible number 
of results (10 000 is probably safe, 100 000 may also work) or use
something other than Solr. Cursor mark or streaming are great options,
but only if you want to process all the records. Often the deep paging 
need is practically the need to see the last results, and that can also 
be achieved by allowing reverse sorting.
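
For completeness, a cursor-based request looks roughly like this (the
collection name is hypothetical, and the sort must end on the uniqueKey
field for cursors to work):

  # First page
  curl 'http://localhost:8983/solr/mycoll/select?q=*:*&rows=100&sort=id+asc&cursorMark=*'
  # Subsequent pages: pass the nextCursorMark value returned by the
  # previous response (the value below is just a placeholder)
  curl 'http://localhost:8983/solr/mycoll/select?q=*:*&rows=100&sort=id+asc&cursorMark=AoEjR0JQ'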


Regards,
Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: 7.3.1: Query of death - all nodes ran out of memory and had to be shut down

2018-08-21 Thread Ere Maijala
  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  at java.base/java.lang.Thread.run(Thread.java:844)

solr: WARN  DistributedUpdateProcessor Error sending update to http://10.0.8.157:8983/solr
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.0.8.157:8983/solr/media_shard6_replica_t35: Server Error
message repeated 2 times: []
request: http://10.0.8.157:8983/solr/media_shard6_replica_t35/update?update.distrib=FROMLEADER=http%3A%2F%2F10.0.10.117%3A8983%2Fsolr%2Fmedia_shard6_replica_t10%2F=javabin=2
Remote error message: java.util.concurrent.TimeoutException: Idle timeout expired: 120307/12 ms
  at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:383)
  at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:182)
  at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  at java.base/java.lang.Thread.run(Thread.java:844)

solr: WARN  DistributedUpdateProcessor Error sending update to http://10.0.9.47:8983/solr
org.apache.http.NoHttpResponseException: 10.0.9.47:8983 failed to respond
  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
  at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
  at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
  at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
  at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
  at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
  at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:118)
  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
  at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
  at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
  at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:347)
  at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:182)
  at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  at java.base/java.lang.Thread.run(Thread.java:844)

solr: WARN  DistributedUpdateProcessor Error sending update to http://10.0.8.157:8983/solr
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.0.8.157:8983/solr/media_shard2_replica_t20: Server Error


--
Ere Maijala
Kansalliskirjasto / The National Library of Finland



Re: Setting preferred replica for query/read

2018-06-07 Thread Ere Maijala

Hi,

What I did in SOLR-11982 was meant to be used with replica types. The 
idea is that you could have a set of NRT replicas used for indexing and 
a set of PULL replicas used for queries. That's the easiest way to split 
the work since PULL replicas never do indexing work, and then you can 
say in the queries that "shards.preference=replica.type:PULL" or have 
that as a default parameter in solrconfig. SOLR-8146 is not needed for 
this. I suppose now that SOLR-11982 is done, SOLR-8146 would only be 
needed to make it easier to set the preferred replica type etc.
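
As a minimal sketch (the collection name is made up), the preference can
be passed with the query, or set as a default parameter in the request
handler in solrconfig.xml:

  # Prefer PULL replicas for this request; Solr falls back to other
  # replica types if no PULL replica is available
  curl 'http://localhost:8983/solr/mycoll/select?q=*:*&shards.preference=replica.type:PULL'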


SOLR-11982 also allows you to use replica location in node preference. 
The nodes to use could be deduced from the cluster state and then you 
could use shards.preference with replica.location. But that means the 
client has to know which replicas to prefer.


Regards,
Ere

Zheng Lin Edwin Yeo kirjoitti 4.6.2018 klo 19.09:

Hi,

SOLR-8146 has not been updated since January last year, but I have just
commented on it.

So we need both to be updated in order to achieve the full functionality of
setting preferred replica for query/read? Currently, is there a way to
achieve this by other means?

Regards,
Edwin

On 4 June 2018 at 19:43, Ere Maijala  wrote:


Hi,

Well, SOLR-11982 adds server-side support for part of what SOLR-8146 aims
to do (shards.preference=replica.location:[something]). It doesn't do
regular expressions or snitches at the moment, though it would be easy to
add. So, it looks to me like SOLR-8146 would need to be updated in this
regard.

--Ere


Zheng Lin Edwin Yeo kirjoitti 4.6.2018 klo 12.45:


Hi,

Is there any similarities between these two requests in the JIRA regarding
setting of prefer replica function?

(SOLR-11982) Add support for preferReplicaTypes parameter

(SOLR-8146) Allowing SolrJ CloudSolrClient to have preferred replica for
query/read

I am looking at setting one of the replica to be the preferred replica for
query/read, and another replica to be use for indexing.

I am using Solr 7.3.1 currently.

Regards,
Edwin



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Setting preferred replica for query/read

2018-06-04 Thread Ere Maijala

Hi,

Well, SOLR-11982 adds server-side support for part of what SOLR-8146 
aims to do (shards.preference=replica.location:[something]). It doesn't 
do regular expressions or snitches at the moment, though it would be 
easy to add. So, it looks to me like SOLR-8146 would need to be updated 
in this regard.


--Ere

Zheng Lin Edwin Yeo kirjoitti 4.6.2018 klo 12.45:

Hi,

Is there any similarities between these two requests in the JIRA regarding
setting of prefer replica function?

(SOLR-11982) Add support for preferReplicaTypes parameter

(SOLR-8146) Allowing SolrJ CloudSolrClient to have preferred replica for
query/read

I am looking at setting one of the replica to be the preferred replica for
query/read, and another replica to be use for indexing.

I am using Solr 7.3.1 currently.

Regards,
Edwin



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Limit search queries only to pull replicas

2018-02-14 Thread Ere Maijala
I've now posted https://issues.apache.org/jira/browse/SOLR-11982 with a 
patch. It works just like preferLocalShards. SOLR-10880 is awesome, but 
my idea is not to filter out anything, so this just adjusts the order of 
nodes.


--Ere

Tomas Fernandez Lobbe kirjoitti 8.1.2018 klo 21.42:

This feature is not currently supported. I was thinking of implementing it by
extending the work done in SOLR-10880. I still didn’t have time to work on it 
though.  There is a patch for SOLR-10880 that doesn’t implement support for 
replica types, but could be used as base.

Tomás


On Jan 8, 2018, at 12:04 AM, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Server load alone doesn't always indicate the server's ability to serve 
queries. Memory and cache state are important too, and they're not as easy to 
monitor. Additionally, server load at any single point in time or a short term 
average is not indicative of the server's ability to handle search requests if 
indexing happens in short but intense bursts.

It can also complicate things if there are more than one Solr instance running 
on a single server.

I'm definitely not against intelligent routing. In many cases it makes perfect 
sense, and I'd still like to use it, just limited to the pull replicas.

--Ere

Erick Erickson kirjoitti 5.1.2018 klo 19.03:

Actually, I think a much better option is to route queries based on server load.
The theory of preferring pull replicas to leaders would be that the leader
will be doing the indexing work and the pull replicas would be doing less
work therefore serving queries faster. But that's a fragile assumption.
Let's say indexing stops totally. Now your leader is sitting there idle
when it could be serving queries.
The autoscaling work will allow for more intelligent routing, you can
monitor the CPU load on your servers and if the leader has some spare
cycles use them .vs. crudely routing all queries to pull replicas (or tlog
replicas for that matter). NOTE: I don't know whether this is being
actively worked on or not, but seems a logical extension of the increased
monitoring capabilities being put in place for autoscaling, but I'd rather
see effort put in there than support routing based solely on a node's type.
Best,
Erick
On Fri, Jan 5, 2018 at 7:51 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

It is interesting that ES had similar feature to prefer primary/replica
but it is deprecating that and will remove it - could not find an explanation why.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




On 5 Jan 2018, at 15:22, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Hi,

It would be really nice to have a server-side option, though. Not

everyone uses Solrj, and a typical fairly dummy client just queries the
server without any understanding about shards etc. Solr could be clever
enough to not forward the query to NRT shards when configured to prefer
PULL shards and they're available. Maybe it could be something similar to
the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".


--Ere

Emir Arnautović kirjoitti 14.12.2017 klo 11.41:

Hi Stanislav,
I don’t think that there is a built in feature to do this, but that

sounds like nice feature of Solrj - maybe you should check if available.
You can implement it outside of Solrj - check cluster state to see which
shards are available and send queries only to pull replicas.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <

s.sandalni...@gmail.com> wrote:


Hi,

We have a Solr 7.1 setup with SolrCloud where we have multiple shards

on one server (for indexing) each shard has a pull replica on other servers.


What are the possible ways to limit search request only to pull type

replicas?

At the moment the only solution I found is to append shards parameter

to each query, but if new shards added later it requires to change
solrconfig. Is it the only way to do this?


Thank you

Regards
Stanislav



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Request routing / load-balancing TLOG & PULL replica types

2018-02-14 Thread Ere Maijala

A patch is now available: https://issues.apache.org/jira/browse/SOLR-11982

--Ere

Greg Roodt kirjoitti 12.2.2018 klo 22.06:

Thanks Ere. I've taken a look at the discussion here:
http://lucene.472066.n3.nabble.com/Limit-search-queries-only-to-pull-replicas-td4367323.html
This is how I was imagining TLOG & PULL replicas would work, so if this
functionality does get developed, it would be useful to me.

I still have 2 questions at the moment:
1. I am running the single shard scenario. I'm thinking of using a
dedicated HTTP load-balancer in front of the PULL replicas only with
read-only queries directed directly at the load-balancer. In this
situation, the healthy PULL replicas *should* handle the queries on the
node itself without a proxy hop (assuming state=active). New PULL replicas
added to the load-balancer will internally proxy queries to the other PULL
or TLOG replicas while in state=recovering until the switch to
state=active. Is my understanding correct?

2. Is it all worth it? Is there any advantage to running a cluster of 3
TLOGs + 10 PULL replicas vs running 13 TLOG replicas?




On 12 February 2018 at 19:25, Ere Maijala <ere.maij...@helsinki.fi> wrote:


Your question about directing queries to PULL replicas only has been
discussed on the list. Look for topic "Limit search queries only to pull
replicas". What I'd like to see is something similar to the
preferLocalShards parameter. It could be something like
"preferReplicaTypes=TLOG,PULL". Tomás mentioned previously that
SOLR-10880 could be used as a base for such functionality, and I'm
considering taking a stab at implementing it.

--Ere


Greg Roodt kirjoitti 12.2.2018 klo 6.55:


Thank you both for your very detailed answers.

This is great to know. I knew that SolrJ had the cluster aware knowledge
(via zookeeper), but I was wondering what something like curl would do.
Great to know that internally the cluster will proxy queries to the
appropriate place regardless.

I am running the single shard scenario. I'm thinking of using a dedicated
HTTP load-balancer in front of the PULL replicas only with read-only
queries directed directly at the load-balancer. In this situation, the
healthy PULL replicas *should* handle the queries on the node itself
without a proxy hop (assuming state=active). New PULL replicas added to
the
load-balancer will internally proxy queries to the other PULL or TLOG
replicas while in state=recovering until the switch to state=active.

Is my understanding correct?

Is this sensible to do, or is it not worth it due to the smart proxying
that SolrCloud can do anyway?

If the TLOG and PULL replicas are so similar, is there any real advantage
to having a mixed cluster? I assume a bit less work is required across the
cluster to propagate writes if you only have 3 TLOG nodes vs 10+ PULL
nodes? Or would it be better to just have 13 TLOG nodes?





On 12 February 2018 at 15:24, Tomas Fernandez Lobbe <tflo...@apple.com>
wrote:

On the last question:

For Writes: Yes. Writes are going to be sent to the shard leader, and
since PULL replicas can’t  be leaders, it’s going to be a TLOG replica.
If
you are using CloudSolrClient, then this routing will be done directly
from
the client (since it will send the update to the leader), and if you are
using some other HTTP client, then yes, the PULL replica will forward the
update, the same way any non-leader node would.

For reads: this won’t happen today, and any replica can respond to
queries. I do believe there is value in this kind of routing logic,
sometimes you simply don’t want the leader to handle any queries,
specially
when queries can be expensive. You could do this today if you want, by
putting some load balancer in front and just direct your queries to the
nodes you know are PULL, but keep in mind that this would only work in
the
single shard scenario, and only if you hit an active replica (otherwise,
as
you said, the query will be routed to any other node of the shard,
regardless of the type), if you have multiple shards then you need to use
the “shards” parameter and tell Solr exactly which nodes you want to hit
for each shard (the “shards” approach can also be done in the single
shard
case, although you would be adding an extra hop I believe)

Tomás
Sent from my iPhone

On Feb 11, 2018, at 6:35 PM, Greg Roodt <gro...@gmail.com> wrote:


Hi

I have a question around how queries are routed and load-balanced in a
cluster of mixed TLOG and PULL replicas.

I thought that I might have to put a load-balancer in front of the PULL
replicas and direct queries at them manually as nodes are added and


removed


as PULL replicas. However, it seems that SolrCloud handles this
automatically?

If I add a new PULL replica node, it goes into state="recovering" while


it


pulls the core. As expected. What happens if queries are directed at
this
node while in this state? From what I am observing, the query gets


directed


to another nod

Re: Request routing / load-balancing TLOG & PULL replica types

2018-02-12 Thread Ere Maijala
2. In my experience using PULL replicas can have a significant positive 
effect on the server load. It depends of course on your analysis chain, 
but we do some fairly expensive analysis, and not having to do the same 
work X times does have a benefit. Unfortunately we need multiple shards 
so we can't currently isolate the query traffic from the indexing work.


I took a quick look at the shard selection code yesterday, and it seems 
it might be quite simple to add replica selection to the same place 
where preferLocalShards parameter is handled.


--Ere

Greg Roodt kirjoitti 12.2.2018 klo 22.06:

Thanks Ere. I've taken a look at the discussion here:
http://lucene.472066.n3.nabble.com/Limit-search-queries-only-to-pull-replicas-td4367323.html
This is how I was imagining TLOG & PULL replicas would work, so if this
functionality does get developed, it would be useful to me.

I still have 2 questions at the moment:
1. I am running the single shard scenario. I'm thinking of using a
dedicated HTTP load-balancer in front of the PULL replicas only with
read-only queries directed directly at the load-balancer. In this
situation, the healthy PULL replicas *should* handle the queries on the
node itself without a proxy hop (assuming state=active). New PULL replicas
added to the load-balancer will internally proxy queries to the other PULL
or TLOG replicas while in state=recovering until the switch to
state=active. Is my understanding correct?

2. Is it all worth it? Is there any advantage to running a cluster of 3
TLOGs + 10 PULL replicas vs running 13 TLOG replicas?




On 12 February 2018 at 19:25, Ere Maijala <ere.maij...@helsinki.fi> wrote:


Your question about directing queries to PULL replicas only has been
discussed on the list. Look for topic "Limit search queries only to pull
replicas". What I'd like to see is something similar to the
preferLocalShards parameter. It could be something like
"preferReplicaTypes=TLOG,PULL". Tomás mentioned previously that
SOLR-10880 could be used as a base for such functionality, and I'm
considering taking a stab at implementing it.

--Ere


Greg Roodt kirjoitti 12.2.2018 klo 6.55:


Thank you both for your very detailed answers.

This is great to know. I knew that SolrJ had the cluster aware knowledge
(via zookeeper), but I was wondering what something like curl would do.
Great to know that internally the cluster will proxy queries to the
appropriate place regardless.

I am running the single shard scenario. I'm thinking of using a dedicated
HTTP load-balancer in front of the PULL replicas only with read-only
queries directed directly at the load-balancer. In this situation, the
healthy PULL replicas *should* handle the queries on the node itself
without a proxy hop (assuming state=active). New PULL replicas added to
the
load-balancer will internally proxy queries to the other PULL or TLOG
replicas while in state=recovering until the switch to state=active.

Is my understanding correct?

Is this sensible to do, or is it not worth it due to the smart proxying
that SolrCloud can do anyway?

If the TLOG and PULL replicas are so similar, is there any real advantage
to having a mixed cluster? I assume a bit less work is required across the
cluster to propagate writes if you only have 3 TLOG nodes vs 10+ PULL
nodes? Or would it be better to just have 13 TLOG nodes?





On 12 February 2018 at 15:24, Tomas Fernandez Lobbe <tflo...@apple.com>
wrote:

On the last question:

For Writes: Yes. Writes are going to be sent to the shard leader, and
since PULL replicas can’t  be leaders, it’s going to be a TLOG replica.
If
you are using CloudSolrClient, then this routing will be done directly
from
the client (since it will send the update to the leader), and if you are
using some other HTTP client, then yes, the PULL replica will forward the
update, the same way any non-leader node would.

For reads: this won’t happen today, and any replica can respond to
queries. I do believe there is value in this kind of routing logic,
sometimes you simply don’t want the leader to handle any queries,
specially
when queries can be expensive. You could do this today if you want, by
putting some load balancer in front and just direct your queries to the
nodes you know are PULL, but keep in mind that this would only work in
the
single shard scenario, and only if you hit an active replica (otherwise,
as
you said, the query will be routed to any other node of the shard,
regardless of the type), if you have multiple shards then you need to use
the “shards” parameter and tell Solr exactly which nodes you want to hit
for each shard (the “shards” approach can also be done in the single
shard
case, although you would be adding an extra hop I believe)

Tomás
Sent from my iPhone

On Feb 11, 2018, at 6:35 PM, Greg Roodt <gro...@gmail.com> wrote:


Hi

I have a question around how queries are routed and load-balanced in a
cluster of mixed TLOG and PULL replicas.

I though

Re: Request routing / load-balancing TLOG & PULL replica types

2018-02-12 Thread Ere Maijala
Your question about directing queries to PULL replicas only has been 
discussed on the list. Look for topic "Limit search queries only to pull 
replicas". What I'd like to see is something similar to the 
preferLocalShards parameter. It could be something like 
"preferReplicaTypes=TLOG,PULL". Tomás mentioned previously that 
SOLR-10880 could be used as a base for such functionality, and I'm
considering taking a stab at implementing it.


--Ere

Greg Roodt kirjoitti 12.2.2018 klo 6.55:

Thank you both for your very detailed answers.

This is great to know. I knew that SolrJ had the cluster aware knowledge
(via zookeeper), but I was wondering what something like curl would do.
Great to know that internally the cluster will proxy queries to the
appropriate place regardless.

I am running the single shard scenario. I'm thinking of using a dedicated
HTTP load-balancer in front of the PULL replicas only with read-only
queries directed directly at the load-balancer. In this situation, the
healthy PULL replicas *should* handle the queries on the node itself
without a proxy hop (assuming state=active). New PULL replicas added to the
load-balancer will internally proxy queries to the other PULL or TLOG
replicas while in state=recovering until the switch to state=active.

Is my understanding correct?

Is this sensible to do, or is it not worth it due to the smart proxying
that SolrCloud can do anyway?

If the TLOG and PULL replicas are so similar, is there any real advantage
to having a mixed cluster? I assume a bit less work is required across the
cluster to propagate writes if you only have 3 TLOG nodes vs 10+ PULL
nodes? Or would it be better to just have 13 TLOG nodes?





On 12 February 2018 at 15:24, Tomas Fernandez Lobbe <tflo...@apple.com>
wrote:


On the last question:
For Writes: Yes. Writes are going to be sent to the shard leader, and
since PULL replicas can’t  be leaders, it’s going to be a TLOG replica. If
you are using CloudSolrClient, then this routing will be done directly from
the client (since it will send the update to the leader), and if you are
using some other HTTP client, then yes, the PULL replica will forward the
update, the same way any non-leader node would.

For reads: this won’t happen today, and any replica can respond to
queries. I do believe there is value in this kind of routing logic,
sometimes you simply don’t want the leader to handle any queries, specially
when queries can be expensive. You could do this today if you want, by
putting some load balancer in front and just direct your queries to the
nodes you know are PULL, but keep in mind that this would only work in the
single shard scenario, and only if you hit an active replica (otherwise, as
you said, the query will be routed to any other node of the shard,
regardless of the type), if you have multiple shards then you need to use
the “shards” parameter and tell Solr exactly which nodes you want to hit
for each shard (the “shards” approach can also be done in the single shard
case, although you would be adding an extra hop I believe)
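
As a sketch of that last point (host and core names are made up), the
shards parameter lists one entry per shard, with '|' separating the
alternative replicas to use within a shard:

  curl 'http://anynode:8983/solr/mycoll/select?q=*:*&shards=pull1:8983/solr/mycoll_shard1_replica_p1|pull2:8983/solr/mycoll_shard1_replica_p3,pull1:8983/solr/mycoll_shard2_replica_p5|pull2:8983/solr/mycoll_shard2_replica_p7'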

Tomás
Sent from my iPhone


On Feb 11, 2018, at 6:35 PM, Greg Roodt <gro...@gmail.com> wrote:

Hi

I have a question around how queries are routed and load-balanced in a
cluster of mixed TLOG and PULL replicas.

I thought that I might have to put a load-balancer in front of the PULL
replicas and direct queries at them manually as nodes are added and

removed

as PULL replicas. However, it seems that SolrCloud handles this
automatically?

If I add a new PULL replica node, it goes into state="recovering" while

it

pulls the core. As expected. What happens if queries are directed at this
node while in this state? From what I am observing, the query gets

directed

to another node?

If SolrCloud is handling the routing of requests to active nodes, will it
automatically favour PULL replicas for read queries and TLOG replicas for
writes?

Thanks
Greg






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Long GC Pauses

2018-02-07 Thread Ere Maijala




Firstly, it's really hard to even take guesses about potential causes
or remediations without more details about your load characteristics
(average/peak QPS, index size, average document size, etc.).  If no
one gives any satisfactory advice, please consider uploading
additional details to help us help you.

Secondly, I don't know anything about the load characteristics you're
putting on your Solr cluster, but I'm curious whether you've
experimented with lower RAM settings.  Generally speaking, the more
RAM you have, the longer your GC pauses are likely to be (even with
the tuning that various GC settings provide).  If you can get away
with giving the Solr process less RAM, you should see your GC pauses
shrink.  Was 40GB chosen after some trial-and-error experimentation,
or is it something you could investigate?

For a bit more overview on this, see this slightly outdated (but still
useful) wiki page:

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Hope that helps, even if just to disqualify some potential
causes/solutions to close in on a real fix.

Best,

Jason

On Wed, Jan 31, 2018 at 8:17 AM, Maulin Rathod <mrat...@asite.com> wrote:

Hi,

We are using solr cloud 6.1. We have around 20 collection on 4 nodes
(We have 2 shards and each shard have 2 replicas). We have allocated
40 GB RAM to each shard.

Intermittently we found long GC pauses (60 sec to 200 sec) due to
which solr stops responding and hence collections goes in recovering
mode. It takes minimum 5-10 minutes (sometime it takes more and we
have to restart the solr node) for recovering all collections. We are
using default GC setting (CMS) as per solr.cmd.

We tried different G1 GC to see if it help, but still we see long GC
pauses (60 sec to 200 sec) and also found that memory usage is more
in case of G1 GC.

What could be reason for long GC pauses and how can fix it?
Insufficient memory or problem with GC setting or something else? Any
suggestion would be greatly appreciated.

In our analysis, we also found some inefficient queries (which uses *
many times in query) in solr logs. Could it be reason for high memory
usage?

Slow Query
--

INFO  (qtp1239731077-498778) [c:documents s:shard1 r:core_node1 x:documents] o.a.s.c.S.Request [documents]  webapp=/solr path=/select params={df=summary=false=id=4&start=0=true=description+asc,id+desc==s1.asite.com:8983/solr/documents|s1r1.asite.com:8983/solr/documents=250=2=((id:(REV78364_24705418+REV78364_24471492+REV78364_24471429+REV78364_24470771+REV78364_24470271+))+OR+summary:((HPC*+AND+*+AND+*+AND+OH1150*+AND+*+AND+*+AND+U0*+AND+*+AND+*+AND+HGS*+AND+*+AND+*+AND+MDL*+AND+*+AND+*+AND+100067*+AND+*+AND+-*+AND+Reinforcement*+AND+*+AND+Mode*)+))++AND++(title:((*HPC\+\-\+OH1150\+\-\+U0\+\-\+HGS\+\-\+MDL\+\-\+100067\+-\+Reinforcement\+Mode*)+))+AND+project_id:(-2+78243+78365+78364)+AND+is_active:true+AND+((isLatest:(true)+AND+isFolderActive:true+AND+isXref:false+AND+-document_type_id:(3+7)+AND+((is_public:true+OR+distribution_list:4858120+OR+folderadmin_list:4858120+OR+author_user_id:4858120)+AND+((defaultAccess:(true)+OR+allowedUsers:(4858120)+OR+allowedRoles:(6342201+172408+6336860)+OR+combinationUsers:(4858120))+AND+-blockedUsers:(4858120+OR+(isLatestRevPrivate:(true)+AND+allowedUsersForPvtRev:(4858120)+AND+-folderadmin_list:(4858120)))=true=1516786982952=true=javabin} hits=0 status=0 QTime=83309

Regards,

Maulin
--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: 7.2.1 cluster dies within minutes after restart

2018-02-01 Thread Ere Maijala
sters to 7.2.1 and I am not sure I quite follow the conversation here.
Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher
value in the config (and it's just a default value being
wrong/overridden somewhere)? Or is it more severe in the sense that any
config set for ZK_CLIENT_TIMEOUT by the user is just ignored completely
by Solr in 7.2.1?

Thanks
SG


On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Ok, i applied the patch and it is clear the timeout is 15000. Solr.xml
says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset
in solr.in.sh, but set in bin/solr to 15000. So it seems Solr's default
is still 15000, not 30000.

But, back to my topic. I see we explicitly set it in solr.in.sh to
30000. To be sure, i applied your patch to a production machine, all
our collections run with 30000. So how would that explain this log
line?

o.a.z.ClientCnxn Client session timed out, have not heard from server in 22130ms

We also see these with smaller values, seven seconds. And, is this
actually an indicator of the problems we have?

Any ideas?

Many thanks,
Markus

-Original message-
From: Markus Jelsma <markus.jel...@openindex.io>
Sent: Saturday 27th January 2018 10:03
To: solr-user@lucene.apache.org
Subject: RE: 7.2.1 cluster dies within minutes after restart

Hello,

I grepped for it yesterday and found nothing but 30000 in the settings,
but judging from the weird time out value, you may be right. Let me
apply your patch early next week and check for spurious warnings.

Another noteworthy observation for those working on cloud stability and
recovery: whenever this happens, some nodes are also absolutely sure to
run OOM. The leaders usually live longest, the replicas don't; their
heap usage peaks every time, consistently.

Thanks,
Markus

-Original message-
From: Shawn Heisey <apa...@elyograg.org>
Sent: Saturday 27th January 2018 0:49
To: solr-user@lucene.apache.org
Subject: Re: 7.2.1 cluster dies within minutes after restart

On 1/26/2018 10:02 AM, Markus Jelsma wrote:

o.a.z.ClientCnxn Client session timed out, have not heard from
server in 22130ms (although zkClientTimeOut is 30000).

Are you absolutely certain that there is a setting for zkClientTimeout
that is actually getting applied?  The default value in Solr's example
configs is 30 seconds, but the internal default in the code (when no
configuration is found) is still 15.  I have confirmed this in the
code.

Looks like SolrCloud doesn't log the values it's using for things like
zkClientTimeout.  I think it should.

https://issues.apache.org/jira/browse/SOLR-11915

Thanks,
Shawn
--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Limit search queries only to pull replicas

2018-01-08 Thread Ere Maijala
Server load alone doesn't always indicate the server's ability to serve 
queries. Memory and cache state are important too, and they're not as 
easy to monitor. Additionally, server load at any single point in time 
or a short term average is not indicative of the server's ability to 
handle search requests if indexing happens in short but intense bursts.


It can also complicate things if there are more than one Solr instance 
running on a single server.


I'm definitely not against intelligent routing. In many cases it makes 
perfect sense, and I'd still like to use it, just limited to the pull 
replicas.


--Ere

Erick Erickson kirjoitti 5.1.2018 klo 19.03:

Actually, I think a much better option is to route queries based on server load.

The theory of preferring pull replicas to leaders would be that the leader
will be doing the indexing work and the pull replicas would be doing less
work therefore serving queries faster. But that's a fragile assumption.
Let's say indexing stops totally. Now your leader is sitting there idle
when it could be serving queries.

The autoscaling work will allow for more intelligent routing, you can
monitor the CPU load on your servers and if the leader has some spare
cycles use them .vs. crudely routing all queries to pull replicas (or tlog
replicas for that matter). NOTE: I don't know whether this is being
actively worked on or not, but seems a logical extension of the increased
monitoring capabilities being put in place for autoscaling, but I'd rather
see effort put in there than support routing based solely on a node's type.

Best,
Erick

On Fri, Jan 5, 2018 at 7:51 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:


It is interesting that ES had similar feature to prefer primary/replica
but it is deprecating that and will remove it - could not find an explanation why.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




On 5 Jan 2018, at 15:22, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Hi,

It would be really nice to have a server-side option, though. Not

everyone uses Solrj, and a typical fairly dummy client just queries the
server without any understanding about shards etc. Solr could be clever
enough to not forward the query to NRT shards when configured to prefer
PULL shards and they're available. Maybe it could be something similar to
the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".


--Ere

Emir Arnautović kirjoitti 14.12.2017 klo 11.41:

Hi Stanislav,
I don’t think that there is a built in feature to do this, but that

sounds like nice feature of Solrj - maybe you should check if available.
You can implement it outside of Solrj - check cluster state to see which
shards are available and send queries only to pull replicas.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <

s.sandalni...@gmail.com> wrote:


Hi,

We have a Solr 7.1 setup with SolrCloud where we have multiple shards

on one server (for indexing) each shard has a pull replica on other servers.


What are the possible ways to limit search request only to pull type

replicas?

At the moment the only solution I found is to append shards parameter

to each query, but if new shards added later it requires to change
solrconfig. Is it the only way to do this?


Thank you

Regards
Stanislav



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland







--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Limit search queries only to pull replicas

2018-01-07 Thread Ere Maijala
Interesting indeed, but maybe in line with the idea that ES knows what 
to do best without the user interfering.


My example parameter name was bad, it should have been something like 
"preferReplicaTypes=TLOG,PULL". I can't see what would be bad about 
that, but then to me it seems Solr has always been much more about 
giving control to the administrator or developer instead of 
automatically just working. This may be daunting in the beginning, but 
it seems I always start to look for more control of how things are done 
in the long run.


--Ere

Emir Arnautović kirjoitti 5.1.2018 klo 17.51:

It is interesting that ES had similar feature to prefer primary/replica but it 
is deprecating that and will remove it - could not find an explanation why.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




On 5 Jan 2018, at 15:22, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Hi,

It would be really nice to have a server-side option, though. Not everyone uses Solrj, 
and a typical fairly dummy client just queries the server without any understanding about 
shards etc. Solr could be clever enough to not forward the query to NRT shards when 
configured to prefer PULL shards and they're available. Maybe it could be something 
similar to the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".

--Ere

Emir Arnautović kirjoitti 14.12.2017 klo 11.41:

Hi Stanislav,
I don’t think that there is a built in feature to do this, but that sounds like 
nice feature of Solrj - maybe you should check if available. You can implement 
it outside of Solrj - check cluster state to see which shards are available and 
send queries only to pull replicas.
HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <s.sandalni...@gmail.com> wrote:

Hi,

We have a Solr 7.1 setup with SolrCloud where we have multiple shards on one 
server (for indexing) each shard has a pull replica on other servers.

What are the possible ways to limit search request only to pull type replicas?
At the moment the only solution I found is to append shards parameter to each 
query, but if new shards added later it requires to change solrconfig. Is it 
the only way to do this?

Thank you

Regards
Stanislav



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Limit search queries only to pull replicas

2018-01-05 Thread Ere Maijala

Hi,

It would be really nice to have a server-side option, though. Not 
everyone uses Solrj, and a typical fairly dummy client just queries the 
server without any understanding about shards etc. Solr could be clever 
enough to not forward the query to NRT shards when configured to prefer 
PULL shards and they're available. Maybe it could be something similar 
to the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".


--Ere

Emir Arnautović kirjoitti 14.12.2017 klo 11.41:

Hi Stanislav,
I don’t think that there is a built in feature to do this, but that sounds like 
nice feature of Solrj - maybe you should check if available. You can implement 
it outside of Solrj - check cluster state to see which shards are available and 
send queries only to pull replicas.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <s.sandalni...@gmail.com> wrote:

Hi,

We have a Solr 7.1 setup with SolrCloud where we have multiple shards on one 
server (for indexing) each shard has a pull replica on other servers.

What are the possible ways to limit search request only to pull type replicas?
At the moment the only solution I found is to append shards parameter to each 
query, but if new shards added later it requires to change solrconfig. Is it 
the only way to do this?

Thank you

Regards
Stanislav





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: slow solr facet processing

2018-01-05 Thread Ere Maijala

Hi Everyone,

This is a followup on the discussion from September 2017. Since then 
I've spent a lot of time gathering a better understanding on docValues 
compared to UIF and other stuff related to Solr performance. Here's a 
summary of the results based on my real-world experience:


1. Making sure Solr needs as little Java heap as possible is crucial.

2. UIF requires a lot of Java heap. With a larger index it becomes 
impractical, since Java GC can't easily keep up with the heaps required.


3. UIF is really fast, but only after serious warmup. DocValues work 
better if the index is updated regularly, since same level of warmup is 
not needed.


4. DocValues, taking advantage of memory-mapped files, don't have the 
above problem, and after moving to all-docValues we have been able to 
reduce the Java heap from 31G to 6G. This is pretty significant, since 
it means we don't have to deal with long GC pauses.


5. Make sure docValues are enabled also for all fields used for sorting. 
This helps avoid spending memory on field cache. Without docValues we 
could easily have 2 GB of field cache entries.


6. It seems that having docValues for the id field is useful too. For 
now stored needs to remain true too (see 
https://issues.apache.org/jira/browse/SOLR-10816).


7. Sharding the index helps faceting with docValues perform more work in 
parallel and results in a lot better performance. This doesn't seem to 
negatively affect the overall performance (at least enough to be 
perceived), and it seems that splitting our index to three shards 
resulted in speedup that's better than previous performance divided by 
three. There is a caveat [1], though.


8. In many cases fields that have docValues enabled can be switched from 
stored="true" to stored="false" since Solr can fetch the contents from 
docValues. A notable exception is multivalued fields where the order of 
the values is important. This means that enabling docValues doesn't add 
to the index size significantly.


9. Different replica types available in Solr 7 are really useful in 
reducing the CPU time spent indexing records. I'd still like to have a 
way to have PULL replicas with NRT replicas so that only the PULL 
replicas handle search queries.


10. Lastly, a lot can be done on the application level. For instance in 
our case many users don't care about facets or only use a couple of 
them, so we fetch them asynchronously as needed and collapse most by 
default without fetching them at all. This lowers the server load 
significantly (I'll work on contributing the option to upstream VuFind).



I hope this helps others make informed choices.

--Ere


[1] Care must be taken to avoid requests that cause Solr to fetch a lot 
of rows at once from each shard, since that blows up the memory usage 
wreaking havoc in Solr. One particular case that, at first sight, 
doesn't look too dangerous, is deep paging without a cursor (Yonik has a 
good explanation of this at http://yonik.com/solr/paging-and-deep-paging/).


Re: SolrCloud not able to view cloud page - Loading of "/solr/zookeeper?wt=json" failed (HTTP-Status 500)

2017-10-30 Thread Ere Maijala
On the Solr side there's at least 
https://issues.apache.org/jira/browse/SOLR-9818 which may cause trouble 
with the queue. I once had the core reload command in the admin UI add 
more than 200k entries to the overseer queue..


--Ere

Shawn Heisey kirjoitti 25.10.2017 klo 15.57:

On 10/24/2017 8:11 AM, Tarjono, C. A. wrote:

Would like to check if anyone have seen this issue before, we started
having this a few days ago:

  


The only error I can see in solr console is below:

5960847[main-SendThread(172.16.130.132:2281)] WARN
org.apache.zookeeper.ClientCnxn [ ] – Session 0x65f4e28b7370001 for
server 172.16.130.132/172.16.130.132:2281, unexpected error, closing
socket connection and attempting reconnect java.io.IOException: Packet
len30829010 is out of range!



Combining the last part of what I quoted above with the image you shared
later, I am pretty sure I know what is happening.

The overseer queue in zookeeper (at the ZK path of /overseer/queue) has
a lot of entries in it.  Based on the fact that you are seeing a packet
length beyond 30 million bytes, I am betting that the number of entries
in the queue is between 1.5 million and 2 million.  ZK cannot handle
that packet size without a special startup argument.  The value of the
special parameter defaults to a little over one million bytes.

To fix this, you're going to need to wipe out the overseer queue.  ZK
includes a script named ZkCli.  Note that Solr includes a script called
zkcli as well, which does very different things.  You need the one
included with zookeeper.

Wiping out the queue when it is that large is not straightforward.  You
need to start the ZkCli script included with zookeeper with a
-Djute.maxbuffer=31000000 argument and the same zkHost value used by
Solr, and then use a command like "rmr /overseer/queue" in that command
shell to completely remove the /overseer/queue path.  Then you can
restart the ZK servers without the jute.maxbuffer setting.  You may need
to restart Solr.  Running this procedure might also require temporarily
restarting the ZK servers with the same jute.maxbuffer argument, but I
am not sure whether that is required.
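
Something along these lines should work; the exact script name and the way to pass
JVM flags vary a bit between ZooKeeper versions, and the buffer value below is just
an example that exceeds the reported packet length:

JVMFLAGS="-Djute.maxbuffer=31000000" bin/zkCli.sh -server zk1:2181,zk2:2181,zk3:2181
rmr /overseer/queue

The rmr command is typed into the shell that the script opens.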

The basic underlying problem here is that ZK allows adding new nodes
even when the size of the parent node exceeds the default buffer size.
That issue is documented here:

https://issues.apache.org/jira/browse/ZOOKEEPER-1162

I can't be sure why your cloud is adding so many entries to the
overseer queue.  I have seen this problem happen when restarting a
server in the cloud, particularly when there are a large number of
collections or shard replicas in the cloud.  Restarting multiple servers
or restarting the same server multiple times without waiting for the
overseer queue to empty could also cause the issue.

Thanks,
Shawn



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Sorting by distance resources with WKT polygon data

2017-09-22 Thread Ere Maijala

Hi,

our strategy is to have a separate center coordinate field that we use 
for sorting. This has the additional benefit that it's possible to have 
the indexed center point differ from the polygon's centroid, which can 
be useful e.g. with cities, where the city center can be quite a bit 
offset from the centroid.
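
As a rough sketch of that setup (field and type names are only illustrative): the 
geometry goes into an RPT field and the hand-picked center into a plain point field 
that is used only for sorting:

<field name="geometry" type="location_rpt" indexed="true" stored="true"/>
<field name="center" type="location" indexed="true" stored="true"/>

q={!geofilt sfield=geometry pt=45.52,-73.53 d=10}&sfield=center&pt=45.52,-73.53&sort=geodist() asc

Here geodist() picks up the global sfield/pt parameters, so the filter still runs 
against the geometry field while the sort uses the center point.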


--Ere

Grondin Luc kirjoitti 13.9.2017 klo 0.07:

Hello,

I am having difficulties sorting resources by distance when they are indexed with WKT 
geolocation data. I have tried different field configurations and query 
parameters, but I did not get working results.

I am using SOLR 6.6 and JTS-core 1.14. My test sample includes resources with point coordinates 
plus one associated with a polygon. I tried using both fieldtypes 
"solr.SpatialRecursivePrefixTreeFieldType" and 
"solr.RptWithGeometrySpatialField". In both cases, I get good results if I do not care 
about sorting. The problem arises when I include sorting.

With SpatialRecursivePrefixTreeFieldType:

The best request I used, based on the documentation I could find, was:
select?fl=*,score={!geofilt%20sfield=PositionGeo%20pt=45.52,-73.53%20d=10%20score=distance}=score%20asc

The distance appears to be correctly evaluated for resources indexed with point 
coordinates. However, it is wrong for the resource with a polygon


   2.3913236
   4.3242383
   4.671504
   4.806902
   20015.115


(Please note that I have verified the polygon externally and it is correct)

With solr.RptWithGeometrySpatialField:

I get an exception triggered by the presence of « score=distance » in the 
request « 
q={!geofilt%20sfield=PositionGeo%20pt=45.52,-73.53%20d=10%20score=distance} »

java.lang.UnsupportedOperationException
 at 
org.apache.lucene.spatial.composite.CompositeSpatialStrategy.makeDistanceValueSource(CompositeSpatialStrategy.java:92)
 at 
org.apache.solr.schema.AbstractSpatialFieldType.getValueSourceFromSpatialArgs(AbstractSpatialFieldType.java:412)
 at 
org.apache.solr.schema.AbstractSpatialFieldType.getQueryFromSpatialArgs(AbstractSpatialFieldType.java:359)
 at 
org.apache.solr.schema.AbstractSpatialFieldType.createSpatialQuery(AbstractSpatialFieldType.java:308)
 at 
org.apache.solr.search.SpatialFilterQParser.parse(SpatialFilterQParser.java:80)

From there, I am rather stuck with no ideas on how to resolve these problems. 
So advice in that regard would be much appreciated. I can provide more 
details if necessary.

Thank you in advance,


  ---
   Luc Grondin
   Analyste en gestion de l'information numérique
   Centre d'expertise numérique pour la recherche - Université de Montréal
   téléphone: 514-343-6111 p. 3988  --  
luc.gron...@umontreal.ca<mailto:luc.gron...@umontreal.ca>




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: slow solr facet processing

2017-09-05 Thread Ere Maijala

Toke Eskildsen kirjoitti 5.9.2017 klo 13.49:

On Mon, 2017-09-04 at 11:03 -0400, Yonik Seeley wrote:

It's due to this (see comments in UnInvertedField):


I have read that. What I don't understand is the difference between 4.x
and 6.x. But as you say, Ere seems to be in the process of verifying
whether this is simply due to more segments in 6.x.


During my testing I never optimized the 4.x index, so unless it 
maintains a minimal number of segments automatically, there's something 
else too.



There's probably a number of ways we can speed this up somewhat:
- optimize how much memory is used to store the term index and use
the savings to store more than every 128th term
- store the terms contiguously in block(s)


I'm considering taking a shot at that. A fairly easy optimization would
be to replace the BytesRef[] indexedTermsArray with a BytesRefArray.


I'd be happy to try out any patches.. :)

--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: slow solr facet processing

2017-09-05 Thread Ere Maijala

Yonik Seeley kirjoitti 4.9.2017 klo 18.03:

It's due to this (see comments in UnInvertedField):
*   To further save memory, the terms (the actual string values) are
not all stored in
*   memory, but a TermIndex is used to convert term numbers to term values only
*   for the terms needed after faceting has completed.  Only every
128th term value
*   is stored, along with its corresponding term number, and this is used as an
*   index to find the closest term and iterate until the desired number is hit

There's probably a number of ways we can speed this up somewhat:
- optimize how much memory is used to store the term index and use the
savings to store more than every 128th term
- store the terms contiguously in block(s)
- don't store the whole term, only store what's needed to seek to the
Nth term correctly
- when retrieving many terms, sort them first and convert from ord->str in order


For what it's worth, I've now tested on our production servers that can 
hold the full index in memory, and the results are in line with the 
previous ones (47 million records, 1785 buckets in the tested facet):


1.) index with docValues="true":

- unoptimized: ~6000ms if facet.method is not specified
- unoptimized: ~7000ms with facet.method=uif
- optimized: ~7800ms if facet.method is not specified
- optimized: ~7700ms with facet.method=uif

Note that optimization took its time and other activity varies 
throughout the day, so the numbers between optimized and unoptimized 
cannot be directly compared. Still bugs me a bit that the optimized 
index seems to be a bit slower here.


2.) index with docValues="false":

- unoptimized: ~2600ms if facet.method is not specified
- unoptimized ~1200ms with facet.method=uif
- optimized: ~2600ms if facet.method is not specified
- optimized: ~17ms with facet.method=uif

--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: slow solr facet processing

2017-09-04 Thread Ere Maijala

Toke Eskildsen kirjoitti 4.9.2017 klo 13.38:

On Mon, 2017-09-04 at 13:21 +0300, Ere Maijala wrote:

Thanks for the insight, Yonik. I can confirm that #2 is true. I ran an optimize,
and after it completed I was able to retrieve 2000 values in 17ms.


Very interesting. Is this on spinning disks or SSD? Is your index data
cached in memory? What I am aiming at is if this is primarily a "many
relatively slow random access"-thing or more due to the way DocValues
are represented in the segments (the codec).


I indexed a few million new/changed records, and the performance is back 
to slow. Upside is that I can test again with a slow server.


It's spinning disks on a SAN, and the full index doesn't fit into 
memory. I don't see any IO wait, and repeated attempts are just as slow 
even though I would have thought the relevant parts would be cached in 
memory. During testing and reporting the results I've always discarded 
the very first requests since they're always slower than subsequent 
repeats due to there being another test index on the same server. Maybe 
worth noting is that while there's no IO wait, there is fairly high CPU 
usage for Solr's Java process hovering around 100% if I repeat the 
request in a loop.


I took a quick sample with VisualVM, and the top hotspots are:

org.apache.solr.search.facet.UnInvertedField.getCounts()                   self 7,356 ms (32.1%), total 7,655 ms
org.apache.lucene.util.PriorityQueue.downHeap()                            self 6,932 ms (30.2%), total 6,932 ms
org.apache.lucene.index.MultiTermsEnum.pushTop()                           self 2,666 ms (11.6%), total 11,177 ms
org.apache.lucene.index.MultiTermsEnum$TermMergeQueue.fillTop()            self 2,082 ms (9.1%), total 2,082 ms
org.apache.lucene.store.ByteBufferGuard.getBytes()                         self 957 ms (4.2%), total 957 ms
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.next()            self 616 ms (2.7%), total 616 ms
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTermLeaf()  self 319 ms (1.4%), total 319 ms
org.apache.lucene.util.fst.ByteSequenceOutputs.read()                      self 277 ms (1.2%), total 277 ms


(sorry if that looks bad in the email)

I'm building another index on a higher-end server that can load the full 
index to memory and will retest with that. But note that this index has 
docValues disabled as facet.method=uif seems to only cause trouble if 
docValues are enabled.


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: slow solr facet processing

2017-09-04 Thread Ere Maijala
Yonik Seeley kirjoitti 1.9.2017 klo 17.03:> On Fri, Sep 1, 2017 at 9:17 
AM, Ere Maijala <ere.maij...@helsinki.fi> wrote:

>> I spoke a bit too soon. Now I see why I didn't see any improvement from
>> facet.method=uif before: its performance seems to depend heavily on 
how many
>> facets are returned. With an index of 6 million records and the 
facet having

>> 1960 buckets:
>>
>> facet.limit=20 takes 4ms
>> facet.limit=200 takes ~100ms
>> facet.limit=2000 takes ~1300ms
>>
>> So, for some uses it provides a nice boost, but if you need to fetch 
more

>> than a few top items, it doesn't perform properly.
>
> Another thought on this one:
> If it does slow down more than 4.x when requesting many items, it's 
either

> 1) a bug introduced at some point
> 2) not actually slower, but due to the 6.6 index having more segments
> (ord->string conversion needs to merge multiple term enumerators, so
> more segments == slower)
>
> If you could check #2, that would be great!  If it doesn't seem to be
> the problem, could you open up a new JIRA issue for this?
>
Thanks for the insight, Yonik. I can confirm that #2 is true. I ran an optimize,
and after it completed I was able to retrieve 2000 values in 17ms.

Does this mean we should have a very aggressive merge policy? That's 
something I haven't tweaked, and it's not quite clear to me what would 
be the best way to achieve a consistently low number of segments.
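
For reference, the forced merge itself is just the standard optimize request 
(the collection name is a placeholder), which of course is expensive on a large index:

curl "http://localhost:8983/solr/<collection>/update?optimize=true&maxSegments=1"

As for the merge policy, I suppose the knob would be something like this in the 
indexConfig section of solrconfig.xml, but the numbers below are untested guesses:

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">4</int>
</mergePolicyFactory>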


I encountered one issue with some further testing. I assume this is a 
bug: Trying to use facet.method=uif with a solr.DateRangeField causes 
the following exception:


2017-09-04 12:50:33.246 ERROR (qtp1205044462-18602) [   x:biblio2] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: 
Exception during facet.field: search_daterange_mv
at 
org.apache.solr.request.SimpleFacets.lambda$getFacetFieldCounts$0(SimpleFacets.java:809)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.request.SimpleFacets$3.execute(SimpleFacets.java:742)
at 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:818)
at 
org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:326)
at 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:274)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:304)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)

at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)

at org.eclipse.jetty.server.Server.handle(Server.java:534)
at 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)

at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
or

Re: slow solr facet processing

2017-09-01 Thread Ere Maijala
I spoke a bit too soon. Now I see why I didn't see any improvement from 
facet.method=uif before: its performance seems to depend heavily on how 
many facets are returned. With an index of 6 million records and the 
facet having 1960 buckets:


facet.limit=20 takes 4ms
facet.limit=200 takes ~100ms
facet.limit=2000 takes ~1300ms

So, for some uses it provides a nice boost, but if you need to fetch 
more than a few top items, it doesn't perform properly.


Query used was:

q=*:*=0=true=building=1=2000=true=uif

--Ere

Ere Maijala kirjoitti 1.9.2017 klo 13.10:
I can confirm that we're seeing the same issue as Günter. For a 
collection of 57 million bibliographic records, Solr 4.10.2 (without 
docValues) can consistently return a facet in about 20ms, while Solr 
6.6.0 with docValues takes around 2600ms. I've tested some versions 
between those two too, but I don't have comparable numbers for them.


I thought I had tried all different combinations of 
docValues="true/false" and facet.method=fc/uif/enum, but now that I 
checked it again, it seems that I may have missed a test, as a 6.6.0 
index with docValues="false" and facet.method=uif is markedly faster 
than other methods. At around 700ms it's still nowhere near as fast 
as 4.10.2, but a whole lot better.
disabled for facet.method=uif to have effect though, which is 
unfortunate. Otherwise it reports that applied method is UIF, but the 
performance is actually much worse than with FC. I'll do just another 
round of testing to verify all this. I can report to SOLR-8096 when I 
have something conclusive.


--Ere

Yonik Seeley kirjoitti 31.8.2017 klo 20.04:

A possible improvement for some multiValued fields might be to use the
"uif" facet method (UnInvertedField was the default method for
multiValued fields in 4.x)
I'm not sure if you would need to reindex without docValues on that
field to try it though.

Example: to enable on the "union" field, add f.union.facet.method=uif

Support for this was added in 
https://issues.apache.org/jira/browse/SOLR-8466


-Yonik


On Thu, Aug 31, 2017 at 10:41 AM, Günter Hipler
<guenter.hip...@unibas.ch> wrote:

Hi,

in the meantime I came across the reason for the slow facet processing
capacities of SOLR since version 5.x

  https://issues.apache.org/jira/browse/SOLR-8096
https://issues.apache.org/jira/browse/LUCENE-5666

compared to version 4.x

Various library networks across the world are suffering from the same
symptoms:

Facet processing is one of the most important features of a search 
server

(for us) and it seems (at least IMHO) there is no solution for the issue
since March 2015 (release date for the last SOLR 4 version)

What are the plans / ideas of the solr developers for a possible future
solution? Or maybe there is already a solution I haven't seen so far.

Thanks for a feedback

Günter



On 21.08.2017 15:35, guenterh.li...@bluewin.ch wrote:


Hi,

I can't figure out the reason why the facet processing in version 6 
needs

significantly more time compared to version 4.

The debugging response (for 30 million documents)

solr 4
time: 280.0 ms (query: 0.0 ms, facet: 280.0 ms)
(once the query is cached)
before caching: between 1.5 and 2 sec


solr 6.x (my last try was with 6.6)
without docvalues for facetting fields (same schema as version 4)
time: 5874.0 ms (query: 0.0 ms, facet: 5873.0 ms)
the time is not getting better even after repeating the query several
times


solr 6.6 with docvalues for facetting fields
time: 9837.0 ms (query: 0.0 ms, facet: 9837.0 ms)

used query (our productive system with version 4)

http://search.swissbib.ch/solr/sb-biblio/select?debugQuery=true=*:*=true=union=navAuthor_full=format=language=navSub_green=navSubform=publishDate=edismax=2=arrarr=recip(abs(ms(NOW/DAY,freshness)),3.16e-10,100,100)=*,score=250=0=AND=score+desc=0=START_HILITE=100=END_HILITE=false=title_short^1000+title_alt^200+title_sub^200+title_old^200+title_new^200+author^750+author_additional^100+author_additional_dsv11_txt_mv^100+title_additional_dsv11_txt_mv^100+series^200+topic^500+addfields_txt_mv^50+publplace_txt_mv^25+publplace_dsv11_txt_mv^25+fulltext+callnumber^1000+ctrlnum^1000+publishDate+isbn+variant_isbn_isn_mv+issn+localcode+id=title_short^1000=1=fulltext&=xml=count 




Running the queries on smaller indices (8 million docs) the difference is 
similar, although the absolute figures for processing time are smaller.


Any hints as to why the differences are so huge?

Günter











--
Universität Basel
Universitätsbibliothek
Günter Hipler
Projekt SwissBib
Schoenbeinstrasse 18-20
4056 Basel, Schweiz
Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103
E-Mail guenter.hip...@unibas.ch
URL: www.swissbib.org  / http://www.ub.unibas.ch/





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: slow solr facet processing

2017-09-01 Thread Ere Maijala
I can confirm that we're seeing the same issue as Günter. For a 
collection of 57 million bibliographic records, Solr 4.10.2 (without 
docValues) can consistently return a facet in about 20ms, while Solr 
6.6.0 with docValues takes around 2600ms. I've tested some versions 
between those two too, but I don't have comparable numbers for them.


I thought I had tried all different combinations of 
docValues="true/false" and facet.method=fc/uif/enum, but now that I 
checked it again, it seems that I may have missed a test, as a 6.6.0 
index with docValues="false" and facet.method=uif is markedly faster 
than other methods. At around 700ms it's still nowhere near as fast 
as 4.10.2, but a whole lot better.
disabled for facet.method=uif to have effect though, which is 
unfortunate. Otherwise it reports that applied method is UIF, but the 
performance is actually much worse than with FC. I'll do just another 
round of testing to verify all this. I can report to SOLR-8096 when I 
have something conclusive.


--Ere

Yonik Seeley kirjoitti 31.8.2017 klo 20.04:

A possible improvement for some multiValued fields might be to use the
"uif" facet method (UnInvertedField was the default method for
multiValued fields in 4.x)
I'm not sure if you would need to reindex without docValues on that
field to try it though.

Example: to enable on the "union" field, add f.union.facet.method=uif

Support for this was added in https://issues.apache.org/jira/browse/SOLR-8466

-Yonik


On Thu, Aug 31, 2017 at 10:41 AM, Günter Hipler
<guenter.hip...@unibas.ch> wrote:

Hi,

in the meantime I came across the reason for the slow facet processing
capacities of SOLR since version 5.x

  https://issues.apache.org/jira/browse/SOLR-8096
https://issues.apache.org/jira/browse/LUCENE-5666

compared to version 4.x

Various library networks across the world are suffering from the same
symptoms:

Facet processing is one of the most important features of a search server
(for us) and it seems (at least IMHO) there is no solution for the issue
since March 2015 (release date for the last SOLR 4 version)

What are the plans / ideas of the solr developers for a possible future
solution? Or maybe there is already a solution I haven't seen so far.

Thanks for a feedback

Günter



On 21.08.2017 15:35, guenterh.li...@bluewin.ch wrote:


Hi,

I can't figure out the reason why the facet processing in version 6 needs
significantly more time compared to version 4.

The debugging response (for 30 million documents)

solr 4
time: 280.0 ms (query: 0.0 ms, facet: 280.0 ms)
(once the query is cached)
before caching: between 1.5 and 2 sec


solr 6.x (my last try was with 6.6)
without docvalues for facetting fields (same schema as version 4)
time: 5874.0 ms (query: 0.0 ms, facet: 5873.0 ms)
the time is not getting better even after repeating the query several
times


solr 6.6 with docvalues for facetting fields
time: 9837.0 ms (query: 0.0 ms, facet: 9837.0 ms)

used query (our productive system with version 4)

http://search.swissbib.ch/solr/sb-biblio/select?debugQuery=true=*:*=true=union=navAuthor_full=format=language=navSub_green=navSubform=publishDate=edismax=2=arrarr=recip(abs(ms(NOW/DAY,freshness)),3.16e-10,100,100)=*,score=250=0=AND=score+desc=0=START_HILITE=100=END_HILITE=false=title_short^1000+title_alt^200+title_sub^200+title_old^200+title_new^200+author^750+author_additional^100+author_additional_dsv11_txt_mv^100+title_additional_dsv11_txt_mv^100+series^200+topic^500+addfields_txt_mv^50+publplace_txt_mv^25+publplace_dsv11_txt_mv^25+fulltext+callnumber^1000+ctrlnum^1000+publishDate+isbn+variant_isbn_isn_mv+issn+localcode+id=title_short^1000=1=fulltext&=xml=count


Running the queries on smaller indices (8 million docs) the difference is
similar, although the absolute figures for processing time are smaller.


Any hints as to why the differences are so huge?

Günter











--
Universität Basel
Universitätsbibliothek
Günter Hipler
Projekt SwissBib
Schoenbeinstrasse 18-20
4056 Basel, Schweiz
Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103
E-Mail guenter.hip...@unibas.ch
URL: www.swissbib.org  / http://www.ub.unibas.ch/



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Solr Prod Issue | KeeperErrorCode = ConnectionLoss for /overseer_elect/leader

2017-07-05 Thread Ere Maijala
From the fact that someone has tried to access the /etc/passwd file via 
your Solr (see all those WARN messages), it seems you have it exposed to 
the world, unless of course it's a security scanner you use internally. 
The Internet is a hostile place, and the very first thing I would do is 
shield Solr from external traffic. Even if it's your own security 
scanning, I wouldn't run it until you have the system stable.


Doing the above you'll reduce noise in the logs and might be able to 
better identify the issue.


Losing the Zookeeper connection is typically a Java garbage collection 
issue. If GC causes too long pauses, the connection may time out. So I 
would recommend you start by reading 
https://wiki.apache.org/solr/SolrPerformanceProblems and 
https://wiki.apache.org/solr/ShawnHeisey. Also make sure that 
Zookeeper's Java settings are good.


--Ere

Bhalla, Rahat kirjoitti 5.7.2017 klo 11.05:

Hi

I’m not sure if any of you have had a chance to see this email yet.

We had a reoccurrence of the Issue Today, and I’m attaching the Logs 
from today as well inline below.


Please let me know if any of you have seen this issue before as this 
would really help me to get to the root of the problem to fix it. I’m a 
little lost here and not entirely sure what to do.


Thanks,

Rahat Bhalla

8696248 [qtp778720569-28] [ WARN] 2017-07-04 01:40:20 
(HttpParser.java:parseNext:1391) - parse exception: 
java.lang.IllegalArgumentException: No Authority for 
HttpChannelOverHttp@30a86e14{r=0,c=false,a=IDLE,uri=null}


java.lang.IllegalArgumentException: No Authority

 at 
org.eclipse.jetty.http.HostPortHttpField.(HostPortHttpField.java:43)


 at 
org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877)


 at 
org.eclipse.jetty.http.HttpParser.parseHeaders(HttpParser.java:1050)


 at 
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266)


 at 
org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:344)


 at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:227)


 at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)


 at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)

 at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)


 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)


 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)


 at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)


 at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)


 at java.lang.Thread.run(Unknown Source)

8697308 [qtp778720569-21] [ WARN] 2017-07-04 01:40:21 
(HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Bad URI for 
HttpChannelOverHttp@1276{r=16,c=false,a=IDLE,uri=/../../../../../../../../../../etc/passwd}


8697338 [qtp778720569-29] [ WARN] 2017-07-04 01:40:21 
(HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 No Host for 
HttpChannelOverHttp@50a994ce{r=29,c=false,a=IDLE,uri=null}


8697388 [qtp778720569-21] [ WARN] 2017-07-04 01:40:22 
(HttpParser.java:parseNext:1364) - bad HTTP parsed: 400 Bad URI for 
HttpChannelOverHttp@19a624ec{r=1,c=false,a=IDLE,uri=//prod-solr-node01.healthplan.com:9080/solr/admin/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd}


8697401 [qtp778720569-27] [ WARN] 2017-07-04 01:40:22 
(URIUtil.java:decodePath:348) - 
/solr/admin/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/%c0%ae%c0%ae/etc/passwd 
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! 
byte C0 in state 0


8697444 [qtp778720569-25] [ WARN] 2017-07-04 01:40:22 
(URIUtil.java:decodePath:348) - 
/solr/admin/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/%e0%80%ae%e0%80%ae/etc/passwd 
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! 
byte 80 in state 4


8697475 [qtp778720569-26] [ WARN] 2017-07-04 01:40:22 
(URIUtil.java:decodePath:348) - 
/solr/admin/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/%f0%80%80%ae%f0%80%80%ae/etc/passwd 
org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! 
byte 80 in state 6


8697500 [qtp778720569-29] [ WARN] 2017-07-04 01:40:22 
(URIUtil.java:decodePath:348) - 

Re: Grouping and group.facet performance disaster

2017-05-31 Thread Ere Maijala
While I can't say whether it affects you in this case, Solr 6.4.1 has 
serious performance issues. I'd suggest upgrading to at least 6.4.2.


--Ere

31.5.2017, 14.16, Marek Tichy kirjoitti:

Hi,

I'm getting very slow response times on grouping, especially on facet
grouping.

Without grouping, the query takes 14ms, faceting 57ms.

With grouping, the query time goes up to 1131ms, with facet grouping,
the faceting goes up to the unbearable 12103 ms.

Single solr instance, 927086docs, 518.23 MB size, solr 6.4.1.

Is this really the price of grouping ? Are there any magic
tricks/tips/techniques to improve the speed ?
The query params below.

Many thanks for any help, much appreciated.

Best
 Marek Tichy








fq=((type:knihy) OR (type:defekty))
fl=*
start=0
f.ebook_formats.facet.mincount=1
f.authorid.facet.mincount=1
f.thematicgroupid.facet.mincount=1
f.articleparts.facet.mincount=1
f.type.facet.mincount=1
f.languageid.facet.mincount=1
f.showwindow.facet.mincount=1
f.articletypeid_grouped.facet.mincount=1
f.languageid.facet.limit=10
f.ebook_formats.facet.limit=10
f.authorid.facet.limit=10
f.type.facet.limit=10
f.articleparts.facet.limit=10
f.thematicgroupid.facet.limit=10
f.articletypeid_grouped.facet.limit=10
f.showwindow.facet.limit=100
version=2.2
group.limit=30
rows=30
echoParams=all
sort=date desc,planneddate asc
group.field=edition
facet.method=enum
group.truncate=false
group.format=grouped
group=true
group.ngroups=true
stats=true
facet=true
group.facet=true
stats.field={!distinctValues=true}categoryid
facet.field={!ex=at}articletypeid_grouped
facet.field={!ex=at}type
facet.field={!ex=author}authorid
facet.field={!ex=format}articleparts
facet.field={!ex=format}ebook_formats
facet.field={!ex=lang}languageid
facet.field={!ex=sw}showwindow
facet.field={!ex=tema}thematicgroupid
stats.field={!min=true max=true}price
stats.field={!min=true max=true}yearout



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Collection backup fails with java.nio.file.NoSuchFileException

2017-05-03 Thread Ere Maijala
I'm running a three-node SolrCloud (tested with versions 6.4.2 and 
6.5.0) with 1 shard and 3 replicas, and I'm having trouble getting the 
collection backup API to actually do the job. This is the request I use 
to initiate the backup:


http://localhost:8983/solr/admin/collections?action=BACKUP=biblioprod=biblio1=/data/backup/solr

The result is always something like this:



status: 500, QTime: 840
[servername]:8983_solr: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: 
Error from server at http://[servername]:8983/solr: Failed to backup 
core=biblio1_shard1_replica3 because java.nio.file.NoSuchFileException: 
/data/solr1-1/vufind/biblio1_shard1_replica3/data/index.20170424115540663/segments_y1
Operation backup caused exception: org.apache.solr.common.SolrException: Could not backup all replicas
msg: Could not backup all replicas
rspCode: 500
error-class: org.apache.solr.common.SolrException
root-error-class: org.apache.solr.common.SolrException
trace: org.apache.solr.common.SolrException: Could not backup all replicas
	at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:287)
[snip stack]
500


The index seems to be working fine. I've also tested optimizing it first 
(shouldn't need to do that, right?), but it always fails with 
segments_[something] missing. Also going to Solr Admin UI, selecting the 
collection and clicking e.g. Schema in left hand menu causes the 
following to be written into solr.log:


2017-04-27 11:19:50.724 WARN  (qtp1205044462-1590) [c:biblio1 s:shard1 
r:core_node1 x:biblio1_shard1_replica3] o.a.s.h.a.LukeRequestHandler 
Error getting file length for [segments_y1]
java.nio.file.NoSuchFileException: 
/data/solr1-1/vufind/biblio1_shard1_replica3/data/index.20170424115540663/segments_y1
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at 
sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
at 
sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)

at java.nio.file.Files.readAttributes(Files.java:1737)
at java.nio.file.Files.size(Files.java:2332)
at 
org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
at 
org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128)
at 
org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:614)
at 
org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:587)
at 
org.apache.solr.handler.admin.LukeRequestHandler.handleRequestBody(LukeRequestHandler.java:138)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)

[snip]

Looks like the backup only works intermittently after the server has 
been sitting for a while without any update requests being made.


Any ideas?

Regards,
Ere


Re: Performance degradation after upgrading from 6.2.1 to 6.4.1

2017-02-14 Thread Ere Maijala

It might be <https://issues.apache.org/jira/browse/SOLR-10130>.

--Ere

14.2.2017, 11.52, Henrik Brautaset Aronsen kirjoitti:

We are seeing performance degradation on our SolrCloud instances after
upgrading to 6.4.1.


Here are a couple of graphs.  As you can see, 6.4.1 was introduced 2/10
1200:


https://www.dropbox.com/s/qrc0wodain50azz/solr1.png?dl=0

https://www.dropbox.com/s/sdk30imm8jlomz2/solr2.png?dl=0


These are two very different usage scenarios:


* Solr1 has constant updates and very volatile data (30 minutes TTL, 20
shards with no replicas, across 8 servers).  Requests in the 99 percentile
went from ~400ms to 1000-1500ms. (Hystrix cutoff at 1.5s)


* Solr2 is a more traditional instance with long-lived data (updated once a
day, 24 shards with 2 replicas, across 8 servers).  Requests in the 99
percentile went from ~400ms to at least 1s. (Hystrix cutoff at 1s)


I've been looking around, but cannot really find a reason for the
performance degradation.  Does any of you have an idea?


Cheers,

Henrik



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Heads up: SOLR-10130, Performance issue in Solr 6.4.1

2017-02-13 Thread Ere Maijala

Hi all,

this is just a quick heads-up that we've stumbled on serious performance 
issues after upgrading to Solr 6.4.1 apparently due to the new metrics 
collection causing a major slowdown. I've filed an issue 
(https://issues.apache.org/jira/browse/SOLR-10130) about it, but decided 
to post this just so that anyone else doesn't need to encounter this 
unprepared. It seems to me that metrics would need to be explicitly 
disabled altogether in the index config to avoid the issue.


--Ere


Re: Removing duplicate terms from query

2017-02-10 Thread Ere Maijala
Thanks for the insight. You're right, of course, regarding the score 
calculation. I'll think about it. There are certain cases where the 
search is obviously bad to a human and could be cleaned up, but it's not 
easy to write rules for that.


--Ere

9.2.2017, 18.37, Walter Underwood kirjoitti:

1. I don’t think this is a good idea. It means that a search for “hey hey hey” 
won’t score that document higher.

2. Maybe you want to change how tf is calculated. Ignore multiple occurrences 
of a word.

I ran into this with the movie title “New York, New York” at Netflix. It isn’t 
twice as much about New York, but it needs to be the best match for the query 
“new york new york”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Feb 9, 2017, at 5:18 AM, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Thanks Emir.

I was thinking of something very simple like doing what 
RemoveDuplicatesTokenFilter does but ignoring positions. It would of course 
still be possible to have the same term multiple times, but at least the 
adjacent ones could be deduplicated. The reason I'm not too eager to do it in a 
query preprocessor is that I'd have to essentially duplicate functionality of 
the query analysis chain that contains ICUTokenizerFactory, 
WordDelimiterFilterFactory and whatnot.

Regards,
Ere

9.2.2017, 14.52, Emir Arnautovic kirjoitti:

Hi Ere,

I don't think that there is such a filter. Implementing one would
require looking backward, which violates the streaming approach of token
filters and leads to unpredictable memory usage.

I would do it as part of query preprocessor and not necessarily as part
of Solr.

HTH,
Emir


On 09.02.2017 12:24, Ere Maijala wrote:

Hi,

I just noticed that while we use RemoveDuplicatesTokenFilter during
query time, it will consider term positions and not really do anything
e.g. if query is 'term term term'. As far as I can see the term
positions make no difference in a simple non-phrase search. Is there a
built-in way to deal with this? I know I can write a filter to do
this, but I feel like this would be something quite basic to do for
the query. And I don't think it's even anything too weird for normal
users to do. Just consider e.g. searching for music by title:

Hey, hey, hey ; Shivers of pleasure

I also verified that at least according to debugQuery=true and
anecdotal evidence the search really slows down if you repeat the same
term enough.

--Ere




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Removing duplicate terms from query

2017-02-09 Thread Ere Maijala

Thanks Emir.

I was thinking of something very simple like doing what 
RemoveDuplicatesTokenFilter does but ignoring positions. It would of 
course still be possible to have the same term multiple times, but at 
least the adjacent ones could be deduplicated. The reason I'm not too 
eager to do it in a query preprocessor is that I'd have to essentially 
duplicate functionality of the query analysis chain that contains 
ICUTokenizerFactory, WordDelimiterFilterFactory and whatnot.


Regards,
Ere

9.2.2017, 14.52, Emir Arnautovic kirjoitti:

Hi Ere,

I don't think that there is such a filter. Implementing one would
require looking backward, which violates the streaming approach of token
filters and leads to unpredictable memory usage.

I would do it as part of query preprocessor and not necessarily as part
of Solr.

HTH,
Emir


On 09.02.2017 12:24, Ere Maijala wrote:

Hi,

I just noticed that while we use RemoveDuplicatesTokenFilter during
query time, it will consider term positions and not really do anything
e.g. if query is 'term term term'. As far as I can see the term
positions make no difference in a simple non-phrase search. Is there a
built-in way to deal with this? I know I can write a filter to do
this, but I feel like this would be something quite basic to do for
the query. And I don't think it's even anything too weird for normal
users to do. Just consider e.g. searching for music by title:

Hey, hey, hey ; Shivers of pleasure

I also verified that at least according to debugQuery=true and
anecdotal evidence the search really slows down if you repeat the same
term enough.

--Ere




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Removing duplicate terms from query

2017-02-09 Thread Ere Maijala

Hi,

I just noticed that while we use RemoveDuplicatesTokenFilter during 
query time, it will consider term positions and not really do anything 
e.g. if query is 'term term term'. As far as I can see the term 
positions make no difference in a simple non-phrase search. Is there a 
built-in way to deal with this? I know I can write a filter to do this, 
but I feel like this would be something quite basic to do for the query. 
And I don't think it's even anything too weird for normal users to do. 
Just consider e.g. searching for music by title:


Hey, hey, hey ; Shivers of pleasure

I also verified that at least according to debugQuery=true and anecdotal 
evidence the search really slows down if you repeat the same term enough.


--Ere


Solr 6.4.0 and deprecated SynonymFilterFactory

2017-02-02 Thread Ere Maijala

Hi,

on startup Solr 6.4.0 logs the following warning:

o.a.s.c.SolrResourceLoader Solr loaded a deprecated plugin/analysis 
class [solr.SynonymFilterFactory]. Please consult documentation how to 
replace it accordingly.


What documentation? As far as I can see, there's nothing at 
 
or
 
nor did a quick Google search come up with anything definitive.


Am I looking in the wrong places or does the mentioned documentation 
exist at all?


--Ere


Re: MLT Java example for Solr 6.3

2016-12-27 Thread Ere Maijala
Just a note that field boosting with the MLT Query Parser is broken, and 
for SolrCloud the whole thing is practically unusable if you index stuff 
in English because CloudMLTQParser includes strings from field 
definitions (such as "stored" and "indexed") in the query. I'm still 
hoping someone will review 
https://issues.apache.org/jira/browse/SOLR-9644, which contains a fix, 
at some point..


--Ere

24.12.2016, 1.26, Anshum Gupta kirjoitti:

Hi Todd,

You can query for similar documents using the MLT Query Parser. The code
would look something like:

// Assuming you want to use CloudSolrClient
CloudSolrClient client = new CloudSolrClient.Builder()
.withZkHost(zkHost)
.build();
client.setDefaultCollection(COLLECTION_NAME);
QueryResponse queryResponse = client.query(new SolrQuery("{!mlt
qf=foo}docId"));

Notice the *docId*, *qf*, and the *!mlt* part.
docId - External document ID/unique ID of the document you want to query for
qf - fields that you want to use for similarity (you can read more about it
here:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-MoreLikeThisQueryParser
)
!mlt - the query parser you want to use.


On Thu, Dec 22, 2016 at 3:01 PM <todd_peter...@mgtsciences.com> wrote:


I am having trouble locating a decent example for using the MLT Java API
in Solr 6.3. What I want is to retrieve document IDs that are similar to a
given document ID.

Todd Peterson
Chief Embedded Systems Engineer
Management Sciences, Inc.
6022 Constitution Ave NE
Albuquerque, NM 87144
505-255-8611 <(505)%20255-8611> (office)
505-205-7057 <(505)%20205-7057> (cell)




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Soft commit and reading data just after the commit

2016-12-19 Thread Ere Maijala

Hi,

so, the app already has a database connection because it updates the 
READ flag when the user clicks an entry, right? If you only need the 
flag for display purposes, it sounds like it would make sense to also 
fetch it directly from the database when displaying the listing. Of 
course if you also need to search for READ/UNREAD you need to index the 
change, but perhaps you could get away with it taking longer.


--Ere

20.12.2016, 4.12, Lasitha Wattaladeniya kirjoitti:

Hi Shawn,

Thanks for your well detailed explanation. Now I understand, I won't be
able to achieve the 100ms softcommit timeout with my hardware setup.
However let's say someone has a requirement as below (quoted from my
previous mail)

*Requirement *is,  we are showing a list of entries on a page. For each
user there's a read / unread flag.  The data for listing is fetched from
solr. And you can see the entry was previously read or not. So when a user
views an entry by clicking.  We are updating the database flag to READ and
use real time indexing to update solr index.  So when the user close the
full view of the entry and go back to entry listing page,  the data fetched
from solr should be updated to READ.

Can't we achieve a requirement as described above using solr ? (without
manipulating the previously fetched results list from solr, because at some
point we'll have to go back to search results from solr and at that time it
should be updated).

Regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com

On Mon, Dec 19, 2016 at 6:37 PM, Shawn Heisey <apa...@elyograg.org> wrote:


On 12/18/2016 7:09 PM, Lasitha Wattaladeniya wrote:

@eric : thanks for the lengthy reply. So let's say I increase the
autoSoftCommit timeout to maybe 100 ms. In that case, do I have to
wait about that long on the client side before calling search? What's
the correct way of achieving this?


Some of the following is covered by the links you've already received.
Some of it may be new information.

Before you can see a change you've just made, you will need to wait for
the commit to be fired (in this case, the autoSoftCommit time) plus
however long it actually takes to complete the commit and open a new
searcher.  Opening the searcher is the expensive part.

What I typically recommend that people do is have the autoSoftCommit
time as long as they can stand, with 60-300 seconds as a "typical"
value.  That's a setting of 60000 to 300000 milliseconds.  What you are
trying to achieve is much faster, and much more difficult.

100 milliseconds will typically be far too small a value unless your
index is extremely small or your hardware is incredibly fast and has a
lot of memory.  With a value of 100, you'll want each of those soft
commits (which do open a new searcher) to take FAR less than 100
milliseconds to complete.  This kind of speed can be difficult to
achieve, especially if the index is large.

To have any hope of fast commit times, you will need to set
autowarmCount on all Solr caches to zero.  If you are indexing
frequently enough, you might even want to completely disable Solr's
internal caches, because they may be providing no benefit.
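
As a rough solrconfig.xml sketch of the above (the values are only examples and need
to be adapted to the actual setup):

<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>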

You will want to have enough extra memory that your operating system can
cache the vast majority (or even maybe all) of your index.

https://wiki.apache.org/solr/SolrPerformanceProblems

Some other info that's helpful for understanding why plenty of *spare*
memory (not allocated by programs) is necessary for good performance:

https://en.wikipedia.org/wiki/Page_cache
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

The reason in a nutshell:  Disks are EXTREMELY slow.  Memory is very fast.

Thanks,
Shawn






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Very long young generation stop the world GC pause

2016-12-09 Thread Ere Maijala
Then again, if the load characteristics on the Solr instance differ e.g. 
by time of day, G1GC, in my experience, may have trouble adapting. For 
instance if your query load reduces drastically during the night, it may 
take a while for G1GC to catch up in the morning. What I've found useful 
from experience, and your mileage will probably vary, is to limit the 
young generation size with a large heap. With Xmx31G something like 
these could work:


-XX:+UnlockExperimentalVMOptions \
-XX:G1MaxNewSizePercent=5 \

The aim here is to only limit the maximum and still allow some adaptation.

--Ere

8.12.2016, 16.07, Pushkar Raste kirjoitti:

Disable all the G1GC tuning you are doing except for ParallelRefProcEnabled.

G1GC is an adaptive algorithm and would keep tuning to reach the default
pause goal of 250ms which should be good for most of the applications.

Can you also tell us how much RAM you have on your machine and if you have
swap enabled and being used?

On Dec 8, 2016 8:53 AM, "forest_soup"  wrote:


Besides, will those JVM options make it better?
-XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=10



--
View this message in context: http://lucene.472066.n3.
nabble.com/Very-long-young-generation-stop-the-world-GC-
pause-tp4308911p4308937.html
Sent from the Solr - User mailing list archive at Nabble.com.





How to attract attention to a patch?

2016-11-10 Thread Ere Maijala
I've posted a patch to fix core functionality in Solr MLT parsers (see 
https://issues.apache.org/jira/browse/SOLR-9644), but neither it nor the 
associated pull request on GitHub seems to be getting any attention. The 
HowToContribute page says that "If no one responds to your patch after a 
few days, please make friendly reminders." but doesn't tell how to make 
a friendly reminder. Could someone point me in the right direction? Thanks!


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Per-query boosts in MLT

2016-10-17 Thread Ere Maijala

Hi,

this comes quite late, but since we really need this too, I've now 
started to work on this and I believe I have proper fixes in a pull 
request. See https://issues.apache.org/jira/browse/SOLR-9644 for details.


--Ere

9.6.2016, 16.09, Marc Burt kirjoitti:

Hi,

Is it possible to assign boosts to the MLT similarity fields instead of
the defaults set in the config when making a MLT query?
I'm currently using a query parser and attempting /select?q={!mlt
qf=foo^10,bar^20,upc^50}/id /etc but it's taking the boost to be part of
the field name.



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: qf boosts with MoreLikeThis query parser

2016-10-14 Thread Ere Maijala
I've now attached a proposed patch to a pre-existing issue 
https://issues.apache.org/jira/browse/SOLR-9267.


--Ere

13.10.2016, 2.19, Ere Maijala kirjoitti:

Answering to myself.. I did some digging and found out that boosts work
if qf is repeated in the local params, at least in Solr 6.2, like this:

{!mlt qf=title^100 qf=author^50}recordid

However, it doesn't work properly with CloudMLTQParser used in SolrCloud
mode. I'm working on a proposed fix for this and will post a Jira issue
with a patch when done. There appears to be another problem with
CloudMLTQParser too where it includes extraneous terms in the final
query, and I'll take a stab at fixing that too.

--Ere

1.8.2016, 9.12, Ere Maijala kirjoitti:

Hi All,

I, too, would like to know the answer to these questions. I saw a
similar question by Nikaash Puri on 22 June with subject "help with
moreLikeThis" go unanswered. Any insight?

Regards,
Ere

11.7.2016, 18.32, Demian Katz kirjoitti:

Hello,

I am currently using field-specific boosts in the qf setting of the
MoreLikeThis request handler:

https://github.com/vufind-org/vufind/blob/master/solr/vufind/biblio/conf/solrconfig.xml#L410



I would like to accomplish the same effect using the MoreLikeThis
query parser, so that I can take advantage of such benefits as
sharding support.

I am currently using Solr 5.5.0, and in spite of trying many
syntactical variations, I can't seem to get it to work. Some
discussion on this JIRA ticket seems to suggest there may have been
some problems caused by parsing limitations:

https://issues.apache.org/jira/browse/SOLR-7143

However, I think my work on this ticket should have eliminated those
limitations:

https://issues.apache.org/jira/browse/SOLR-2798

Anyway, this brings up a few questions:


1.)Is field-specific boosting in qf supported by the MLT query
parser, and if so, what syntax should I use?

2.)If this functionality is supported, but not in v5.5.0,
approximately when was it fixed?

3.)If the functionality is still not working, would it be worth my
time to try to fix it, or is it being excluded for a specific reason?

Any and all insight is appreciated. Apologies if the answers are
already out there somewhere, but I wasn't able to find them!

thanks,
Demian





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: qf boosts with MoreLikeThis query parser

2016-10-12 Thread Ere Maijala
Answering to myself.. I did some digging and found out that boosts work 
if qf is repeated in the local params, at least in Solr 6.2, like this:


{!mlt qf=title^100 qf=author^50}recordid

However, it doesn't work properly with CloudMLTQParser used in SolrCloud 
mode. I'm working on a proposed fix for this and will post a Jira issue 
with a patch when done. There appears to be another problem with 
CloudMLTQParser too where it includes extraneous terms in the final 
query, and I'll take a stab at fixing that too.


--Ere

1.8.2016, 9.12, Ere Maijala kirjoitti:

Hi All,

I, too, would like to know the answer to these questions. I saw a
similar question by Nikaash Puri on 22 June with subject "help with
moreLikeThis" go unanswered. Any insight?

Regards,
Ere

11.7.2016, 18.32, Demian Katz kirjoitti:

Hello,

I am currently using field-specific boosts in the qf setting of the
MoreLikeThis request handler:

https://github.com/vufind-org/vufind/blob/master/solr/vufind/biblio/conf/solrconfig.xml#L410


I would like to accomplish the same effect using the MoreLikeThis
query parser, so that I can take advantage of such benefits as
sharding support.

I am currently using Solr 5.5.0, and in spite of trying many
syntactical variations, I can't seem to get it to work. Some
discussion on this JIRA ticket seems to suggest there may have been
some problems caused by parsing limitations:

https://issues.apache.org/jira/browse/SOLR-7143

However, I think my work on this ticket should have eliminated those
limitations:

https://issues.apache.org/jira/browse/SOLR-2798

Anyway, this brings up a few questions:


1.)Is field-specific boosting in qf supported by the MLT query
parser, and if so, what syntax should I use?

2.)If this functionality is supported, but not in v5.5.0,
approximately when was it fixed?

3.)If the functionality is still not working, would it be worth my
time to try to fix it, or is it being excluded for a specific reason?

Any and all insight is appreciated. Apologies if the answers are
already out there somewhere, but I wasn't able to find them!

thanks,
Demian



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: qf boosts with MoreLikeThis query parser

2016-08-01 Thread Ere Maijala

Hi All,

I, too, would like to know the answer to these questions. I saw a 
similar question by Nikaash Puri on 22 June with subject "help with 
moreLikeThis" go unanswered. Any insight?


Regards,
Ere

11.7.2016, 18.32, Demian Katz kirjoitti:

Hello,

I am currently using field-specific boosts in the qf setting of the 
MoreLikeThis request handler:

https://github.com/vufind-org/vufind/blob/master/solr/vufind/biblio/conf/solrconfig.xml#L410

I would like to accomplish the same effect using the MoreLikeThis query parser, 
so that I can take advantage of such benefits as sharding support.

I am currently using Solr 5.5.0, and in spite of trying many syntactical 
variations, I can't seem to get it to work. Some discussion on this JIRA ticket 
seems to suggest there may have been some problems caused by parsing 
limitations:

https://issues.apache.org/jira/browse/SOLR-7143

However, I think my work on this ticket should have eliminated those 
limitations:

https://issues.apache.org/jira/browse/SOLR-2798

Anyway, this brings up a few questions:


1.)Is field-specific boosting in qf supported by the MLT query parser, and 
if so, what syntax should I use?

2.)If this functionality is supported, but not in v5.5.0, approximately 
when was it fixed?

3.)If the functionality is still not working, would it be worth my time to 
try to fix it, or is it being excluded for a specific reason?

Any and all insight is appreciated. Apologies if the answers are already out 
there somewhere, but I wasn't able to find them!

thanks,
Demian



Re: Error when searching with special characters

2016-07-01 Thread Ere Maijala
You need to make sure you encode things properly in the URL. You can't 
just place an ampersand there because it's the parameter delimiter in a 
URL. If you're unsure, use e.g. http://meyerweb.com/eric/tools/dencoder/ 
to encode your search terms. You'll see that "r&d" will become 
%22r%26d%22. Escaping the ampersand for Solr is another thing. If that's 
needed, you'll need to URL encode "r\&d" so that it becomes %22r%5C%26d%22.
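
For illustration, the properly encoded version of the request would look something
like this (same collection and handler as in your examples):

http://localhost:8983/solr/collection1/highlight?q=%22r%26d%22&debug=true&defType=edismax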


--Ere

1.7.2016, 7.13, Zheng Lin Edwin Yeo kirjoitti:

Hi,

When I use defType=edismax with debug mode enabled (debug=true), I
found that the search for "r&d" actually searches on just the
character "r".

http://localhost:8983/solr/collection1/highlight?q=
"r"=true=edismax

  "debug":{
"rawquerystring":"\"r",
"querystring":"\"r",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)"


Even if I search with escape character, it is of no help.

http://localhost:8983/solr/collection1/highlight?q=
"r\"=true=edismax

  "debug":{
"rawquerystring":"\"r\\",
"querystring":"\"r\\",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)",



But if I'm using other symbols like "r*d", then the search is ok.

http://localhost:8983/solr/collection1/highlight?q="r*d"&debug=true&defType=edismax

  "debug":{
"rawquerystring":"\"r*d\"",
"querystring":"\"r*d\"",
"parsedquery":"(+DisjunctionMaxQuery((text:\"r d\")))/no_coord",
"parsedquery_toString":"+(text:\"r d\")",


What could be the reason behind this?


Regards,
Edwin


On 20 June 2016 at 02:12, Ahmet Arslan <iori...@yahoo.com> wrote:


Hi,

It is better to create a failing junit test case before opening jira.

ahmet


On Sunday, June 19, 2016 4:44 PM, Zheng Lin Edwin Yeo <
edwinye...@gmail.com> wrote:


Yes, it throws the parse exception even if the query is properly escaped
for ampersand (&) for defType=lucene.

Should we treat this as a bug, and create a JIRA issue?

Regards,
Edwin



On 19 June 2016 at 08:07, Ahmet Arslan <iori...@yahoo.com> wrote:



If properly escaped ampersand throws parse exception, this could be a bug.



On Saturday, June 18, 2016 7:12 PM, Zheng Lin Edwin Yeo <
edwinye...@gmail.com> wrote:
Hi,

It does not work with the back slash too.

But I found that it does not work for defType=lucene.
It will work if the defType=dismax or edismax.

What could be the reason that it did not work with the default
defType=lucene?

Regards,
Edwin



On 18 June 2016 at 01:04, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:


Hi,

May be URL encoding issue?
By the way, I would use back slash to escape special characters.

Ahmet

On Friday, June 17, 2016 10:08 AM, Zheng Lin Edwin Yeo <
edwinye...@gmail.com> wrote:



Hi,

I encountered this error when I tried to search with special characters,
like "&" and "#".

{
  "responseHeader":{
"status":400,
"QTime":0},
  "error":{
"msg":"org.apache.solr.search.SyntaxError: Cannot parse
'\"Research ': Lexical error at line 1, column 11.  Encountered: 
after : \"\\\"Research \"",
"code":400}}


I have done the search by putting inverted commas, like: q="Research &
Development"

What could be the issue here?

I'm facing this problem in both Solr 5.4.0 and Solr 6.0.1.


Regards,
Edwin











--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: error rendering solr spatial in geoserver

2016-06-30 Thread Ere Maijala
It would have been _really_ nice if this had been in the release notes. 
It made me scratch my head for a while, too, when upgrading to Solr 6. 
Additionally, this makes a rolling upgrade from Solr 5.x a bit more 
scary, since you have to update the collection schema to make the Solr 6 
nodes work while making sure that no Solr 5 node reloads the configuration.
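
For reference, a rough sketch of pushing the updated config set to 
ZooKeeper before the Solr 6 nodes come up (ZK host, paths and config name 
are illustrative):

server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 \
  -cmd upconfig -confdir /path/to/biblio/conf -confname biblio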


--Ere

30.6.2016, 3.46, David Smiley wrote:

For polygons in 6.0 you need to set
spatialContextFactory="org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
-- see
https://cwiki.apache.org/confluence/display/solr/Spatial+Search and the
example.  And of course as you probably already know, put the JTS jar on
Solr's classpath.  What likely tripped you up between 5x and 6x is the
change in value of the spatialContextFactory as a result in organizational
package moving "com.spatial4j.core" to "org.locationtech.spatial4j".

On Wed, Jun 29, 2016 at 12:44 PM tkg_cangkul <yuza.ras...@gmail.com> wrote:


Hi Erick, thanks for your reply.

I've solved this problem.
I got this error when I used Solr 6.0.0,
so I downgraded to version 5.5.0 and it was successful.


On 29/06/16 22:39, Erick Erickson wrote:

There is not nearly enough information here to say anything very helpful.
What does your schema look like for this field?
What does the input look like?
How are you pulling data from geoserver?

You might want to review:
http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

On Wed, Jun 29, 2016 at 2:31 AM, tkg_cangkul <yuza.ras...@gmail.com
<mailto:yuza.ras...@gmail.com>> wrote:

Hi, I'm trying to load spatial data from Solr with GeoServer.
When I try to show the layer preview I get this error message.

[error screenshot not included in the archive]

Can anybody help me please?




--

Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Long STW GCs with Solr Cloud

2016-06-17 Thread Ere Maijala

17.6.2016, 11.05, Bernd Fehling wrote:



On 17.06.2016 at 09:06, Ere Maijala wrote:

16.6.2016, 1.41, Shawn Heisey wrote:

If you want to continue avoiding G1, you should definitely be using
CMS.  My recommendation right now would be to try the G1 settings on my
wiki page under the heading "Current experiments" or the CMS settings
just below that.


For what it's worth, we're currently running Shawn's G1 settings slightly 
modified for our workload on Java 1.8.0_91 25.91-b14:

GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=16m \
-XX:MaxGCPauseMillis=200 \
-XX:+UnlockExperimentalVMOptions \
-XX:G1NewSizePercent=3 \
-XX:ParallelGCThreads=12 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"


-XX:G1NewSizePercent

... Sets the percentage of the heap to use as the minimum for the young 
generation size.
The default value is 5 percent of your Java heap. ...

So you are reducing the young heap generation size to get a smoother running 
system.
This is strange, like reducing the bottle below the bottleneck.


True, but it works. Perhaps that's due to the default being too much 
with our heap size (> 10 GB). In any case, these settings allow us to 
run with average pause of <150ms and max pause of <2s while we 
previously struggled with pauses exceeding 20s at worst. All this was 
inspired by 
https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase.


Regards,
Ere


Re: Long STW GCs with Solr Cloud

2016-06-17 Thread Ere Maijala

16.6.2016, 1.41, Shawn Heisey wrote:

If you want to continue avoiding G1, you should definitely be using
CMS.  My recommendation right now would be to try the G1 settings on my
wiki page under the heading "Current experiments" or the CMS settings
just below that.


For what it's worth, we're currently running Shawn's G1 settings 
slightly modified for our workload on Java 1.8.0_91 25.91-b14:


GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=16m \
-XX:MaxGCPauseMillis=200 \
-XX:+UnlockExperimentalVMOptions \
-XX:G1NewSizePercent=3 \
-XX:ParallelGCThreads=12 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

It seems that our highly varying loads during day vs. night caused some 
issues leading to long pauses until I added the G1NewSizePercent (which 
needs +UnlockExperimentalVMOptions). Things are running smoothly and 
there are reports that the warnings regarding G1 with Lucene tests don't 
happen anymore with the newer Java versions, but it's of course up to 
you if you're willing to take the chance.
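
For reference, a minimal sketch of enabling GC logging to verify the 
pause times after such a change (assuming a stock solr.in.sh on Java 8; 
the exact flag set is illustrative):

GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime"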


Regards,
Ere


Re: SOLR ranking

2016-02-19 Thread Ere Maijala
View this message in context:

http://lucene.472066.n3.nabble.com/SOLR-ranking-tp4257367p4257782.html

Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

--

Regards,
Binoy Dalal





--
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: SOLR 5.4.0?

2016-01-08 Thread Ere Maijala
Sorry for taking so long. I can confirm that SOLR-8418 is fixed for me 
in a self-built 5.5.0 snapshot. Now the next obvious question is, any 
ETA for a release?


Regards,
Ere

31.12.2015, 19.15, Erick Erickson wrote:

Ere:

Can you help with testing the patch if it's important to you? Ramkumar
is working on it...


Best,
Erick

On Wed, Dec 30, 2015 at 11:07 PM, Ere Maijala <ere.maij...@helsinki.fi> wrote:

Well, for us SOLR-8418 is a major issue. I haven't encountered other issues,
but that one was sort of a show-stopper.

--Ere

31.12.2015, 7.27, William Bell wrote:


How is SOLR 5.4.0 ? I heard there was a quick 5.4.1 coming out?

Any major issues?



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: how to search miilions of record in solr query

2016-01-05 Thread Ere Maijala
Well, if you already know that you need to display only the first 20 
records, why not only search for them? Or if you don't know whether they 
already exist, search for, say, a hundred, then a thousand and so on until 
you have enough.
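
To make the chunking idea concrete, here is a hedged sketch using the 
terms query parser (collection, field name and IDs are illustrative):

curl -G 'http://localhost:8983/solr/collection1/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!terms f=id}1,4,7,10,13' \
  --data-urlencode 'rows=100'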


Nevertheless, what's really needed for a good answer or ideas on how to 
do what you need, is where this requirement comes from. I guess the main 
question is: WHY do you need to search for millions of IDs?


--Ere

5.1.2016, 16.12, Mugeesh Husain wrote:

Thanks for your reply @Ere Maijala,

One of my eCommerce-based clients has a requirement to search some
records based on IDs, like

IP:8083/select?q=ID:(1,4,7,...up to 1 million), displaying only 10 to 20
records.

If I use the above procedure it takes too much time, and if I instead use
a terms query according to Yonik's blog http://yonik.com/solr-terms-query/,

then I get 122,119 microseconds with a Solr Terms Query.

I am looking for a response time of around 50-150ms.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-search-miilions-of-record-in-solr-query-tp4248360p4248657.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: how to search miilions of record in solr query

2016-01-05 Thread Ere Maijala
You might get better answers if you'd describe your use-case. If, for 
instance, you know all the IDs and you just need to be able to display a 
hundred records among those millions quickly, it would make sense to 
search for only a chunk of 100 IDs at a time. If you need to support 
more search terms than just the ID, then it's a whole different thing. 
If, instead, those millions of IDs define a static subset of records, 
you might be better of adding a category based on the ID at index time. 
So, can you describe why you need to search for millions of IDs first?


--Ere

5.1.2016, 12.05, Mugeesh Husain wrote:

Still I am stuck on how to solve my problem: searching millions of IDs with
minimum response time.

@Upayavira  Please elaborate it.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-search-miilions-of-record-in-solr-query-tp4248360p4248597.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: SOLR 5.4.0?

2015-12-30 Thread Ere Maijala
Well, for us SOLR-8418 is a major issue. I haven't encountered other 
issues, but that one was sort of a show-stopper.


--Ere

31.12.2015, 7.27, William Bell wrote:

How is SOLR 5.4.0 ? I heard there was a quick 5.4.1 coming out?

Any major issues?



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


SOLR-8418: Nasty bug in MoreLikeThis handler in Solr 5.4.0

2015-12-22 Thread Ere Maijala
Those of you who are planning to upgrade to Solr 5.4.0, be aware that 
there's a bug in the MoreLikeThis handler that makes it fail with 
boosting. There's a Solr issue with a patch thanks to Jens Wille: 
https://issues.apache.org/jira/browse/SOLR-8418. I really hope this gets 
into 5.4.1, for us it seems to be a showstopper.


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: The time that init.d script waits before shutdown should be configurable

2015-11-10 Thread Ere Maijala
If you're still going the shell script way, I'd suggest incorporating my 
changes (patch attached to the issue). It allows waiting for a longer 
time but only as long as necessary (like it already does during startup).


--Ere

11.11.2015, 0.01, Upayavira wrote:



On Tue, Nov 10, 2015, at 04:22 PM, Yago Riveiro wrote:

Patch attached to https://issues.apache.org/jira/browse/SOLR-8065





The Windows script is voodoo for me :D, I don't have the knowledge to port
this to a cmd script.


Great! I saw this!

Two things - firstly, when making a patch, please try to avoid
whitespace changes. The only changes that should show in the diff should
be material changes.

Secondly - I think there was a suggestion that this change could be
ported *inside* the SolrCLI - i.e. into Java code. Do you reckon you
could handle that change? Harder than just updating a shell script, I
know, but could be very useful.

Upayavira



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: 5 second timeout in bin/solr stop command

2015-09-17 Thread Ere Maijala

16.9.2015, 16.16, Shawn Heisey wrote:

I agree here.  I don't like the forceful termination unless it becomes
truly necessary.

I changed the timeout to 20 seconds in the script installed in
/etc/init.d ... a bit of a brute force approach.  When I find some time,
I will think about how to make this better, and choose a better default
value.  30 seconds is probably good.  It should also be configurable,
probably in the /var/solr/solr.in.sh config fragment.


Thanks, Shawn. Inspired by this I filed an issue and attached a patch 
in Jira, see https://issues.apache.org/jira/browse/SOLR-8065. The patch 
makes the stop function behave like start so that it waits up to 30 
seconds for the process to shut down and checks the status once a 
second. I didn't make the timeout configurable since I think 30 seconds 
should be enough in any situation (this may be a statement I'll regret 
later..) and the script doesn't wait any longer than necessary. But if 
you find that a necessity, it shouldn't be too difficult to add.


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


5 second timeout in bin/solr stop command

2015-09-16 Thread Ere Maijala

Hi All,

There's currently a five second delay in the bin/solr script when 
stopping a Solr instance before it's forcefully killed. In our 
experience this is not enough to allow a graceful shutdown of an active 
SolrCloud node and it seems a bit brutal to kill the process in the 
middle of shutdown. Since the script already has the bits for checking 
process status, how about checking it once a second for 30 seconds or 
until the process has stopped and only kill it if it doesn't shut down 
in that time?
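
A rough sketch of that idea in shell (not the actual patch; SOLR_PID and 
the 30-second limit are illustrative):

timeout=30
while [ $timeout -gt 0 ] && kill -0 "$SOLR_PID" 2>/dev/null; do
  sleep 1
  timeout=$((timeout - 1))
done
# force-kill only if the graceful shutdown did not finish in time
kill -0 "$SOLR_PID" 2>/dev/null && kill -9 "$SOLR_PID"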


Thanks,
Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Slow highlighting on Solr 5.0.0

2015-05-11 Thread Ere Maijala
Thanks for the pointers. Using hl.usePhraseHighlighter=false does indeed 
make it a lot faster. Obviously it's not really a solution, though, 
since in 4.10 it wasn't a problem and turning it off has consequences. 
I'm looking forward to the improvements in the next releases.
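
For reference, a hedged example of the workaround at the request level 
(host, collection and query are illustrative):

curl -G 'http://localhost:8983/solr/biblio/select' \
  --data-urlencode 'q=test' \
  --data-urlencode 'hl=true' \
  --data-urlencode 'hl.usePhraseHighlighter=false'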


--Ere

8.5.2015, 19.06, Matt Hilt wrote:

I've been looking into this again. The phrase highlighter is much slower
than the default highlighter, so you might be able to add
hl.usePhraseHighlighter=false to your query to make it faster. Note that
the web interface will NOT help here, because that param is true by default,
and the checkbox is basically broken in that respect. Also, the default
highlighter doesn't seem to work in all cases the phrase highlighter does,
though.

Also, the current development branch of 5x is much better than 5.1, but
not as good as 4.10. This ticket seems to be hitting on some of the issues
at hand:
https://issues.apache.org/jira/browse/SOLR-5855


I think this means they are getting there, but the performance is really
still much worse than 4.10, and it's not obvious why.


On 5/5/15, 2:06 AM, Ere Maijala ere.maij...@helsinki.fi wrote:


I'm seeing the same with Solr 5.1.0 after upgrading from 4.10.2. Here
are my timings:

4.10.2:
process: 1432.0
highlight: 723.0

5.1.0:
process: 9570.0
highlight: 8790.0

schema.xml and solrconfig.xml are available at
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf
.

A couple of jstack outputs taken when the query was executing are
available at http://pastebin.com/eJrEy2Wb

Any suggestions would be appreciated. Or would it make sense to just
file a JIRA issue?

--Ere

3.3.2015, 0.48, Matt Hilt wrote:

Short form:
While testing Solr 5.0.0 within our staging environment, I noticed that
highlight enabled queries are much slower than I saw with 4.10. Are
there any obvious reasons why this might be the case? As far as I can
tell, nothing has changed with the default highlight search component or
its parameters.


A little more detail:
The bulk of the collection config set was stolen from the basic 4.X
example config set. I changed my schema.xml and solrconfig.xml just
enough to get 5.0 to create a new collection (removed non-trie fields,
some other deprecated response handler definitions, etc). I can provide
my version of the solr.HighlightComponent config, but it is identical to
the sample_techproducts_configs example in 5.0.  Are there any other
config files I could provide that might be useful?


Number on "much slower":
I indexed a very small subset of my data into the new collection and
used the /select interface to do a simple debug query. Solr 4.10 gives
the following pertinent info:
response: { numFound: 72628,
...
debug: {
timing: { time: 95, process: { time: 94, query: { time: 6 },
highlight: { time: 84 }, debug: { time: 4 } }
---
Whereas solr 5.0 is:
response: { numFound: 1093,
...
debug: {
timing: { time: 6551, process: { time: 6549, query: { time:
0 }, highlight: { time: 6524 }, debug: { time: 25 }






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Slow highlighting on Solr 5.0.0

2015-05-05 Thread Ere Maijala
I'm seeing the same with Solr 5.1.0 after upgrading from 4.10.2. Here 
are my timings:


4.10.2:
process: 1432.0
highlight: 723.0

5.1.0:
process: 9570.0
highlight: 8790.0

schema.xml and solrconfig.xml are available at 
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf.


A couple of jstack outputs taken when the query was executing are 
available at http://pastebin.com/eJrEy2Wb


Any suggestions would be appreciated. Or would it make sense to just 
file a JIRA issue?


--Ere

3.3.2015, 0.48, Matt Hilt wrote:

Short form:
While testing Solr 5.0.0 within our staging environment, I noticed that
highlight enabled queries are much slower than I saw with 4.10. Are
there any obvious reasons why this might be the case? As far as I can
tell, nothing has changed with the default highlight search component or
its parameters.


A little more detail:
The bulk of the collection config set was stolen from the basic 4.X
example config set. I changed my schema.xml and solrconfig.xml just
enough to get 5.0 to create a new collection (removed non-trie fields,
some other deprecated response handler definitions, etc). I can provide
my version of the solr.HighlightComponent config, but it is identical to
the sample_techproducts_configs example in 5.0.  Are there any other
config files I could provide that might be useful?


Number on “much slower”:
I indexed a very small subset of my data into the new collection and
used the /select interface to do a simple debug query. Solr 4.10 gives
the following pertinent info:
response: { numFound: 72628,
...
debug: {
timing: { time: 95, process: { time: 94, query: { time: 6 },
highlight: { time: 84 }, debug: { time: 4 } }
---
Whereas solr 5.0 is:
response: { numFound: 1093,
...
debug: {
timing: { time: 6551, process: { time: 6549, query: { time:
0 }, highlight: { time: 6524 }, debug: { time: 25 }






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Unsubscribe from Mailing list

2015-04-20 Thread Ere Maijala
There's a wiki page about possible issues and solutions for 
unsubscribing, see 
https://wiki.apache.org/solr/Unsubscribing%20from%20mailing%20lists.


Regards,
Ere

20.4.2015, 12.23, Isha Garg wrote:

Hi ,

Can anyone tell me how to unsubscribe from Solr  mailing lists. I tried sending 
email on 'solr-user-unsubscr...@lucene.apache.org', 
'general-unsubscr...@lucene.apache.org'. But it is not working for me.

Thanks & Regards,
Isha Garg
RAGE Frameworks/CreditPointe Services Pvt. LTD
India Off: +91 (20) 4141 3000 Ext:3043
www.rageframeworks.com
www.creditpointe.com









--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: MoreLikeThis (mlt) in sharded SolrCloud

2015-04-19 Thread Ere Maijala
Thanks, Anshum. Looks like there's no way for this to work in 5.1 for us 
so I'll just have to wait for the fixes. Relieving to know it wasn't 
just me, though.


--Ere

18.4.2015, 2.45, Anshum Gupta wrote:

The other issue that would fix half of your problems is:
https://issues.apache.org/jira/browse/SOLR-7143

On Fri, Apr 17, 2015 at 4:35 PM, Anshum Gupta ans...@anshumgupta.net
wrote:


Ah, I meant SOLR-7418 https://issues.apache.org/jira/browse/SOLR-7418.

On Fri, Apr 17, 2015 at 4:30 PM, Anshum Gupta ans...@anshumgupta.net
wrote:


Hi Ere,

Those seem like valid issues. I've created an issue : SOLR-7275
https://issues.apache.org/jira/browse/SOLR-7275 and will create more
as I find more of those.
I plan to get to them and fix over the weekend.

On Wed, Apr 15, 2015 at 5:13 AM, Ere Maijala ere.maij...@helsinki.fi
wrote:


Hi,

I'm trying to gather information on how mlt works or is supposed to work
with SolrCloud and a sharded collection. I've read issues SOLR-6248,
SOLR-5480 and SOLR-4414, and docs at 
https://wiki.apache.org/solr/MoreLikeThis, but I'm still struggling
with multiple issues. I've been testing with Solr 5.1 and the Getting
Started sample cloud. So, with a freshly extracted Solr, these are the
steps I've done:

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted docs/
bin/post -c gettingstarted example/exampledocs/books.json

After this I've tried different variations of queries with limited
success:

http://localhost:8983/solr/gettingstarted/select?q={!mlt}non-existing
causes java.lang.NullPointerException at
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:80)

http://localhost:8983/solr/gettingstarted/select?q={!mlt}978-0641723445



causes java.lang.NullPointerException at
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:84)


http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=title}978-0641723445



causes java.lang.NullPointerException at
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)


http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=cat}978-0641723445



actually gives results


http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=author,cat}978-0641723445



again causes Java.lang.NullPointerException at
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)


I guess the actual question is, how am I supposed to use the handler to
replicate behavior of non-distributed mlt that was formerly used with
qt=morelikethis and the following configuration in solrconfig.xml:

   <requestHandler name="morelikethis" class="solr.MoreLikeThisHandler">
     <lst name="defaults">
       <str name="mlt.fl">title,title_short,callnumber-label,topic,language,author,publishDate</str>
       <str name="mlt.qf">
         title^75
         title_short^100
         callnumber-label^400
         topic^300
         language^30
         author^75
         publishDate
       </str>
       <int name="mlt.mintf">1</int>
       <int name="mlt.mindf">1</int>
       <str name="mlt.boost">true</str>
       <int name="mlt.count">5</int>
       <int name="rows">5</int>
     </lst>
   </requestHandler>

Real-life full schema and config can be found at 
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf

.


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Anshum Gupta





--
Anshum Gupta








--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


MoreLikeThis (mlt) in sharded SolrCloud

2015-04-15 Thread Ere Maijala

Hi,

I'm trying to gather information on how mlt works or is supposed to work 
with SolrCloud and a sharded collection. I've read issues SOLR-6248, 
SOLR-5480 and SOLR-4414, and docs at 
https://wiki.apache.org/solr/MoreLikeThis, but I'm still struggling 
with multiple issues. I've been testing with Solr 5.1 and the Getting 
Started sample cloud. So, with a freshly extracted Solr, these are the 
steps I've done:


bin/solr start -e cloud -noprompt
bin/post -c gettingstarted docs/
bin/post -c gettingstarted example/exampledocs/books.json

After this I've tried different variations of queries with limited success:

http://localhost:8983/solr/gettingstarted/select?q={!mlt}non-existing
causes java.lang.NullPointerException at 
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:80)


http://localhost:8983/solr/gettingstarted/select?q={!mlt}978-0641723445
causes java.lang.NullPointerException at 
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:84)


http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=title}978-0641723445
causes java.lang.NullPointerException at 
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)


http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=cat}978-0641723445
actually gives results

http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=author,cat}978-0641723445
again causes Java.lang.NullPointerException at 
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)



I guess the actual question is, how am I supposed to use the handler to 
replicate behavior of non-distributed mlt that was formerly used with 
qt=morelikethis and the following configuration in solrconfig.xml:


  <requestHandler name="morelikethis" class="solr.MoreLikeThisHandler">
    <lst name="defaults">
      <str name="mlt.fl">title,title_short,callnumber-label,topic,language,author,publishDate</str>
      <str name="mlt.qf">
        title^75
        title_short^100
        callnumber-label^400
        topic^300
        language^30
        author^75
        publishDate
      </str>
      <int name="mlt.mintf">1</int>
      <int name="mlt.mindf">1</int>
      <str name="mlt.boost">true</str>
      <int name="mlt.count">5</int>
      <int name="rows">5</int>
    </lst>
  </requestHandler>

Real-life full schema and config can be found at 
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf.


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Solr 5.1 ignores SOLR_JAVA_MEM setting

2015-04-15 Thread Ere Maijala
Folks, just a quick heads-up that apparently Solr 5.1 introduced a 
change in bin/solr that overrides SOLR_JAVA_MEM setting from solr.in.sh 
or environment. I just filed 
https://issues.apache.org/jira/browse/SOLR-7392. The problem can be 
circumvented by using SOLR_HEAP setting, e.g. SOLR_HEAP=32G, but it's 
not mentioned in solr.in.sh by default.


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Solr 4.10.2 Found core but I get No cores available in dashboard page

2014-12-16 Thread Ere Maijala
Do you have the jts libraries (e.g. jts-1.13.jar) in Solr's classpath 
(quoting from https://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4 
it needs to be in WEB-INF/lib in Solr's war file, basically)?
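
For example, with the stock 4.x example layout something along these 
lines should work once the webapp has been extracted (the exact path is 
an assumption and may differ per install):

cp jts-1.13.jar example/solr-webapp/webapp/WEB-INF/lib/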


--Ere

13.12.2014, 1.54, solr-user wrote:

I did find out the cause of my problems.  Turns out the problem wasn't due to
the solrconfig.xml file; it was in the schema.xml file

I spent a fair bit of time making my solrconfig closer to the default
solrconfig.xml in the solr download; when that didnt get rid of the error I
went back to the only other file we had that was different

Turns out the line that was causing the problem was the middle line in this
location_rpt fieldtype definition:

 <fieldType name="location_rpt"
     class="solr.SpatialRecursivePrefixTreeFieldType"
     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
     geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees" />

The spatialContextFactory line caused the core to not load even though no
error/warning messages were shown.

I missed that extra line somehow; mea culpa.

Anyhow, I really appreciate the responses/help I got on this issue.  many
thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-10-2-Found-core-but-I-get-No-cores-available-in-dashboard-page-tp4173602p4174118.html
Sent from the Solr - User mailing list archive at Nabble.com.




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Ping handler during initial wamup

2014-09-18 Thread Ere Maijala

So, is it possible to configure a ping handler to return quickly with
non-OK status if a search handler is not yet available? This would
allow the load balancer to quickly fail over to another server. I
couldn't find anything like this in the docs, but I'm still hopeful.

I'm aware of the possibility of using a health state file, but I'd
rather have a way of doing this automatically.


If it's not horribly messy to implement, returning a non-OK status
immediately when there is no available searcher seems like a good idea.
Please file an improvement issue in Jira.


Thanks, I've filed https://issues.apache.org/jira/browse/SOLR-6532.

--Ere



Ping handler during initial wamup

2014-09-17 Thread Ere Maijala
As far as I can see, when a Solr instance is started (whether standalone 
or SolrCloud), a PingRequestHandler will wait until index warmup is 
complete before returning (at least with useColdSearcher=false) which 
may take a while. This poses a problem in that a load balancer either 
needs to wait for the result or employ a short timeout for timely 
failover. Of course the request is eventually served, but it would be 
better to be able to switch over to another server until warmup is complete.


So, is it possible to configure a ping handler to return quickly with 
non-OK status if a search handler is not yet available? This would allow 
the load balancer to quickly fail over to another server. I couldn't 
find anything like this in the docs, but I'm still hopeful.


I'm aware of the possibility of using a health state file, but I'd 
rather have a way of doing this automatically.
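
For context, a hedged sketch of the health-file approach mentioned above 
(URL and core name are illustrative; it assumes a healthcheckFile has 
been configured on the ping handler):

# returns a non-OK status while the health file is absent
curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:8983/solr/collection1/admin/ping'
# creating the file via the handler flips the ping back to OK
curl 'http://localhost:8983/solr/collection1/admin/ping?action=enable'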


--Ere


Re: range types in SOLR

2014-05-15 Thread Ere Maijala

David,

thanks, looking forward to LUCENE-5648. I added a comment about 
supporting BC dates. We currently use the spatial support to index date 
ranges with a precision of one day, ranging from year - to .


Just for the record, I had some issues converting bounding box 
Intersects queries to polygons with Solr 4.6.1. Polygon version found 
way more results than it should have. I upgraded to 4.8.0 (and to JTS 
1.13 from 1.12), and now the results are correct.


--Ere

6.5.2014 21.26, david.w.smi...@gmail.com wrote:

Hi Ere,

I appreciate the scattered documentation is confusing for users.  The use
of spatial for time durations is definitely not an official way to do it;
it’s clearly a hack/trick — one that works pretty well if you know the
issues to watch out for.  So I don’t see it getting documented on the
reference guide.  But, you should be happy to know about this:
https://issues.apache.org/jira/browse/LUCENE-5648  “Watch” that issue to
stay abreast of my development on it, and the inevitable Solr FieldType to
follow, and inevitable documentation in the reference guide.  With luck
it’ll get in by 4.9.

The “Intersects(POLYGON(…))” syntax is something I suggest using when you
have to — like when you have a polygon or linestring or if you are indexing
circles.  One of these days there will be a more Solr friendly query parser
— definitely for 4.something.  When that happens, it’ll get
deprecated/removed in trunk/5.

~ David

On Tue, May 6, 2014 at 4:22 AM, Ere Maijala ere.maij...@helsinki.fi wrote:


David,

I made a note about your mentioning the deprecation below to take it into
account in our software, but now that I tried to find out more about this I
ran into some confusion since the Solr documentation regarding spatial
searches is currently quite badly scattered and partly obsolete [1]. I'd
appreciate some clarification on what exactly is deprecated. We're
currently using spatial for both time duration and geographic searches, and
in the latter we also use e.g. Intersects(POLYGON(...)) in addition. Is
this also deprecated and if so, how should I rewrite it? Thanks!

--Ere

[1] It would be really nice if it was possible to find up to date
documentation of at least all this in one place:

https://cwiki.apache.org/confluence/display/solr/Spatial+Search
https://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
http://wiki.apache.org/solr/SpatialForTimeDurations
https://people.apache.org/~hossman/spatial-for-non-
spatial-meetup-20130117/
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/
201212.mbox/%3c1355027722156-4025434.p...@n3.nabble.com%3E

3.3.2014 20.12, Smiley, David W. wrote:


The main reference for this approach is here:
http://wiki.apache.org/solr/SpatialForTimeDurations


Hoss’s illustrations he developed for the meetup presentation are great.
However, there are bugs in the instruction — specifically it’s important
to slightly buffer the query and choose an appropriate maxDistErr.  Also,
it’s more preferable to use the rectangle range query style of spatial
query (e.g. field:[“minX minY” TO “maxX maxY”] as opposed to using
“Intersects(minX minY maxX maxY)”.  There’s no technical difference but
the latter is deprecated and will eventually be removed from Solr 5 /
trunk.

All this said, recognize this is a bit of a hack (one that works well).
There is a good chance a more ideal implementation approach is going to be
developed this year.

~ David


On 3/1/14, 2:54 PM, Shawn Heisey s...@elyograg.org wrote:

  On 3/1/2014 11:41 AM, Thomas Scheffler wrote:



Am 01.03.14 18:24, schrieb Erick Erickson:


I'm not clear what you're really after here.

Solr certainly supports ranges, things like time:[* TO date_spec] or
date_field:[date_spec TO date_spec] etc.


There's also a really creative use of spatial (of all things) to, say
answer questions involving multiple dates per record. Imagine, for
instance, employees with different hours on different days. You can
use spatial to answer questions like which employees are available
on Wednesday between 4PM and 8PM.

And if none of this is relevant, how about you give us some
use-cases? This could well be an XY problem.



Hi,

lets try this example to show the problem. You have some old text that
was written in two periods of time:

1.) 2nd half of 13th century: - 1250-1299
2.) Beginning of 18th century: - 1700-1715

If you are searching for texts that were written between 1300-1699, then
the document described above should not be a hit.

If you make start date and end date multiple this results in:

start: [1250, 1700]
end: [1299, 1715]

A search for documents written between 1300-1699 would be:

(+start:[1300 TO 1699] +end:[1300-1699]) (+start:[* TO 1300] +end:[1300
TO *]) (+start:[*-1699] +end:[1700 TO *])

You see that the document above would obviously hit by (+start:[* TO
1300] +end:[1300 TO *])



This sounds exactly like the spatial use case that Erick just described.

http://wiki.apache.org/solr/SpatialForTimeDurations

Re: range types in SOLR

2014-05-06 Thread Ere Maijala

David,

I made a note about your mentioning the deprecation below to take it 
into account in our software, but now that I tried to find out more 
about this I ran into some confusion since the Solr documentation 
regarding spatial searches is currently quite badly scattered and partly 
obsolete [1]. I'd appreciate some clarification on what exactly is 
deprecated. We're currently using spatial for both time duration and 
geographic searches, and in the latter we also use e.g. 
Intersects(POLYGON(...)) in addition. Is this also deprecated and if so, 
how should I rewrite it? Thanks!


--Ere

[1] It would be really nice if it was possible to find up to date 
documentation of at least all this in one place:


https://cwiki.apache.org/confluence/display/solr/Spatial+Search
https://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
http://wiki.apache.org/solr/SpatialForTimeDurations
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3c1355027722156-4025434.p...@n3.nabble.com%3E

3.3.2014 20.12, Smiley, David W. wrote:

The main reference for this approach is here:
http://wiki.apache.org/solr/SpatialForTimeDurations


Hoss’s illustrations he developed for the meetup presentation are great.
However, there are bugs in the instruction — specifically it’s important
to slightly buffer the query and choose an appropriate maxDistErr.  Also,
it’s more preferable to use the rectangle range query style of spatial
query (e.g. field:[“minX minY” TO “maxX maxY”] as opposed to using
“Intersects(minX minY maxX maxY)”.  There’s no technical difference but
the latter is deprecated and will eventually be removed from Solr 5 /
trunk.
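
As a concrete illustration of the two styles (field name and coordinates 
are made up):

# preferred rectangle range syntax
curl -G 'http://localhost:8983/solr/biblio/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq=location_geo:["-10 -20" TO "40 40"]'
# the deprecated equivalent of the same box:
#   fq=location_geo:"Intersects(-10 -20 40 40)"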

All this said, recognize this is a bit of a hack (one that works well).
There is a good chance a more ideal implementation approach is going to be
developed this year.

~ David


On 3/1/14, 2:54 PM, Shawn Heisey s...@elyograg.org wrote:


On 3/1/2014 11:41 AM, Thomas Scheffler wrote:

Am 01.03.14 18:24, schrieb Erick Erickson:

I'm not clear what you're really after here.

Solr certainly supports ranges, things like time:[* TO date_spec] or
date_field:[date_spec TO date_spec] etc.


There's also a really creative use of spatial (of all things) to, say
answer questions involving multiple dates per record. Imagine, for
instance, employees with different hours on different days. You can
use spatial to answer questions like which employees are available
on Wednesday between 4PM and 8PM.

And if none of this is relevant, how about you give us some
use-cases? This could well be an XY problem.


Hi,

lets try this example to show the problem. You have some old text that
was written in two periods of time:

1.) 2nd half of 13th century: - 1250-1299
2.) Beginning of 18th century: - 1700-1715

If you are searching for texts that were written between 1300-1699, then
the document described above should not be a hit.

If you make start date and end date multiple this results in:

start: [1250, 1700]
end: [1299, 1715]

A search for documents written between 1300-1699 would be:

(+start:[1300 TO 1699] +end:[1300-1699]) (+start:[* TO 1300] +end:[1300
TO *]) (+start:[*-1699] +end:[1700 TO *])

You see that the document above would obviously hit by (+start:[* TO
1300] +end:[1300 TO *])


This sounds exactly like the spatial use case that Erick just described.

http://wiki.apache.org/solr/SpatialForTimeDurations
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117
/

I am not sure whether the following presentation covers time series with
spatial, but it does say deep dive.  It's over an hour long, and done by
David Smiley, who wrote most of the Spatial code in Solr:

http://www.lucenerevolution.org/2013/Lucene-Solr4-Spatial-Deep-Dive

Hopefully someone who has actually used this can hop in and give you
some additional pointers.

Thanks,
Shawn






--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Solr 4 spatial queries, jts and polygons

2012-10-03 Thread Ere Maijala

Thanks David,

this is very good news. :)

Regards,

Ere

1.10.2012 21.46, Smiley, David W. wrote:

Looks like this important bug fix is making it into 4.0 !
http://lucene.472066.n3.nabble.com/VOTE-release-4-0-take-two-tp4010808p4011255.html

On Oct 1, 2012, at 10:26 AM, David Smiley (@MITRE.org) wrote:


Hi Ere,

You are using it correctly.  The problem is this:
https://issues.apache.org/jira/browse/LUCENE-
Sadly, this just missed the 4.0 release which appears to be imminent.  If
the release needs to be respun then I'll get this simple fix in it.  Yeah
this sucks and it's very frustrating to me that it didn't make 4.0.  You
could check out either the 4.0 release branch and apply the patch, to get
this change now. As an alternative option, perhaps it would be useful if I
created a super simple plugin that wraps the existing field type but adds
the fix, to be used in the mean time until the next release.

Updating that wiki page is on the top of my priority list right now,
although I'm leaving to speak at a conference today and won't be back till
Wednesday.

p.s. Normally I respond to these spatial inquiries sooner but my Google
Alerts didn't pick it up yet, which is odd since you used all the right
keywords.

~ David



-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-spatial-queries-jts-and-polygons-tp4010469p4011202.html
Sent from the Solr - User mailing list archive at Nabble.com.





--
Ere Maijala (Mr.)
The National Library of Finland


Solr 4 spatial queries, jts and polygons

2012-09-26 Thread Ere Maijala

Hi All,

I've been trying to get the brand new spatial search functionality 
working with Solr 4 snapshot apache-solr-4.1-2012-09-24_05-10-26 and 
also a trunk build from 22 Sep. I have added the jts and jts-io 
libraries and defined a field as follows:


<fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
    spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
    units="degrees" distErrPct="0.025" maxDistErr="0.000009" />


[...]

<dynamicField name="*_geo" type="geo" indexed="true" stored="true"
    multiValued="true" />


There are no errors during Solr startup, and I can successfully index 
and search rectangles, but I can't get the polygon search to work. E.g. 
this search:


q=*:*fq=location_geo:%22Intersects%28POLYGON%28%28-10%2030,%20-40%2040,%20-10%20-20,%2040%2020,%200%200,%20-10%2030%29%29%29%22

Results in the following error:

Unable to read: POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))

The call stack implies that JTS is not being used:

com.spatial4j.core.exception.InvalidShapeException: Unable to read: 
POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))

at 
com.spatial4j.core.io.ShapeReadWriter.readShape(ShapeReadWriter.java:48)
	at 
org.apache.lucene.spatial.query.SpatialArgsParser.parse(SpatialArgsParser.java:89)
	at 
org.apache.solr.schema.AbstractSpatialFieldType.getFieldQuery(AbstractSpatialFieldType.java:170)
	at 
org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:171)

[...]

As far as I can see, ShapeReadWriter in spatial4j doesn't support 
polygons, and JTS is needed for that. I've read 
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4 (which is a 
bit vague and partly outdated) and some of the JIRA issues too (e.g. 
SOLR-3304, SOLR-2268, LUCENE-3795), but did not notice what I'm missing.


Any pointers or hints on what to do to make this work would be highly 
appreciated. Thanks!


Regards,
Ere

--
Ere Maijala (Mr.)
The National Library of Finland