Minimum set of jars to run EmbeddedSolrServer

2020-09-28 Thread Alexandre Rafalovitch
Hello,

Does anybody know (or has anyone experimented with) what the minimum set of
jars needed to run EmbeddedSolrServer is?

If I just include solr-core, that pulls in a huge number of Jars. I
don't need - for example - Lucene analyzers for Korean and Japanese
for this application.

But what else do I not need? Can I just drop hadoop? calcite?
curator? Are these all loaded on demand, or will something fail?
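
For illustration, this is the kind of trimming I have in mind - a sketch of a
Maven dependency block where the excluded artifacts are my guesses, not a
verified minimal set:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>${solr.version}</version>
  <exclusions>
    <!-- assumed unnecessary for a small embedded use case -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.calcite</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.curator</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <!-- CJK analyzers we don't need in this application -->
    <exclusion>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-kuromoji</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-nori</artifactId>
    </exclusion>
  </exclusions>
</dependency>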

Any pointers would be appreciated.

Regards,
Alex.


ApacheCon at Home 2020 starts tomorrow!

2020-09-28 Thread Anshum Gupta
Hey everyone!

ApacheCon at Home 2020 starts tomorrow. The event is 100% virtual, and free
to register. What’s even better is that this year we have reintroduced the
Lucene/Solr/Search track at ApacheCon.

With 2 full days of sessions covering various Lucene, Solr, and Search topics, I
hope you are able to find some time to attend the sessions and learn
something new and interesting.

There are also various other tracks that span the 3 days of the conference.
The conference starts in just a few hours for our community in Asia and
tomorrow morning for the Americas and Europe. Check out the complete
schedule in the link below.

Here are a few resources you may find useful if you plan to attend
ApacheCon at Home.

ApacheCon website - https://www.apachecon.com/acna2020/index.html
Registration - https://hopin.to/events/apachecon-home
Slack - http://s.apache.org/apachecon-slack
Search Track - https://www.apachecon.com/acah2020/tracks/search.html

See you at ApacheCon.

-- 
Anshum Gupta


Re: solr performance with >1 NUMAs

2020-09-28 Thread Shawn Heisey

On 9/28/2020 12:17 PM, Wei wrote:

Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do you
see any backward compatibility issues for Solr 8 with Java 11? Can we run
Solr 8 built with JDK 8 on a Java 11 JRE, or do we need to rebuild Solr with
the Java 11 JDK?


I do not know of any problems running the binary release of Solr 8 
(which is most likely built with the Java 8 JDK) with a newer release 
like Java 11 or higher.


I think Sun was really burned by such problems cropping up in the days 
of Java 5 and 6, and their developers have worked really hard to make 
sure that never happens again.


If you're running Java 11, you will need to pick a different garbage 
collector if you expect the NUMA flag to function.  The most recent 
releases of Solr are defaulting to G1GC, which as previously mentioned, 
did not gain NUMA optimizations until Java 14.


It is not clear to me whether the NUMA optimizations will work with any 
collector other than Parallel until Java 14.  You would need to check 
Java documentation carefully or ask someone involved with development of 
Java.


If you do see an improvement using the NUMA flag with Java 11, please 
let us know exactly what options Solr was started with.
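
For reference, a minimal sketch of how that could look in solr.in.sh, assuming 
the Parallel collector (the heap size is just an example, adjust for your install):

GC_TUNE="-XX:+UseParallelGC -XX:+UseNUMA"
SOLR_HEAP="8g"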


Thanks,
Shawn


Vulnerabilities in SOLR 8.6.2

2020-09-28 Thread Narayanan, Lakshmi
Hello Solr-User Support team
We have installed the SOLR 8.6.2 package into a docker container in our DEV 
environment. Prior to using it, our security team scanned the docker image 
using SysDig and found a lot of Critical/High/Medium vulnerabilities. The full 
list is in the attached spreadsheet.

Scan Summary
30 STOPS, 190 WARNS, 188 Vulnerabilities

Please advise or point us to how/where to get a package that has been patched 
for the Critical/High/Medium vulnerabilities in the attached spreadsheet.
Your help will be gratefully received.


Lakshmi Narayanan
Marsh & McLennan Companies
121 River Street, Hoboken,NJ-07030
201-284-3345
M: 845-300-3809
Email: lakshmi.naraya...@mmc.com








Attachment: SOLR862 Vulnerabilities.xlsx


Re: Solr storage of fields <-> indexed data

2020-09-28 Thread Edward Turner
That's really good and helpful info, thank you. Perfect.

Best wishes,

Edd

On Mon, 28 Sep 2020, 5:53 pm Shawn Heisey,  wrote:

> On 9/28/2020 8:56 AM, Edward Turner wrote:
> > By removing the copyfields, we've found that our index sizes have reduced
> > by ~40% in some cases, which is great! We're just curious now as to
> exactly
> > how this can be ...
>
> That's not surprising.
>
> > My question is, given the following two schemas, if we index some data to
> > the "description" field, will the index for schema1 be twice as large as
> > the index of schema2? (I guess this relates to how, internally, Solr
> stores
> > field + index data)
> >
> > Old way -- schema1:
> > ===
> > [field and copyField definitions stripped by the mail archive]
>
> If the only field in the indexed documents is "description", the index
> built with schema2 will be half the size of the index built with
> schema1.  Both fields referenced by "copyField" are the same type and
> have the same settings, so they would contain exactly the same data at
> the Lucene level.
>
> Having the same type for a source and destination field is normally only
> useful if multiple sources are copied to a destination, which requires
> multiValued="true" on the destination -- NOT the case in your example.
>
> There is one other use case for a copyField -- using the same data
> differently, with different type values.  For example you might have one
> type for faceting and one for searching.
>
> Thanks,
> Shawn
>


Re: solr performance with >1 NUMAs

2020-09-28 Thread Wei
Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do you
see any backward compatibility issues for Solr 8 with Java 11? Can we run
Solr 8 built with JDK 8 on a Java 11 JRE, or do we need to rebuild Solr with
the Java 11 JDK?

Best,
Wei

On Sat, Sep 26, 2020 at 6:44 PM Shawn Heisey  wrote:

> On 9/26/2020 1:39 PM, Wei wrote:
> > Thanks Shawn! Currently we are still using the CMS collector for solr
> with
> > Java 8. When last evaluated with Solr 7, CMS performs better than G1 for
> > our case. When using G1, is it better to upgrade from Java 8 to Java 11?
> >  From
> https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
> > seems Java 14 is not officially supported for Solr 8.
>
> It has been a while since I was working with Solr every day, and when I
> was, Java 11 did not yet exist.  I have no idea whether Java 11 improves
> things beyond Java 8.  That said ... all software evolves and usually
> improves as time goes by.  It is likely that the newer version has SOME
> benefit.
>
> Regarding whether or not Java 14 is supported:  There are automated
> tests where all the important code branches are run with all major
> versions of Java, including pre-release versions, and those tests do
> include various garbage collectors.  Somebody notices when a combination
> doesn't work, and big problems with newer Java versions are something
> that gets discussed on our mailing lists.
>
> Java 14 has been out for a while, with no big problems being discussed
> so far.  So it is likely that it works with Solr.  Can I say for sure?
> No.  I haven't tried it myself.
>
> I don't have any hardware available where there is more than one NUMA,
> or I would look deeper into this myself.  It would be interesting to
> find out whether the -XX:+UseNUMA option makes a big difference in
> performance.
>
> Thanks,
> Shawn
>


Worker node / collection creation, parallelized streams

2020-09-28 Thread uyilmaz


Hi all,

Today I was fiddling with a streaming expression that takes too long to finish 
and times out. First of all, is it normal for it to time out, rather than just 
taking too long?

Then I read about parallelized streaming expressions, which take a worker 
count as a parameter. We have 10 nodes in our cluster.

First question: if I want to run it on 10 worker nodes, should I provide a 
partition key that takes exactly 10 different values, or does Solr itself figure 
out 10 different values from it? A "mod" function query with modulus 10 came to 
mind, but I got various errors when using it as a partition key.

Second question: how do I correctly create a worker collection? Should it be 
an empty collection with 10 shards of 1 replica each, or 1 shard with 10 
replicas? When I used the latter, I got ArrayIndexOutOfBounds errors with the 
workers parameter set to greater than 1.
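
For reference, the rough shape of the expression I am experimenting with - a 
sketch only, with made-up collection and field names (workerCollection, 
mainCollection, userId):

parallel(workerCollection,
  rollup(
    search(mainCollection, q="*:*", fl="userId", sort="userId asc",
           partitionKeys="userId", qt="/export"),
    over="userId", count(*)),
  workers="10", sort="userId asc")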

~Regards


-- 
uyilmaz 


Add Hosts in SolrCloud

2020-09-28 Thread Massimiliano Randazzo
Hello everybody

I have a SolrCloud cluster consisting of 4 servers, and a collection with 2
shards and replicationFactor 2

Collection: bookReaderAttilioHortis
Shard count: 2
configName: BookReader
replicationFactor: 2
maxShardsPerNode: 2
router: compositeId
autoAddReplicas: false

I would like to add 2 more servers, bringing the shard count to 3 while keeping
replicationFactor 2, to increase storage space and performance

Thank you in advance for your help

Thank you
Massimiliano Randazzo

-- 
Massimiliano Randazzo

Analista Programmatore,
Sistemista Senior
Mobile +39 335 6488039
email: massimiliano.randa...@gmail.com
pec: massimiliano.randa...@pec.net


Re: Solr storage of fields <-> indexed data

2020-09-28 Thread Shawn Heisey

On 9/28/2020 8:56 AM, Edward Turner wrote:

By removing the copyfields, we've found that our index sizes have reduced
by ~40% in some cases, which is great! We're just curious now as to exactly
how this can be ...


That's not surprising.


My question is, given the following two schemas, if we index some data to
the "description" field, will the index for schema1 be twice as large as
the index of schema2? (I guess this relates to how, internally, Solr stores
field + index data)

Old way -- schema1:
===
[field and copyField definitions stripped by the mail archive]

If the only field in the indexed documents is "description", the index 
built with schema2 will be half the size of the index built with 
schema1.  Both fields referenced by "copyField" are the same type and 
have the same settings, so they would contain exactly the same data at 
the Lucene level.
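
For concreteness, the comparison looks roughly like this - field and type 
names other than "description" are illustrative, since the schema snippets 
were stripped from the message:

Old way -- schema1:
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="description_copy" type="text_general" indexed="true" stored="true" multiValued="false"/>
<copyField source="description" dest="description_copy"/>

New way -- schema2:
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>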


Having the same type for a source and destination field is normally only 
useful if multiple sources are copied to a destination, which requires 
multiValued="true" on the destination -- NOT the case in your example.


There is one other use case for a copyField -- using the same data 
differently, with different type values.  For example you might have one 
type for faceting and one for searching.
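
A hedged sketch of that second pattern (field and type names are illustrative):

<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="category_facet" type="string" indexed="true" stored="false" docValues="true"/>
<copyField source="category" dest="category_facet"/>

Here the tokenized field is searched and the string copy is faceted on as an 
exact value.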


Thanks,
Shawn


Returning fields a specific order

2020-09-28 Thread gnandre
Hi,

I have a use-case where I want to compare stored fields values of Solr
documents from two different Solr instances. I can use a diff tool to
compare them but only if they returned the fields in specific order in the
response. I tried setting fl param with all the fields specified in
particular order. However, the results that are returned do not follow
specific order given in fl param. Is there any way to achieve this behavior
in Solr?


Re: Difference in q.op param behavior between Solr 6.3 and Solr 8.5.2

2020-09-28 Thread gnandre
Thanks, this is helpful. I agree. q.op param should not affect fq
parameter. I think this is a feature and not a bug.

On Wed, Sep 23, 2020 at 4:39 PM Erik Hatcher  wrote:

> In 6.3 it did that?   It shouldn't have.  q and fq shouldn't share
> parameters.  fq's themselves shouldn't, IMO, have global defaults.  fq's
> need to be stable and often uniquely specified kinds of constraining query
> parsers ({!terms/term/field,etc}) or rely on basic Lucene query parser
> syntax and be able to stably rely on AND/OR.
>
> Relevancy tuning on q and friends, tweaking those parameters, shouldn't
> affect fq's, to say it a little differently.
>
> One can fq={!lucene q.op=AND}id:(1 2 3)
>
> Erik
>
>
> > On Sep 23, 2020, at 4:23 PM, gnandre  wrote:
> >
> > Is there a way to set default operator as AND for fq parameter in Solr
> > 8.5.2 now?
> >
> > On Tue, Sep 22, 2020 at 7:44 PM gnandre  wrote:
> >
> >> In 6.3, q.op param used to affect q as well fq param behavior. E.g. if
> >> q.op is set to AND and fq is set to id:(1 2 3), no results will show up
> but
> >> if it is set to OR then all 3 results will show up. This does not
> happen in
> >> Solr 8.5.2 anymore.
> >>
> >> Is this a bug? What does one need to do in Solr 8.5.2 to achieve the
> same
> >> behavior besides passing the operator directly in fq param i.e. id:(1
> OR 2
> >> OR 3)
> >>
>
>


Re: Solr storage of fields <-> indexed data

2020-09-28 Thread Erick Erickson
Fields are placed in the index totally separately from each
other, so it’s no wonder that removing
the copyField results in this kind of savings.

And they have to be separate. Consider what comes out of the end of the
analysis chain. The same input could produce totally different output. 
As a trivial example, imagine two fields:

whitespacetokenizer
lowercasefilter

whitespacetokenizer
lowercasefilter
edgengramfilterfactory

and identical input “fleas”. The output of the first would be “fleas”, and the
output of the second would be something like “f”, “fl”, “fle”, “flea”, “fleas”.
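
In schema terms, a minimal sketch of those two chains (type names are made up):

<fieldType name="text_plain" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_edge" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
  </analyzer>
</fieldType>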

Trying to share the tokens between fields would be a nightmare.

And that’s only one of many ways the output of two different analysis
chains could be different…

Best,
Erick



> On Sep 28, 2020, at 10:56 AM, Edward Turner  wrote:
> 
> Hi all,
> 
> We have recently switched to using edismax + qf fields, and no longer use
> copyfields to allow us to easily search over values in multiple fields (by
> copying multiple fields' values to the copyfield destinations, and then
> performing queries over the destination field).
> 
> By removing the copyfields, we've found that our index sizes have reduced
> by ~40% in some cases, which is great! We're just curious now as to exactly
> how this can be ...
> 
> My question is, given the following two schemas, if we index some data to
> the "description" field, will the index for schema1 be twice as large as
> the index of schema2? (I guess this relates to how, internally, Solr stores
> field + index data)
> 
> Old way -- schema1:
> ===
> [field and copyField definitions stripped by the mail archive]
> 
> Many thanks and kind regards,
> 
> Edd



Solr storage of fields <-> indexed data

2020-09-28 Thread Edward Turner
Hi all,

We have recently switched to using edismax + qf fields, and no longer use
copyfields to allow us to easily search over values in multiple fields (by
copying multiple fields' values to the copyfield destinations, and then
performing queries over the destination field).

By removing the copyfields, we've found that our index sizes have reduced
by ~40% in some cases, which is great! We're just curious now as to exactly
how this can be ...

My question is, given the following two schemas, if we index some data to
the "description" field, will the index for schema1 be twice as large as
the index of schema2? (I guess this relates to how, internally, Solr stores
field + index data)

Old way -- schema1:
===
[field and copyField definitions stripped by the mail archive]

Many thanks and kind regards,

Edd


Re: SOLR Cursor Pagination Issue

2020-09-28 Thread Erick Erickson
I said nothing about docId changing. _Any_ sort criterion changing is an issue. 
You’re sorting by score. Well, as you index documents, the new docs change the 
values used to calculate scores for _all_ documents, thus changing 
the sort order and potentially causing unexpected results when using 
cursorMark. That said, I don’t think you’re getting any different scores at all 
if you’re really searching for “(* AND *)”; try returning score in the fl list. 
Are they different?

You still haven’t given an example of the results you’re seeing that are 
unexpected. And my assumption is that you are seeing odd results when you call 
this query again with a cursorMark returned by a previous call. Or are you 
saying that you don’t think facet.query is returning the correct count? Be 
aware that Solr doesn’t support true Boolean logic, see: 
https://lucidworks.com/post/why-not-and-or-and-not/

There’s special handling for the form "fq=NOT something” to change it to 
"fq=*:* NOT something” that’s not present in something like "q=NOT something”. 
How that plays in facet.query I’m not sure, but try “facet.query=*:* NOT 
something” if the facet count is what the problem is.
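
For example, with the field from your request (untested, adjust as needed):

facet.query={!key=NUM_DOCS}*:* NOT SERIES_ID:0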

I have no idea what you’re trying to accomplish with (* AND *) unless those are 
just placeholders and you put real text in them. That’s rather odd. *:* is 
“select everything”...

BTW, returning 10,000 docs is somewhat of an anti-pattern; if you really 
require that many documents, consider streaming.

> On Sep 28, 2020, at 10:21 AM, vmakov...@xbsoftware.by wrote:
> 
> Hi, Erick
> 
> I have a python script that sends requests with CursorMark. This script 
> checks data against the following Expected series criteria:
> Collected series:
> Number of requests:
> Collected unique series:
> The request looks like this: 
> select?indent=off=edismax=json={!key=NUM_DOCS}NOT 
> SERIES_ID:0=NOT 
> SERIES_ID:0=true=true=true=-1=(*
>  AND *)=all_text_stemming all_text=facet_db_code:( "CN" 
> )=-SERIES_CODE:( "TEST" )=SERIES_ID=score desc,docId 
> asc=SERIES_STATUS:T^5=KEY_SERIES_FLAG:1^5=accuracy_name:0=SERIES_STATUS:C^-30=1=*
> 
> DocId does not change during data update. During the data updating process in 
> SolrCloud, the script returned an incorrect number of requests and collected series.
> 
> Best,
> Vlad
> 
> 
> 
> Mon, 28 Sep 2020 08:54:57 -0400, Erick Erickson  
> wrote:
> 
>> Define “incorrect” please. Also, showing the exact query you use would be 
>> helpful.
>> That said, indexing data at the same time you are using CursorMark is not 
>> guaranteed to find all documents. Consider a sort with date asc, id asc. 
>> doc53 has a date of 2001 and the doc has already been returned.
>> Next, you update doc53 to 2020. It now appears sometime later in the results 
>> due to the changed data. Or the other way, doc53 starts with 2020, and while 
>> your cursormark label is in 2010, you change doc53 to have a date of 2001. 
>> It will never be returned.
>> Similarly for anything else you change that’s relevant to the sort criteria 
>> you’re using.
>> CursorMark doesn’t remember _documents_, just, well, call it the fingerprint 
>> (i.e. sort criteria values) of the last document returned so far.
>> Best,
>> Erick
>>> On Sep 28, 2020, at 3:32 AM, vmakov...@xbsoftware.by wrote:
>>> Good afternoon,
>>> Could you please suggest us a solution: during data updating process in 
>>> solrCloud, requests with cursor mark return incorrect data. I suppose that 
>>> the results do not follow each other during the indexation process, because 
>>> the data doesn't have enough time to be replicated between the nodes.
>>> Kind regards,
>>> Vladislav Makovski
> Vladislav Makovski
> Developer
> XB Software Ltd. | Minsk, Belarus
> Site: https://xbsoftware.com
> Skype: vlad__makovski
> Cell:  +37529 6484100



Re: SOLR Cursor Pagination Issue

2020-09-28 Thread vmakovsky

Hi, Erick

I have a python script that sends requests with CursorMark. This script 
checks data against the following Expected series criteria:

Collected series:
Number of requests:
Collected unique series:
The request looks like this: 
select?indent=off=edismax=json={!key=NUM_DOCS}NOT 
SERIES_ID:0=NOT 
SERIES_ID:0=true=true=true=-1=(* 
AND *)=all_text_stemming all_text=facet_db_code:( "CN" 
)=-SERIES_CODE:( "TEST" )=SERIES_ID=score desc,docId 
asc=SERIES_STATUS:T^5=KEY_SERIES_FLAG:1^5=accuracy_name:0=SERIES_STATUS:C^-30=1=*


DocId does not change during data update. During the data updating process in 
SolrCloud, the script returned an incorrect number of requests and collected series.


Best,
Vlad



Mon, 28 Sep 2020 08:54:57 -0400, Erick Erickson  
wrote:


Define “incorrect” please. Also, showing the exact query you use 
would be helpful.


That said, indexing data at the same time you are using CursorMark 
is not guaranteed to find all documents. Consider a sort with date 
asc, id asc. doc53 has a date of 2001 and the doc has already been 
returned.


Next, you update doc53 to 2020. It now appears sometime later in the 
results due to the changed data. 

Or the other way, doc53 starts with 2020, and while your cursormark 
label is in 2010, you change doc53 to have a date of 2001. It will 
never be returned.


Similarly for anything else you change that’s relevant to the sort 
criteria you’re using.


CursorMark doesn’t remember _documents_, just, well, call it the 
fingerprint (i.e. sort criteria values) of the last document returned 
so far.


Best,
Erick


On Sep 28, 2020, at 3:32 AM, vmakov...@xbsoftware.by wrote:

Good afternoon,
Could you please suggest us a solution: during data updating process 
in solrCloud, requests with cursor mark return incorrect data. I 
suppose that the results do not follow each other during the 
indexation process, because the data doesn't have enough time to be 
replicated between the nodes.

Kind regards,
Vladislav Makovski




Vladislav Makovski
Developer
XB Software Ltd. | Minsk, Belarus
Site: https://xbsoftware.com
Skype: vlad__makovski
Cell:  +37529 6484100


Re: Unable to upload updated solr config set

2020-09-28 Thread Erick Erickson
Until then, you can use

bin/solr zk upconfig….
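
For example (ZooKeeper host, config name and path are placeholders):

bin/solr zk upconfig -z localhost:2181 -n sampleConfigSet -d /path/to/conf

Unlike the UPLOAD API in this version, upconfig will overwrite an existing 
configSet of the same name, which is why it works as a stopgap here.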

Best,
Erick

> On Sep 28, 2020, at 10:06 AM, Houston Putman  wrote:
> 
> Until the next Solr minor version is released you will not be able to
> overwrite an existing configSet with a new configSet of the same name.
> 
> The ticket for this feature is SOLR-10391, and it will be included
> in the 8.7.0 release.
> 
> Until then you will have to create a configSet with a new name, and then
> update your collections to point to that new configSet.
> 
> - Houston
> 
> On Sun, Sep 27, 2020 at 6:56 PM Deepu  wrote:
> 
>> Hi,
>> 
>> I was able to upload Solr configs using the solr/admin/configs?action=UPLOAD
>> API, but I get the below error when re-uploading a config set with the same name.
>> 
>> {
>> 
>>  "responseHeader":{
>> 
>>"status":400,
>> 
>>"QTime":51},
>> 
>>  "error":{
>> 
>>"metadata":[
>> 
>>  "error-class","org.apache.solr.common.SolrException",
>> 
>>  "root-error-class","org.apache.solr.common.SolrException"],
>> 
>>"msg":"The configuration sampleConfigSet already exists in zookeeper",
>> 
>>"code":400}}
>> 
>> 
>> How do we re-upload the same config with a few schema & solrconfig changes?
>> 
>> 
>> 
>> Thanks,
>> 
>> Deepu
>> 



Re: Unable to upload updated solr config set

2020-09-28 Thread Houston Putman
Until the next Solr minor version is released you will not be able to
overwrite an existing configSet with a new configSet of the same name.

The ticket for this feature is SOLR-10391, and it will be included
in the 8.7.0 release.

Until then you will have to create a configSet with a new name, and then
update your collections to point to that new configSet.
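
Roughly, and with placeholder names, that means uploading the zipped configSet 
under a new name and then repointing the collection via MODIFYCOLLECTION:

curl -X POST --header "Content-Type: application/octet-stream" \
  --data-binary @myconfig.zip \
  "http://localhost:8983/solr/admin/configs?action=UPLOAD&name=sampleConfigSet_v2"

curl "http://localhost:8983/solr/admin/collections?action=MODIFYCOLLECTION&collection=myCollection&collection.configName=sampleConfigSet_v2"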

- Houston

On Sun, Sep 27, 2020 at 6:56 PM Deepu  wrote:

> Hi,
>
> I was able to upload Solr configs using the solr/admin/configs?action=UPLOAD
> API, but I get the below error when re-uploading a config set with the same name.
>
> {
>
>   "responseHeader":{
>
> "status":400,
>
> "QTime":51},
>
>   "error":{
>
> "metadata":[
>
>   "error-class","org.apache.solr.common.SolrException",
>
>   "root-error-class","org.apache.solr.common.SolrException"],
>
> "msg":"The configuration sampleConfigSet already exists in zookeeper",
>
> "code":400}}
>
>
> How do we re-upload the same config with a few schema & solrconfig changes?
>
>
>
> Thanks,
>
> Deepu
>


Re: Corrupted records after successful commit

2020-09-28 Thread Mr Havercamp
Yes, id is unique key. 

> I bet that if you redefined your updateHandler to give it some name other
than “/update” in solrconfig.xml two things would happen:

Hmm, nice. I didn't think of that but that would definitely identify the
problem. We do have other scripts writing to the index but they are not of
type "talk"; talk is handled completely by a single script (although my
suspicion has been that we have a rogue script somewhere).

Will definitely give a rename a try, at least for my own sanity.

Thanks again.

On Mon, 28 Sep 2020 at 21:26, Erick Erickson 
wrote:

> Is your “id” field your uniqueKey, and is it tokenized? It shouldn’t
> be; use something like “string” or KeywordTokenizer. Definitely do NOT use,
> say, text_general.
>
> It’s very unlikely that records are not being flushed on commit, I’m
> 99.99% certain that’s a red herring and that this is a problem in your
> environment.
>
> Or that some process you don’t know about is sending documents that don’t
> have the information you expect. The fact that you say you’ve disabled your
> update scripts but see this second record being indexed 3 minutes later is
> strong evidence that _someone_ is updating records, is there a cron job
> somewhere that’s sending docs? Other??
>
> I bet that if you redefined your updateHandler to give it some name other
> than “/update” in solrconfig.xml two things would happen:
> 1> this problem will go away
> 2> you’ll get some error report from somewhere telling you that Solr is
> broken because it isn’t accepting documents for update ;)
>
> > On Sep 28, 2020, at 9:01 AM, Mr Havercamp  wrote:
> >
> > Thanks Eric. My knowledge is fairly limited but 1) sounds feasible. Some
> > logs:
> >
> > I write a bunch of records to Solr:
> >
> > 2020-09-28 11:01:01.255 INFO  (qtp918312414-21) [   x:vnc]
> > o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> > params={json.nl=flat=false=json}{add=[
> > talk.tq0rkem4pc.jaydeep.pan...@dev.vnc.de (1679075166122934272),
> > talk.tq0rkem4pc.dmitry.zolotni...@dev.vnc.de (1679075166123982848),
> > talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075166125031424),
> > talk.tq0rkem4pc.nishant.j...@dev.vnc.de (1679075166125031425),
> > talk.tq0rkem4pc.macanh@dev.vnc.de (167907516612608),
> > talk.tq0rkem4pc.kapil.nadiyap...@dev.vnc.de (1679075166126080001),
> > talk.tq0rkem4pc.sanjay.domad...@dev.vnc.de (1679075166126080002),
> > talk.tq0rkem4pc.umesh.sarva...@dev.vnc.de
> (1679075166127128576)],commit=} 0
> > 8
> >
> > Selecting records looks good:
> >
> >  {
> >"talk_id_s":"tq0rkem4pc",
> >"talk_internal_id_s":"29896",
> >"from_s":"from address",
> >"content_txt":["test_116"],
> >"raw_txt":["http://www.w3.org/1999/xhtml\
> > ">test_116"],
> >"created_dt":"2020-09-28T11:00:02Z",
> >"updated_dt":"2020-09-28T11:00:02Z",
> >"type_s":"talk",
> >"talk_type_s":"groupchat",
> >"title_s":"role__change__1_talk@conference",
> >"to_ss":["bunch", "of", "names"],
> >"owner_s":"owner address",
> >"id":"talk.tq0rkem4pc.email@address",
> >"_version_":1679075166127128576}
> >
> > Then, a few minutes later:
> >
> > 2020-09-28 11:04:33.070 INFO  (qtp918312414-21) [   x:vnc]
> > o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> > params={wt=json}{add=[talk.tq0rkem4pc.hayden.yo...@dev.vnc.de
> > (1679075388234399744)]} 0 1
> > 2020-09-28 11:04:33.150 INFO  (qtp918312414-21) [   x:vnc]
> > o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> > params={wt=json}{add=[talk.tq0rkem4pc (1679075388318285824)]} 0 0
> >
> > Checking the record again:
> >
> > {
> >"id":"talk.tq0rkem4pc.email@address",
> >"_version_":1679075388234399744},
> >  {
> >"id":"talk.tq0rkem4pc",
> >"_version_":1679075388318285824}
> >
> > A couple of strange things here:
> >
> > 1. my talk.tq0rkem4pc.email@address record no longer has any data in it.
> > Just id and version.
> >
> > 2. The second entry is really strange; this isn't a valid record at all
> and
> > I don't have any record of creating it.
> >
> > I've ruled out reindexing items both from my indexing script (I just
> don't
> > run it) and an external code snippet updating the record at a later time.
> >
> > Not sure if I've got the terminology right but would I be correct in
> > assuming that it is possible records are not being flushed from the
> buffer
> > when added? I'm assuming there is some kind of buffering or caching going
> > on before records are committed? Is it possible they are getting
> corrupted
> > under higher than usual load?
> >
> >
> > On Mon, 28 Sep 2020 at 20:41, Erick Erickson 
> > wrote:
> >
> >> There are several possibilities:
> >>
> >> 1> you simply have some process incorrectly updating documents.
> >>
> >> 2> you’ve changed your schema sometime without completely deleting your
> >> old index and re-indexing all documents from scratch. I 

Re: Corrupted records after successful commit

2020-09-28 Thread Erick Erickson
Is your “id” field your uniqueKey, and is it tokenized? It shouldn’t be; 
use something like “string” or KeywordTokenizer. Definitely do NOT use, say, 
text_general.
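
In schema terms, roughly (a sketch - field attributes may differ in your schema):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>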

It’s very unlikely that records are not being flushed on commit, I’m 99.99% 
certain that’s a red herring and that this is a problem in your environment.

Or that some process you don’t know about is sending documents that don’t have 
the information you expect. The fact that you say you’ve disabled your update 
scripts but see this second record being indexed 3 minutes later is strong 
evidence that _someone_ is updating records, is there a cron job somewhere 
that’s sending docs? Other?? 

I bet that if you redefined your updateHandler to give it some name other than 
“/update” in solrconfig.xml two things would happen: 
1> this problem will go away
2> you’ll get some error report from somewhere telling you that Solr is broken 
because it isn’t accepting documents for update ;)
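
For what it's worth, a rough sketch of a differently-named handler in 
solrconfig.xml (the name is arbitrary; solr.UpdateRequestHandler is the stock 
class, and depending on your Solr version the implicitly registered /update 
may also need attention):

<requestHandler name="/update-internal" class="solr.UpdateRequestHandler"/>

Point your own indexing script at /update-internal; anything that still hits 
/update is the rogue process.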

> On Sep 28, 2020, at 9:01 AM, Mr Havercamp  wrote:
> 
> Thanks Eric. My knowledge is fairly limited but 1) sounds feasible. Some
> logs:
> 
> I write a bunch of records to Solr:
> 
> 2020-09-28 11:01:01.255 INFO  (qtp918312414-21) [   x:vnc]
> o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> params={json.nl=flat=false=json}{add=[
> talk.tq0rkem4pc.jaydeep.pan...@dev.vnc.de (1679075166122934272),
> talk.tq0rkem4pc.dmitry.zolotni...@dev.vnc.de (1679075166123982848),
> talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075166125031424),
> talk.tq0rkem4pc.nishant.j...@dev.vnc.de (1679075166125031425),
> talk.tq0rkem4pc.macanh@dev.vnc.de (167907516612608),
> talk.tq0rkem4pc.kapil.nadiyap...@dev.vnc.de (1679075166126080001),
> talk.tq0rkem4pc.sanjay.domad...@dev.vnc.de (1679075166126080002),
> talk.tq0rkem4pc.umesh.sarva...@dev.vnc.de (1679075166127128576)],commit=} 0
> 8
> 
> Selecting records looks good:
> 
>  {
>"talk_id_s":"tq0rkem4pc",
>"talk_internal_id_s":"29896",
>"from_s":"from address",
>"content_txt":["test_116"],
>"raw_txt":["http://www.w3.org/1999/xhtml\
> ">test_116"],
>"created_dt":"2020-09-28T11:00:02Z",
>"updated_dt":"2020-09-28T11:00:02Z",
>"type_s":"talk",
>"talk_type_s":"groupchat",
>"title_s":"role__change__1_talk@conference",
>"to_ss":["bunch", "of", "names"],
>"owner_s":"owner address",
>"id":"talk.tq0rkem4pc.email@address",
>"_version_":1679075166127128576}
> 
> Then, a few minutes later:
> 
> 2020-09-28 11:04:33.070 INFO  (qtp918312414-21) [   x:vnc]
> o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> params={wt=json}{add=[talk.tq0rkem4pc.hayden.yo...@dev.vnc.de
> (1679075388234399744)]} 0 1
> 2020-09-28 11:04:33.150 INFO  (qtp918312414-21) [   x:vnc]
> o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> params={wt=json}{add=[talk.tq0rkem4pc (1679075388318285824)]} 0 0
> 
> Checking the record again:
> 
> {
>"id":"talk.tq0rkem4pc.email@address",
>"_version_":1679075388234399744},
>  {
>"id":"talk.tq0rkem4pc",
>"_version_":1679075388318285824}
> 
> A couple of strange things here:
> 
> 1. my talk.tq0rkem4pc.email@address record no longer has any data in it.
> Just id and version.
> 
> 2. The second entry is really strange; this isn't a valid record at all and
> I don't have any record of creating it.
> 
> I've ruled out reindexing items both from my indexing script (I just don't
> run it) and an external code snippet updating the record at a later time.
> 
> Not sure if I've got the terminology right but would I be correct in
> assuming that it is possible records are not being flushed from the buffer
> when added? I'm assuming there is some kind of buffering or caching going
> on before records are commttted? Is it possible they are getting corrupted
> under higher than usual load?
> 
> 
> On Mon, 28 Sep 2020 at 20:41, Erick Erickson 
> wrote:
> 
>> There are several possibilities:
>> 
>> 1> you simply have some process incorrectly updating documents.
>> 
>> 2> you’ve changed your schema sometime without completely deleting your
>> old index and re-indexing all documents from scratch. I recommend in fact
>> indexing into a new collection and using collection aliasing if you can’t
>> delete and recreate the collection before re-indexing. There’s some support
>> for this idea because you say that the doc in question not only changes one
>> way, but then changes back mysteriously. So seg1 (old def) merges with seg2
>> (new def) into seg10 using the old def because merging saw seg1 first. Then
>> sometime later seg3 (new def) merges with seg10 and the data mysteriously
>> comes back because that merge uses seg3 (new def) as a template for how the
>> index “should” look.
>> 
>> But I’ve never heard of Solr (well, Lucene actually) doing this by itself,
>> and I have heard of the merging process doing “interesting” stuff with
>> segments 

Re: Corrupted records after successful commit

2020-09-28 Thread Mr Havercamp
Thanks Eric. My knowledge is fairly limited but 1) sounds feasible. Some
logs:

I write a bunch of records to Solr:

2020-09-28 11:01:01.255 INFO  (qtp918312414-21) [   x:vnc]
o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
params={json.nl=flat=false=json}{add=[
talk.tq0rkem4pc.jaydeep.pan...@dev.vnc.de (1679075166122934272),
talk.tq0rkem4pc.dmitry.zolotni...@dev.vnc.de (1679075166123982848),
talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075166125031424),
talk.tq0rkem4pc.nishant.j...@dev.vnc.de (1679075166125031425),
talk.tq0rkem4pc.macanh@dev.vnc.de (167907516612608),
talk.tq0rkem4pc.kapil.nadiyap...@dev.vnc.de (1679075166126080001),
talk.tq0rkem4pc.sanjay.domad...@dev.vnc.de (1679075166126080002),
talk.tq0rkem4pc.umesh.sarva...@dev.vnc.de (1679075166127128576)],commit=} 0
8

Selecting records looks good:

  {
"talk_id_s":"tq0rkem4pc",
"talk_internal_id_s":"29896",
"from_s":"from address",
"content_txt":["test_116"],
"raw_txt":["http://www.w3.org/1999/xhtml\
">test_116"],
"created_dt":"2020-09-28T11:00:02Z",
"updated_dt":"2020-09-28T11:00:02Z",
"type_s":"talk",
"talk_type_s":"groupchat",
"title_s":"role__change__1_talk@conference",
"to_ss":["bunch", "of", "names"],
"owner_s":"owner address",
"id":"talk.tq0rkem4pc.email@address",
"_version_":1679075166127128576}

Then, a few minutes later:

2020-09-28 11:04:33.070 INFO  (qtp918312414-21) [   x:vnc]
o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
params={wt=json}{add=[talk.tq0rkem4pc.hayden.yo...@dev.vnc.de
(1679075388234399744)]} 0 1
2020-09-28 11:04:33.150 INFO  (qtp918312414-21) [   x:vnc]
o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
params={wt=json}{add=[talk.tq0rkem4pc (1679075388318285824)]} 0 0

Checking the record again:

{
"id":"talk.tq0rkem4pc.email@address",
"_version_":1679075388234399744},
  {
"id":"talk.tq0rkem4pc",
"_version_":1679075388318285824}

A couple of strange things here:

1. my talk.tq0rkem4pc.email@address record no longer has any data in it.
Just id and version.

2. The second entry is really strange; this isn't a valid record at all and
I don't have any record of creating it.

I've ruled out reindexing items both from my indexing script (I just don't
run it) and an external code snippet updating the record at a later time.

Not sure if I've got the terminology right but would I be correct in
assuming that it is possible records are not being flushed from the buffer
when added? I'm assuming there is some kind of buffering or caching going
on before records are committed? Is it possible they are getting corrupted
under higher than usual load?


On Mon, 28 Sep 2020 at 20:41, Erick Erickson 
wrote:

> There are several possibilities:
>
> 1> you simply have some process incorrectly updating documents.
>
> 2> you’ve changed your schema sometime without completely deleting your
> old index and re-indexing all documents from scratch. I recommend in fact
> indexing into a new collection and using collection aliasing if you can’t
> delete and recreate the collection before re-indexing. There’s some support
> for this idea because you say that the doc in question not only changes one
> way, but then changes back mysteriously. So seg1 (old def) merges with seg2
> (new def) into seg10 using the old def because merging saw seg1 first. Then
> sometime later seg3 (new def) merges with seg10 and the data mysteriously
> comes back because that merge uses seg3 (new def) as a template for how the
> index “should” look.
>
> But I’ve never heard of Solr (well, Lucene actually) doing this by itself,
> and I have heard of the merging process doing “interesting” stuff with
> segments created with changed schema definitions.
>
> Best,
> Erick
>
> > On Sep 28, 2020, at 8:26 AM, Mr Havercamp  wrote:
> >
> > Hi,
> >
> > We're seeing strange behaviour when records have been committed. It
> doesn't
> > happen all the time but enough that the index is very inconsistent.
> >
> > What happens:
> >
> > 1. We commit a doc to Solr,
> > 2. The doc shows in the search results,
> > 3. Later (may be immediate, may take minutes, may take hours), the same
> > document is emptied of all data except version and id.
> >
> > We have custom scripts which add to the index but even without them being
> > executed we see records being updated in this way.
> >
> > For example committing:
> >
> > { id: talk.1234, from: "me", to: "you", "content": "some content", title:
> > "some title"}
> >
> > will suddenly end up as after an initial successful search:
> >
> > { id: talk.1234, version: 1234}
> >
> > Not sure how to proceed on debugging this issue. It seems to settle in
> > after Solr has been running for a while but can just as quickly rectify
> > itself.
> >
> > At a loss how to debug and proceed.
> >
> > Any help much appreciated.
>
>


Re: SOLR Cursor Pagination Issue

2020-09-28 Thread Erick Erickson
Define “incorrect” please. Also, showing the exact query you use would be 
helpful.

That said, indexing data at the same time you are using CursorMark is not 
guaranteed to find all documents. Consider a sort with date asc, id asc. doc53 
has a date of 2001 and the doc has already been returned.

Next, you update doc53 to 2020. It now appears sometime later in the results 
due to the changed data. 

Or the other way, doc53 starts with 2020, and while your cursormark label is in 
2010, you change doc53 to have a date of 2001. It will never be returned.

Similarly for anything else you change that’s relevant to the sort criteria 
you’re using.

CursorMark doesn’t remember _documents_, just, well, call it the fingerprint 
(i.e. sort criteria values) of the last document returned so far.

Best,
Erick

> On Sep 28, 2020, at 3:32 AM, vmakov...@xbsoftware.by wrote:
> 
> Good afternoon,
> Could you please suggest us a solution: during data updating process in 
> solrCloud, requests with cursor mark return incorrect data. I suppose that 
> the results do not follow each other during the indexation process, because 
> the data doesn't have enough time to be replicated between the nodes.
> Kind regards,
> Vladislav Makovski
> 



Re: Corrupted records after successful commit

2020-09-28 Thread Erick Erickson
There are several possibilities:

1> you simply have some process incorrectly updating documents.

2> you’ve changed your schema sometime without completely deleting your old 
index and re-indexing all documents from scratch. I recommend in fact indexing 
into a new collection and using collection aliasing if you can’t delete and 
recreate the collection before re-indexing. There’s some support for this idea 
because you say that the doc in question not only changes one way, but then 
changes back mysteriously. So seg1 (old def) merges with seg2 (new def) into 
seg10 using the old def because merging saw seg1 first. Then sometime later 
seg3 (new def) merges with seg10 and the data mysteriously comes back because 
that merge uses seg3 (new def) as a template for how the index “should” look.

But I’ve never heard of Solr (well, Lucene actually) doing this by itself, and 
I have heard of the merging process doing “interesting” stuff with segments 
created with changed schema definitions.

Best,
Erick

> On Sep 28, 2020, at 8:26 AM, Mr Havercamp  wrote:
> 
> Hi,
> 
> We're seeing strange behaviour when records have been committed. It doesn't
> happen all the time but enough that the index is very inconsistent.
> 
> What happens:
> 
> 1. We commit a doc to Solr,
> 2. The doc shows in the search results,
> 3. Later (may be immediate, may take minutes, may take hours), the same
> document is emptied of all data except version and id.
> 
> We have custom scripts which add to the index but even without them being
> executed we see records being updated in this way.
> 
> For example committing:
> 
> { id: talk.1234, from: "me", to: "you", "content": "some content", title:
> "some title"}
> 
> will suddenly end up as after an initial successful search:
> 
> { id: talk.1234, version: 1234}
> 
> Not sure how to proceed on debugging this issue. It seems to settle in
> after Solr has been running for a while but can just as quickly rectify
> itself.
> 
> At a loss how to debug and proceed.
> 
> Any help much appreciated.



Corrupted records after successful commit

2020-09-28 Thread Mr Havercamp
Hi,

We're seeing strange behaviour when records have been committed. It doesn't
happen all the time but enough that the index is very inconsistent.

What happens:

1. We commit a doc to Solr,
2. The doc shows in the search results,
3. Later (may be immediate, may take minutes, may take hours), the same
document is emptied of all data except version and id.

We have custom scripts which add to the index but even without them being
executed we see records being updated in this way.

For example committing:

{ id: talk.1234, from: "me", to: "you", "content": "some content", title:
"some title"}

will suddenly end up as after an initial successful search:

{ id: talk.1234, version: 1234}

Not sure how to proceed on debugging this issue. It seems to settle in
after Solr has been running for a while but can just as quickly rectify
itself.

At a loss how to debug and proceed.

Any help much appreciated.


Re: What does current mean?

2020-09-28 Thread Kayak28
Hello, Wei-san

Thank you for answering my question.
That pretty much makes sense to me.

Sincerely, Kaya

On Sun, Sep 27, 2020 at 9:16, Wei  wrote:

> My understanding is that current means whether there is data pending to be
> committed.
>
> Best,
> Wei
>
> On Sat, Sep 26, 2020 at 5:09 PM Kayak28  wrote:
>
> > Hello, Solr community:
> >
> > I would like to ask a question about the current icon on the core-overview
> > under statistics.
> >
> > I thought previously that the current tag tells users whether it is
> > searchable or not (committed or not committed) because if I send a
> > commit request, it becomes an OK-ish icon from an NG-ish icon.
> >
> > If anyone knows the meaning of the icon, I would like to hear about it.
> >
> > --
> >
> > Sincerely,
> >
> > Kaya
> >
> > github: https://github.com/28kayak
> >
>


-- 

Sincerely,
Kaya
github: https://github.com/28kayak


SOLR Cursor Pagination Issue

2020-09-28 Thread vmakovsky

Good afternoon,
Could you please suggest us a solution: during data updating process in 
solrCloud, requests with cursor mark return incorrect data. I suppose that 
the results do not follow each other during the indexation process, because 
the data doesn't have enough time to be replicated between the nodes.

Kind regards,
Vladislav Makovski



Re: Solr waitForMerges() causing leaderless shard during shutdown

2020-09-28 Thread Andrzej Białecki
Hi Ramsey,

This is an interesting scenario. I vaguely remember someone (Cao Manh Dat?) working on 
a similar issue - I’m not sure whether newer versions of Solr have already fixed it, but 
it would be helpful to create a Jira issue to investigate and verify that 
it is indeed fixed in a more recent Solr release.


> On 16 Sep 2020, at 13:42, Ramsey Haddad (BLOOMBERG/ LONDON) 
>  wrote:
> 
> Hi Solr community,
> 
> We have been investigating an issue in our solr (7.5.0) setup where the 
> shutdown of our solr node takes quite some time (3-4 minutes) during which we 
> are effectively leaderless.
> After investigating and digging deeper we were able to track it down to 
> segment merges which happen before a solr core is closed.
> 
>  stack trace when killing the node 
> 
> 
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode):
> 
> "Attach Listener" #150736 daemon prio=9 os_prio=0 tid=0x7f6da4002000 
> nid=0x13292 waiting on condition [0x]
> java.lang.Thread.State: RUNNABLE
> 
> "coreCloseExecutor-22-thread-1" #150733 prio=5 os_prio=0 
> tid=0x7f6d54020800 nid=0x11b61 in Object.wait() [0x7f6c98564000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4672)
> - locked <0x0005499908c0> (a org.apache.solr.update.SolrIndexWriter)
> at org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2559)
> - locked <0x0005499908c0> (a org.apache.solr.update.SolrIndexWriter)
> at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1036)
> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1078)
> at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:286)
> at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:892)
> at org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:105)
> at org.apache.solr.update.DefaultSolrCoreState.close(DefaultSolrCoreState.java:399)
> - locked <0x00054e150cc0> (a org.apache.solr.update.DefaultSolrCoreState)
> at org.apache.solr.update.SolrCoreState.decrefSolrCoreState(SolrCoreState.java:83)
> at org.apache.solr.core.SolrCore.close(SolrCore.java:1574)
> at org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:141)
> at org.apache.solr.core.SolrCores$$Lambda$443/1058423472.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 
> 
> 
> 
> The situation is as follows -
> 
> 1. The first thing that happens is the request handlers being closed at -
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1588
> 
> 2. Then it tries to close the index writer via -
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1610
> 
> 3. When closing the index writer, it waits for any pending merges to finish 
> at -
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1236
> 
> Now, if this waitForMerges() takes a long time (3-4 minutes), the instance 
> won't shut down for the whole of that time, but because of *Step 1* it will 
> stop
> accepting any requests.
> 
> This becomes a problem when this node has a leader replica and it is stuck on 
> waitForMerges() after closing its reqHandlers. We are in a situation where
> the leader is not accepting requests but has not given away the leadership, 
> so we are in a leaderless phase.
> 
> 
> This issue triggers when we turn around our nodes, which causes a brief period 
> of leaderless shards and leads to potential data loss.
> 
> My question is -
> 1. How can we avoid this situation, given that we have big segment sizes and 
> merging the largest segments is going to take some time?
> We do not want to reduce the segment size as it will impact our search 
> performance which is crucial.
> 2. Should Solr ideally not do the waitForMerges() step before closing the 
> request handlers?
> 
> 
> Merge Policy config and segment size -
> 
> 
> <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
>   <str name="sort">time_of_arrival desc</str>
>   <str name="wrapped.prefix">inner</str>
>   <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
>   <int name="inner.maxMergeAtOnce">16</int>
>   <int name="inner.maxMergedSegmentMB">20480</int>
> </mergePolicyFactory>
> 
> 
>