Re: Cybersecurity Incident Report

2020-07-23 Thread Man with No Name
Any help on this?
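
One quick sanity check in the meantime: print the jackson-databind version that is actually on the image's classpath. A rough sketch; the WEB-INF/lib location below is an assumption about the solr image layout, not something verified here.

// Hypothetical probe: prints the jackson-databind version found on the classpath.
// Compile and run with the jackson jars from the container on the classpath,
// e.g. the ones under server/solr-webapp/webapp/WEB-INF/lib (path is an assumption).
public class JacksonVersionCheck {
    public static void main(String[] args) {
        System.out.println(com.fasterxml.jackson.databind.cfg.PackageVersion.VERSION);
    }
}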

On Wed, Jul 22, 2020 at 4:25 PM Man with No Name 
wrote:

> The image is pulled from Docker Hub. After scanning the image from Docker
> Hub, without any modification, this is the list of CVEs we're getting.
>
>
> Image: solr:8.4.1-slim    Image ID: 57561b4889690532
>
> CVE            | Package                                      | Version      | Severity | Status                            | CVSS
> CVE-2019-16335 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10                   | 9.8
> CVE-2020-8840  | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical |                                   | 9.8
> CVE-2020-11620 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10.4                 | 9.8
> CVE-2020-9546  | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10.4                 | 9.8
> CVE-2020-9547  | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10.4                 | 9.8
> CVE-2019-20445 | io.netty_netty-codec                        | 4.1.29.Final | critical | fixed in 4.1.44                   | 9.1
> CVE-2020-9548  | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10.4                 | 9.8
> CVE-2017-15095 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.1, 2.8.10            | 9.8
> CVE-2018-14718 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.7                    | 9.8
> CVE-2019-16942 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical |                                   | 9.8
> CVE-2019-14893 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.10.0, 2.9.10           | 9.8
> CVE-2018-7489  | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.5, 2.8.11.1, 2.7.9.3 | 9.8
> CVE-2019-20444 | io.netty_netty-codec                        | 4.1.29.Final | critical | fixed in 4.1.44                   | 9.1
> CVE-2019-14540 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10                   | 9.8
> CVE-2019-16943 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical |                                   | 9.8
> CVE-2020-11612 | io.netty_netty-codec                        | 4.1.29.Final | critical | fixed in 4.1.46                   | 9.8
> CVE-2019-20330 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10.2                 | 9.8
> CVE-2019-17267 | com.fasterxml.jackson.core_jackson-databind | 2.4.0        | critical | fixed in 2.9.10                   | 9.8
>
>
> On Tue, Jul 21, 2020 at 5:06 PM Erick Erickson 
> wrote:
>
>> Not sure where the Docker image came from, but according to:
>> https://issues.apache.org/jira/browse/SOLR-13818
>>
>> Jackson was upgraded to 2.10.0 in Solr 8.4.
>>
>> > On Jul 21, 2020, at 2:59 PM, Man with No Name <
>> pinkeshsharm...@gmail.com> wrote:
>> >
>> > Hey Guys,
>> > Our team is using Solr 8.4.1 in a Kubernetes cluster, using the public
>> > image from Docker Hub. Before getting deployed to the cluster, the
>> > containers get whitescanned, and the scan lists all the CVEs in the
>> > container. This is the list of CVEs we have for Solr:
>> >
>> > CVE-2020-11619, CVE-2020-11620, CVE-2020-8840, CVE-2019-10088,
>> > CVE-2020-10968, CVE-2020-10969, CVE-2020-1, CVE-2020-2,
>> > CVE-2020-3, CVE-2020-14060, CVE-2020-14061, CVE-2020-14062,
>> > CVE-2020-14195, CVE-2019-10094, CVE-2019-12402
>> >
>> > Most of the CVEs are because of the old version of Jackson-databind,
>> and it
>> > has been fixed in the 2.9.10.4 version. So what would 

Re: tlog keeps growing

2020-07-23 Thread Carlos Ugarte
Hello folks,

We see similar behavior from time to time.  The main difference seems to be
that you see it while using NRT replication and we see it while using TLOG
replication.

* Solr 7.5.0.
* 1 collection with 12 shards, each with 2 TLOG and 2 PULL replicas.
* 12 machines, each machine hosting one node/JVM.  Each node contains 4
replicas (different shards).
* No explicit commits in the update requests.  AutoCommit=15s,
AutoSoftCommit=1s.

The symptoms we observe are as follows:
* It's on a TLOG replica that is not currently the leader.
* For that replica, there is a single transaction log that keeps on growing.
* For that replica, new segments are not being fetched from that shard's
TLOG leader.

In this configuration, one node contains four TLOG cores.  We have observed
the problem occurring on a single one of the cores as well as on multiple
cores in one node.

Anecdotally, it seems to occur more frequently on those collections that
are large (number of documents, size on disk) and that have a higher ingest
rate.  These are vague terms and I don't know that I'm allowed to share
specifics, but I can say that we run a number of different clouds with a
similar setup and that this problem occurs more frequently for the more
loaded clouds.

Initially we couldn't tell that this was occurring (queries were directed
to PULL replicas so not evident to the applications, the TLOG cores with
this problem reported as active and in a good state so nothing obviously
wrong).  Our early alarm system now consists of checking for large
transaction logs.  When we see this, we restart the problem node.  Upon
restart it recovers from its leader (fetching whatever segments it had
missed - hours, days, ...).  Eventually the large transaction log
disappears and that core starts to cycle through a series of smaller
transaction logs (the normal behavior).
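
For what it's worth, the check itself can be as simple as walking the data directory and flagging oversized tlog files. A minimal sketch; the data path and the 5 GB threshold are assumptions, adjust for your install.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class TlogSizeCheck {
    public static void main(String[] args) throws IOException {
        Path solrData = Paths.get("/var/solr/data");   // assumption: default data dir
        long threshold = 5L * 1024 * 1024 * 1024;      // 5 GB, pick your own limit
        try (Stream<Path> paths = Files.walk(solrData)) {
            paths.filter(p -> p.getFileName().toString().startsWith("tlog."))
                 .forEach(p -> {
                     try {
                         long size = Files.size(p);
                         if (size > threshold) {
                             System.out.println("large tlog: " + p + " (" + size + " bytes)");
                         }
                     } catch (IOException ignored) {
                         // the file may have been rotated away while walking
                     }
                 });
        }
    }
}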

We noticed that the ctime of the large transaction log seemed to be
slightly later than that node's restart time.  After that discovery, we saw
the following pattern every time we observed the problem:
* Everything is in a good state at the beginning.  A is a node containing a
TLOG replica that is leader for its shard and B is a node containing a TLOG
replica that is not a leader.  Ingest is ongoing.
* A is stopped for a short period of time (< 30s) and then is started up
again.  If it makes a difference, our way of stopping this relies on
systemd's default behavior - send SIGTERM, wait for 5s, send SIGKILL.
* The TLOG replica in B emerges from this as the leader for its shard.
Everything about B appears to suggest B is operating correctly.  The TLOG
replica in A has the ever-growing transaction log and never fetches new
segments.
* The malfunctioning TLOG replica in A can be "fixed" by restarting A.
* As noted earlier, this can affect cores (in a single node) individually.
It can be a problem for one and not the others (or for all of the cores in
a node).

It was suggested to us that this might be
https://issues.apache.org/jira/browse/SOLR-13486.

On Wed, Jul 22, 2020 at 3:42 PM Gael Jourdan-Weil <
gael.jourdan-w...@kelkoogroup.com> wrote:

> Hello,
>
> I'm facing a situation where a transaction log file keeps growing and is
> never deleted.
>
> The setup is as follow:
> - Solr 8.4.1
> - SolrCloud with 2 nodes
> - 1 collection, 1 shard
>
> On one of the nodes I can see the tlog files having the expected behavior,
> that is, new tlog files being created and old ones being deleted at a
> frequency that matches the autocommit settings.
> For instance, there are currently two files, tlog.0003226 and
> tlog.0003227, each of them around 1G in size.
>
> But on the other node, I see two files, tlog.298 and
> tlog.299, the latter now being 20G and having been created 10
> hours ago.
>
> It already happened a few times, restarting the server seems to make
> things go right but it's obviously not a durable solution.
>
> Do you have any idea what could cause this behavior?
>
> solrconfig.xml:
>   <updateHandler class="solr.DirectUpdateHandler2">
>     <updateLog>
>       <str name="dir">${solr.ulog.dir:}</str>
>       <int name="numRecordsToKeep">1000</int>
>       <int name="maxNumLogsToKeep">100</int>
>     </updateLog>
>     <autoCommit>
>       <maxTime>900000</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>     <autoSoftCommit>
>       <maxTime>180000</maxTime>
>     </autoSoftCommit>
>   </updateHandler>
>
> Kind regards,
> Gaël
>
>


RE: How to measure search performance

2020-07-23 Thread Webster Homer
I forgot to mention: the fields being used in the function query are indexed
fields. They are mostly text fields, which cannot have docValues.

-Original Message-
From: Webster Homer 
Sent: Thursday, July 23, 2020 2:07 PM
To: solr-user@lucene.apache.org
Subject: RE: How to measure search performance

Hi Erick,

This is an example of a pseudo field:
wdim_pno_:if(gt(query({!edismax qf=searchmv_pno v=$q}),0),1,0)
I get your point that it would only be applied to the results returned and not
to all the results. The intent is to be able to identify which of the fields
matched the search. Our business people are keen to know, for internal reasons.

I have not done a lot of function queries like this; does using edismax make
this less performant? My tests have a lot of variability, but I do see an effect
on the QTime when adding these, though it is hard to quantify. It could be as
much as 10%.

Thank you for your quick response.
Webster

-Original Message-
From: Erick Erickson 
Sent: Thursday, July 23, 2020 12:52 PM
To: solr-user@lucene.apache.org
Subject: Re: How to measure search performance

This isn’t usually a cause for concern. Clearing the caches doesn’t necessarily 
clear the OS caches for instance. I think you’re already aware that Lucene uses 
MMapDirectory, meaning the index pages are mapped to OS memory space. Whether 
those pages are actually _in_ the OS physical memory or not is anyone’s guess 
so depending on when they’re needed they might have to be read from disk. This 
is entirely independent of Solr’s caches, and could come into play even if you 
restarted Solr.

Then there’s your function queries for the pseudo fields. This is read from the 
docValues sections of the index. Once again the relevant parts of the index may 
or may not be in the OS memory.

So comparing individual queries is “fraught” with uncertainties. I suppose you 
could reboot the machines each time ;) I’ve only ever had luck averaging a 
bunch of unique queries when trying to measure perf differences.

Do note that function queries for pseudo fields is not something I’d expect to 
add much overhead at all. The reason is that they’re only called for the top N 
docs that you’re returning, not part of the search at all. Consider a function 
query involved in scoring. That one must be called for every document that 
matches. But a function query for a pseudo field is only called for the docs 
returned in the packet, i.e. the “rows” parameter.

Best,
Erick

> On Jul 23, 2020, at 11:49 AM, Webster Homer 
>  wrote:
>
> I'm trying to determine the overhead of adding some pseudo fields to one of 
> our standard searches. The pseudo fields are simply function queries to 
> report if certain fields matched the query or not. I had thought that I could 
> run the search without the change and then re-run the searches with the 
> fields added.
> I had assumed that the QTime in the query response would be a good metric to 
> use when comparing the performance of the two search queries. However I see 
> that the QTime for a query can vary by more than 10%. When testing I cleared 
> the query cache between tests. Usually the QTime would be within a few 
> milliseconds of each other, however in some cases there was a 10X or more 
> difference between them.
> Even cached queries vary in their QTime, though much less.
>
> I am running Solr 7.7.2 in a solrcloud configuration with 2 shards and 2 
> replicas/shard. Our nodes have 32Gb memory and 16GB of heap allocated to solr.
>
> I am concerned that these discrepancies indicate that our system is not tuned 
> well enough.
> Should I expect that a query's QTime really is a measure of the query's 
> inherent performance? Is there a better way to measure query performance?
>
>
>
>
>
> This message and any attachment are confidential and may be privileged or 
> otherwise protected from disclosure. If you are not the intended recipient, 
> you must not copy this message or attachment or disclose the contents to any 
> other person. If you have received this transmission in error, please notify 
> the sender immediately and delete the message and any attachment from your 
> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> accept liability for any omissions or errors in this message which may arise 
> as a result of E-Mail-transmission or for damages resulting from any 
> unauthorized changes of the content of this message and any attachment 
> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> guarantee that this message is free of viruses and does not accept liability 
> for any damages caused by any virus transmitted therewith.
>
>
>
> Click http://www.merckgroup.com/disclaimer to access the German, French, 
> Spanish and Portuguese versions of this disclaimer.



This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy 

RE: How to measure search performance

2020-07-23 Thread Webster Homer
Hi Erick,

This is an example of a pseudo field: wdim_pno_:if(gt(query({!edismax 
qf=searchmv_pno v=$q}),0),1,0)
I get your point that it would only be applied to the results returned and not 
to all the results. The intent is to be able to identify which of the fields 
matched the search. Our business people are keen to know, for internal reasons.
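
For illustration, a minimal SolrJ sketch of attaching such a pseudo field to a request. The base URL, collection name and user query are placeholders; only qf=searchmv_pno and the wdim_pno_ alias come from the example above.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PseudoFieldQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build()) {
            SolrQuery q = new SolrQuery("sodium chloride");   // placeholder user query, referenced as $q below
            q.set("defType", "edismax");
            q.setFields("id", "score",
                "wdim_pno_:if(gt(query({!edismax qf=searchmv_pno v=$q}),0),1,0)");
            QueryResponse rsp = client.query(q);
            System.out.println("QTime=" + rsp.getQTime()
                + " numFound=" + rsp.getResults().getNumFound());
        }
    }
}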

I have not done a lot of function queries like this; does using edismax make
this less performant? My tests have a lot of variability, but I do see an effect
on the QTime when adding these, though it is hard to quantify. It could be as
much as 10%.

Thank you for your quick response.
Webster

-Original Message-
From: Erick Erickson 
Sent: Thursday, July 23, 2020 12:52 PM
To: solr-user@lucene.apache.org
Subject: Re: How to measure search performance

This isn’t usually a cause for concern. Clearing the caches doesn’t necessarily 
clear the OS caches for instance. I think you’re already aware that Lucene uses 
MMapDirectory, meaning the index pages are mapped to OS memory space. Whether 
those pages are actually _in_ the OS physical memory or not is anyone’s guess 
so depending on when they’re needed they might have to be read from disk. This 
is entirely independent of Solr’s caches, and could come into play even if you 
restarted Solr.

Then there’s your function queries for the pseudo fields. This is read from the 
docValues sections of the index. Once again the relevant parts of the index may 
or may not be in the OS memory.

So comparing individual queries is “fraught” with uncertainties. I suppose you 
could reboot the machines each time ;) I’ve only ever had luck averaging a 
bunch of unique queries when trying to measure perf differences.

Do note that function queries for pseudo fields is not something I’d expect to 
add much overhead at all. The reason is that they’re only called for the top N 
docs that you’re returning, not part of the search at all. Consider a function 
query involved in scoring. That one must be called for every document that 
matches. But a function query for a pseudo field is only called for the docs 
returned in the packet, i.e. the “rows” parameter.

Best,
Erick

> On Jul 23, 2020, at 11:49 AM, Webster Homer 
>  wrote:
>
> I'm trying to determine the overhead of adding some pseudo fields to one of 
> our standard searches. The pseudo fields are simply function queries to 
> report if certain fields matched the query or not. I had thought that I could 
> run the search without the change and then re-run the searches with the 
> fields added.
> I had assumed that the QTime in the query response would be a good metric to 
> use when comparing the performance of the two search queries. However I see 
> that the QTime for a query can vary by more than 10%. When testing I cleared 
> the query cache between tests. Usually the QTime would be within a few 
> milliseconds of each other, however in some cases there was a 10X or more 
> difference between them.
> Even cached queries vary in their QTime, though much less.
>
> I am running Solr 7.7.2 in a solrcloud configuration with 2 shards and 2 
> replicas/shard. Our nodes have 32Gb memory and 16GB of heap allocated to solr.
>
> I am concerned that these discrepancies indicate that our system is not tuned 
> well enough.
> Should I expect that a query's QTime really is a measure of the query's 
> inherent performance? Is there a better way to measure query performance?
>
>
>
>
>
> This message and any attachment are confidential and may be privileged or 
> otherwise protected from disclosure. If you are not the intended recipient, 
> you must not copy this message or attachment or disclose the contents to any 
> other person. If you have received this transmission in error, please notify 
> the sender immediately and delete the message and any attachment from your 
> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> accept liability for any omissions or errors in this message which may arise 
> as a result of E-Mail-transmission or for damages resulting from any 
> unauthorized changes of the content of this message and any attachment 
> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> guarantee that this message is free of viruses and does not accept liability 
> for any damages caused by any virus transmitted therewith.
>
>
>
> Click http://www.merckgroup.com/disclaimer to access the German, French, 
> Spanish and Portuguese versions of this disclaimer.



This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy this message or attachment or disclose the contents to any other 
person. If you have received this transmission in error, please notify the 
sender immediately and delete the message and any attachment from your system. 
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept 
liability 

Re: How to measure search performance

2020-07-23 Thread Erick Erickson
This isn’t usually a cause for concern. Clearing the caches doesn’t necessarily 
clear the OS caches for instance. I think you’re already aware that Lucene uses 
MMapDirectory, meaning the index pages are mapped to OS memory space. Whether 
those pages are actually _in_ the OS physical memory or not is anyone’s guess 
so depending on when they’re needed they might have to be read from disk. This 
is entirely independent of Solr’s caches, and could come into play even if you 
restarted Solr.

Then there’s your function queries for the pseudo fields. This is read from the 
docValues sections of the index. Once again the relevant parts of the index may 
or may not be in the OS memory.

So comparing individual queries is “fraught” with uncertainties. I suppose you 
could reboot the machines each time ;) I’ve only ever had luck averaging a 
bunch of unique queries when trying to measure perf differences.
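
For illustration, a rough SolrJ sketch of the averaging approach; the base URL, collection name and query list are placeholders.

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class AvgQTime {
    public static void main(String[] args) throws Exception {
        // a set of unique (uncached) queries; in practice use a few hundred pulled from real logs
        List<String> queries = Arrays.asList("acetone", "buffer ph 7", "sodium chloride", "hplc column");
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build()) {
            long totalQTime = 0;
            for (String qs : queries) {
                totalQTime += client.query(new SolrQuery(qs)).getQTime();
            }
            System.out.println("average QTime (ms): " + (double) totalQTime / queries.size());
        }
    }
}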

Do note that function queries for pseudo fields is not something I’d expect to 
add much overhead at all. The reason is that they’re only called for the top N 
docs that you’re returning, not part of the search at all. Consider a function 
query involved in scoring. That one must be called for every document that 
matches. But a function query for a pseudo field is only called for the docs 
returned in the packet, i.e. the “rows” parameter.

Best,
Erick

> On Jul 23, 2020, at 11:49 AM, Webster Homer 
>  wrote:
> 
> I'm trying to determine the overhead of adding some pseudo fields to one of 
> our standard searches. The pseudo fields are simply function queries to 
> report if certain fields matched the query or not. I had thought that I could 
> run the search without the change and then re-run the searches with the 
> fields added.
> I had assumed that the QTime in the query response would be a good metric to 
> use when comparing the performance of the two search queries. However I see 
> that the QTime for a query can vary by more than 10%. When testing I cleared 
> the query cache between tests. Usually the QTime would be within a few 
> milliseconds of each other, however in some cases there was a 10X or more 
> difference between them.
> Even cached queries vary in their QTime, though much less.
> 
> I am running Solr 7.7.2 in a solrcloud configuration with 2 shards and 2 
> replicas/shard. Our nodes have 32Gb memory and 16GB of heap allocated to solr.
> 
> I am concerned that these discrepancies indicate that our system is not tuned 
> well enough.
> Should I expect that a query's QTime really is a measure of the query's 
> inherent performance? Is there a better way to measure query performance?
> 
> 
> 
> 
> 
> This message and any attachment are confidential and may be privileged or 
> otherwise protected from disclosure. If you are not the intended recipient, 
> you must not copy this message or attachment or disclose the contents to any 
> other person. If you have received this transmission in error, please notify 
> the sender immediately and delete the message and any attachment from your 
> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> accept liability for any omissions or errors in this message which may arise 
> as a result of E-Mail-transmission or for damages resulting from any 
> unauthorized changes of the content of this message and any attachment 
> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> guarantee that this message is free of viruses and does not accept liability 
> for any damages caused by any virus transmitted therewith.
> 
> 
> 
> Click http://www.merckgroup.com/disclaimer to access the German, French, 
> Spanish and Portuguese versions of this disclaimer.



Re: tlog keeps growing

2020-07-23 Thread Walter Underwood
This is a long shot, but look in the overseer queue to see if stuff is stuck. 
We ran into that with 6.x.
We restarted the instance that was the overseer and the newly-elected overseer 
cleared the queue.
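
A minimal SolrJ sketch of one way to poke at the overseer; the ZooKeeper address is a placeholder, and looking at the /overseer/queue znode directly in ZooKeeper is another option.

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class OverseerStatusCheck {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "OVERSEERSTATUS");   // Collections API overseer status report
            GenericSolrRequest req =
                new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params);
            System.out.println(req.process(client).getResponse());
        }
    }
}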

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 23, 2020, at 10:43 AM, Erick Erickson  wrote:
> 
> Yes, you should have seen a new tlog after:
> - a doc was indexed
> - 15 minutes had passed
> - another doc was indexed
> 
> Well, yes, a leader can be in recovery. It looks like this:
> 
> - You’re indexing and docs are written to the tlog.
> - Solr un-gracefully shuts down so the segments haven’t been closed. Note, 
> these are thrown away on restart.
> - Solr is restarted and starts replaying the tlog.
> 
> But, the node shouldn’t be active during this time.
> 
> Of course it’s possible that for some strange reason, the tlog gets set to 
> the buffering state and never gets back to active, which is what the message 
> you posted seems to be indicating.
> 
> So I’m puzzled, let us know what you find…
> 
> Erick
> 
>> On Jul 23, 2020, at 12:56 PM, Gael Jourdan-Weil 
>>  wrote:
>> 
>>> Note that for my previous e-mail you’d have to wait 15 minutes after you 
>>> started indexing to see a new tlog and also wait until at least 1,000 new 
>>> document after _that_ before the large tlog went away. I don't think that’s 
>>> your issue though.
>> Indeed I did wait 15 minutes but not sure 1000 documents were indexed in the 
>> meantime. Though I should've seen a new tlog even if the large one was still 
>> there, right?
>> 
>>> So I think that’s the place to focus. Did the node recover completely and 
>>> go active? Just checking the admin UI and seeing it be green is sometimes 
>>> not enough. Check the state.json znode and see if the state is also 
>>> “active” there.
>> On ZooKeeper (through the Solr UI or directly connecting to ZK) I can see 
>> "state":"active" in the state.json. This seems fine.
>> To be more weird, this is the leader node. Can a leader be in recovery??
>> 
>>> Next, try sending a request directly to that replica. Frankly I’m not sure 
>>> what to expect, but if you get something weird that’d be a “smoking gun” 
>>> that no matter what the admin UI says, the replica isn’t really active. 
>>> Something like “http://blah blah 
>>> blah/solr/collection1_shard1_replica_n1?q=some_query&distrib=false”. The 
>>> “distrib=false” is important, otherwise the request will be forwarded to a 
>>> truly active node.
>> The request works fine, I don't see anything weird at that time in the logs.
>> 
>> I will investigate further and take a look at all what you mentionned.
>> 
>> Kind regards,
>> Gaël
> 



Re: IndexSchema is not mutable error Solr Cloud 7.7.1

2020-07-23 Thread Shawn Heisey

On 7/23/2020 8:56 AM, Porritt, Ian wrote:
Note: the solrconfig has <schemaFactory class="ClassicIndexSchemaFactory"/> defined.



org.apache.solr.common.SolrException: *This IndexSchema is not mutable*.

     at 
org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:376)


Your config contains an update processor chain using the 
AddSchemaFieldsUpdateProcessorFactory.


This config requires a mutable schema, but you have changed to the 
classic schema factory, which is not mutable.


You'll either have to remove the config for the update processor, or 
change back to the mutable schema.  I would recommend the former.


Thanks,
Shawn


Re: tlog keeps growing

2020-07-23 Thread Erick Erickson
Yes, you should have seen a new tlog after:
- a doc was indexed
- 15 minutes had passed
- another doc was indexed

Well, yes, a leader can be in recovery. It looks like this:

- You’re indexing and docs are written to the tlog.
- Solr un-gracefully shuts down so the segments haven’t been closed. Note, 
these are thrown away on restart.
- Solr is restarted and starts replaying the tlog.

But, the node shouldn’t be active during this time.

Of course it’s possible that for some strange reason, the tlog gets set to the 
buffering state and never gets back to active, which is what the message you 
posted seems to be indicating.

So I’m puzzled, let us know what you find…

Erick

> On Jul 23, 2020, at 12:56 PM, Gael Jourdan-Weil 
>  wrote:
> 
>> Note that for my previous e-mail you’d have to wait 15 minutes after you 
>> started indexing to see a new tlog and also wait until at least 1,000 new 
>> document after _that_ before the large tlog went away. I don't think that’s 
>> your issue though.
> Indeed I did wait 15 minutes but not sure 1000 documents were indexed in the 
> meantime. Though I should've seen a new tlog even if the large one was still 
> there, right?
> 
>> So I think that’s the place to focus. Did the node recover completely and go 
>> active? Just checking the admin UI and seeing it be green is sometimes not 
>> enough. Check the state.json znode and see if the state is also “active” 
>> there.
> On ZooKeeper (through the Solr UI or directly connecting to ZK) I can see 
> "state":"active" in the state.json. This seems fine.
> To be more weird, this is the leader node. Can a leader be in recovery??
> 
>> Next, try sending a request directly to that replica. Frankly I’m not sure 
>> what to expect, but if you get something weird that’d be a “smoking gun” 
>> that no matter what the admin UI says, the replica isn’t really active. 
>> Something like “http://blah blah 
>> blah/solr/collection1_shard1_replica_n1?q=some_query&distrib=false”. The 
>> “distrib=false” is important, otherwise the request will be forwarded to a 
>> truly active node.
> The request works fine, I don't see anything weird at that time in the logs.
> 
> I will investigate further and take a look at all what you mentionned.
> 
> Kind regards,
> Gaël



RE: tlog keeps growing

2020-07-23 Thread Gael Jourdan-Weil
> Note that for my previous e-mail you’d have to wait 15 minutes after you 
> started indexing to see a new tlog and also wait until at least 1,000 new 
> document after _that_ before the large tlog went away. I don't think that’s 
> your issue though.
Indeed I did wait 15 minutes but not sure 1000 documents were indexed in the 
meantime. Though I should've seen a new tlog even if the large one was still 
there, right?

> So I think that’s the place to focus. Did the node recover completely and go 
> active? Just checking the admin UI and seeing it be green is sometimes not 
> enough. Check the state.json znode and see if the state is also “active” 
> there.
On ZooKeeper (through the Solr UI or directly connecting to ZK) I can see 
"state":"active" in the state.json. This seems fine.
To be more weird, this is the leader node. Can a leader be in recovery??

> Next, try sending a request directly to that replica. Frankly I’m not sure 
> what to expect, but if you get something weird that’d be a “smoking gun” that 
> no matter what the admin UI says, the replica isn’t really active. Something 
> like “http://blah blah 
> blah/solr/collection1_shard1_replica_n1?q=some_query&distrib=false”. The 
> “distrib=false” is important, otherwise the request will be forwarded to a 
> truly active node.
The request works fine, I don't see anything weird at that time in the logs.

I will investigate further and take a look at all what you mentionned.

Kind regards,
Gaël

Re: Solr 7.4 and log4j2 JSONLayout

2020-07-23 Thread t spam
Hi,

I'm having the exact same issue. Were you able to resolve this?

Kind regards,

Tijmen


How to measure search performance

2020-07-23 Thread Webster Homer
I'm trying to determine the overhead of adding some pseudo fields to one of our 
standard searches. The pseudo fields are simply function queries to report if 
certain fields matched the query or not. I had thought that I could run the 
search without the change and then re-run the searches with the fields added.
I had assumed that the QTime in the query response would be a good metric to 
use when comparing the performance of the two search queries. However I see 
that the QTime for a query can vary by more than 10%. When testing I cleared 
the query cache between tests. Usually the QTime would be within a few 
milliseconds of each other, however in some cases there was a 10X or more 
difference between them.
Even cached queries vary in their QTime, though much less.

I am running Solr 7.7.2 in a solrcloud configuration with 2 shards and 2 
replicas/shard. Our nodes have 32Gb memory and 16GB of heap allocated to solr.

I am concerned that these discrepancies indicate that our system is not tuned 
well enough.
Should I expect that a query's QTime really is a measure of the query's 
inherent performance? Is there a better way to measure query performance?





This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy this message or attachment or disclose the contents to any other 
person. If you have received this transmission in error, please notify the 
sender immediately and delete the message and any attachment from your system. 
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept 
liability for any omissions or errors in this message which may arise as a 
result of E-Mail-transmission or for damages resulting from any unauthorized 
changes of the content of this message and any attachment thereto. Merck KGaA, 
Darmstadt, Germany and any of its subsidiaries do not guarantee that this 
message is free of viruses and does not accept liability for any damages caused 
by any virus transmitted therewith.



Click http://www.merckgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.


IndexSchema is not mutable error Solr Cloud 7.7.1

2020-07-23 Thread Porritt, Ian
Hi All,

 

I made a change to the schema to add new fields in a collection; this was
uploaded to ZooKeeper via the commands below:

 

For the Schema

solr zk cp file:E:\SolrCloud\server\solr\configsets\COLLECTION\conf\schema.xml zk:/configs/COLLECTION/schema.xml -z SERVERNAME1.uleaf.site

 

For the Solrconfig

solr zk cp file:E:\SolrCloud\server\solr\configsets\COLLECTION\conf\solrconfig.xml zk:/configs/COLLECTION/solrconfig.xml -z SERVERNAME1.uleaf.site

Note: the solrconfig has <schemaFactory class="ClassicIndexSchemaFactory"/> defined.

 

 

When I then go to update a record with the new field in it, I get the
following error:

 

org.apache.solr.common.SolrException: This IndexSchema is not mutable.

        at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:376)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:75)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:92)
        at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:327)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:280)
        at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:333)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:235)
        at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:298)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:191)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:126)
        at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:123)
        at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
        at ...

Re: tlog keeps growing

2020-07-23 Thread Erick Erickson
Hmmm, now we’re getting somewhere. Here’s the code block in 
DistributedUpdateProcessor

if (ulog == null || ulog.getState() == UpdateLog.State.ACTIVE || 
(cmd.getFlags() & UpdateCommand.REPLAY) != 0) {
  super.processCommit(cmd);
} else {
  if (log.isInfoEnabled()) {
log.info("Ignoring commit while not ACTIVE - state: {} replay: {}"
, ulog.getState(), ((cmd.getFlags() & UpdateCommand.REPLAY) != 0));
  }
}

Why you’re buffering is the mystery.

Note that for my previous e-mail you’d have to wait 15 minutes after you 
started indexing to see a new tlog and also wait until at least 1,000 new 
document after _that_ before the large tlog went away. I don't think that’s 
your issue though.

On a _very_ quick look at the code (and this is not code I’m intimately 
familiar with), the only time the state should be BUFFERING is if the node is 
in recovery. Once recovery is complete, the tlog state should change.

So I think that’s the place to focus. Did the node recover completely and go 
active? Just checking the admin UI and seeing it be green is sometimes not 
enough. Check the state.json znode and see if the state is also “active” there.
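
For what it's worth, a minimal SolrJ 8.x sketch of that check; the ZooKeeper address and collection name are placeholders.

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class ReplicaStateCheck {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            client.connect();
            // reads the cluster state (state.json) from ZooKeeper rather than trusting the admin UI
            DocCollection coll = client.getZkStateReader().getClusterState().getCollection("collection1");
            for (Replica r : coll.getReplicas()) {
                System.out.println(r.getName() + " -> " + r.getState());
            }
        }
    }
}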

Next, try sending a request directly to that replica. Frankly I’m not sure what 
to expect, but if you get something weird that’d be a “smoking gun” that no 
matter what the admin UI says, the replica isn’t really active. Something like 
“http://blah blah 
blah/solr/collection1_shard1_replica_n1?q=some_query&distrib=false”. The 
“distrib=false” is important, otherwise the request will be forwarded to a 
truly active node.
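
A minimal SolrJ version of that direct request, for reference; the host, core name and query string are placeholders taken from the examples in this thread.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DirectReplicaQuery {
    public static void main(String[] args) throws Exception {
        // point the client at the specific core, not at the collection
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://srv1:8083/solr/collection1_shard1_replica_n1").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false");   // keep the request on this replica only
            System.out.println("numFound=" + client.query(q).getResults().getNumFound());
        }
    }
}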

I’d tail the log on that replica at the same time to gather clues.

Your Solr log should also indicate that the replica went into recovery and, 
eventually, completed. The scenario seems to be that 
- the replica goes into recovery
- the replica either never catches up _or_ it would eventually catch up but is 
processing so much data that it just seems like it’s stuck.

If the replica never catches up, especially if you slow down/stop indexing, 
that’s certainly a bug. In days long ago the tlog replay could be very 
inefficient, but that hasn’t been the case since well before 8.4. Regressions 
are always possible of course.

Since it’s expected that the tlog will grow until recovery is complete, it 
feels like this is somewhat on the right track.

You should see some message at WARN level like 
"Starting log replay…” and "Log replay finished…”

and INFO level messages every 1,000 docs replayed like
"log replay status…"

I’d be grepping my log for anything that mentions “replay” (case-insensitive!) 
If you’re interested in code spelunking, LogReplayer, run, and doReplay in
UpdateLog.java are where you can find the messages I’d expect to see in the log.

If you want to enable DEBUG level for UpdateLog.java you’ll see info about the 
individual entries from the tlog that are replayed, but I’d only go there if 
the progress every 1,000 docs doesn’t show anything useful.

Good Luck!
Erick

> On Jul 23, 2020, at 10:03 AM, Gael Jourdan-Weil 
>  wrote:
> 
> Ignoring commit while not ACTIVE - state: BUFFERING



SolrCloud Atomic Update

2020-07-23 Thread Kayak28
Hello, Solr Community

I am currently using Solr 8.3.0 with SolrCloud mode.
When I took the following steps, I ended up with a super-large
index (approx. ), and the process stopped.

1. Indexed hundreds of thousands of documents.
1.5 One of the SolrCloud servers held around 650GB.
2. Updated (indexed) most of the documents via atomic update (see the sketch below).
2.5 The index kept growing, by +400GB.
3. The process stopped because the disk was full.

Honestly, I did not expect the disk usage to grow this much; I thought it
would eventually settle at around 650GB.

I did not explicitly commit during updating, and I have not edited the merge
policy (i.e., the merge policy is the default).
Also, disk usage is growing linearly.
I could not see the merge policy doing any work...

Here are my questions:
- Is it possible to limit disk usage while atomic-updating?
--- If so, I would like to know how to stop the disk usage growth.
- Has anyone seen an identical situation?


Any clue will be appreciated.
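
The sketch mentioned in step 2, i.e. the kind of atomic update being sent; the collection name, id and field name are hypothetical.

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            // atomic "set": only this field is sent, but the whole document is re-indexed internally
            doc.addField("title", Collections.singletonMap("set", "new title"));
            client.add(doc);
            // no explicit commit, as described above; autoCommit and merging decide when space is reclaimed
        }
    }
}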


-- 

Sincerely,
Kaya
github: https://github.com/28kayak


RE: tlog keeps growing

2020-07-23 Thread Gael Jourdan-Weil
Thanks for all the details.
Every time I go back to this article I learn something new (or should I say,
I remember something that I had forgotten!).

The scenario you are describing could match our experience except the last step 
"you stop indexing entirely and the tlog never gets rotated".
It's very likely that we stopped indexing for a significant amount of time
(maybe ~1 hour), but after that I'm quite sure there was some indexing going on
at a lower rate, and that didn't trigger anything.

Sorry if I'm not answering some of your questions, I'm trying to reproduce and 
test some stuff before giving some definitive answers and apply your 
suggestions.

I can give you some additionals info though:
- We are working with quite small documents (<5KB)
- On a node that is currently having the issue, I tried to force a commit 
(http://.../update?commit=true) and nothing is happening from the tlog POV: it 
still is very large and constantly growing.

I put the logs in DEBUG through the Solr UI and grepped "commit" at the time of 
the forced commit and here's what we have:
  "method": "processCommit",
  "message": "PRE_UPDATE 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
 {commit=true=files-update-processor=_text_}",
  "method": "distribCommit",
  "message": "Distrib commit to: [] params: 
update.chain=files-update-processor=TOLEADER_end_point=leaders=http://srv1:8083/solr/col_blue_shard1/;,
  "method": "distribCommit",
  "message": "Distrib commit to: [StdNode: 
http://srv2:8083/solr/col_blue_shard1/] params: 
update.chain=files-update-processor=FROMLEADER_end_point=replicas=http://srv2:8083/solr/col_blue_shard1/;,
  "message": "sending update to http://srv2:8083/solr/col_blue_shard1/ retry:0 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
 
params:update.chain=files-update-processor=FROMLEADER_end_point=replicas=http://srv1:8083/solr/col_blue_shard1/=true=false=true=false=true;,
  "method": "doLocalCommit",
  "message": "Ignoring commit while not ACTIVE - state: BUFFERING replay: 
false",
  "message": "PRE_UPDATE FINISH 
{commit=true=files-update-processor=_text_}",
  "message": "[col_blue_shard1]  webapp=/shoppingfeedses path=/update 
params={commit=true}{commit=} 0 119",
  "message": "Closing out SolrRequest: 
{commit=true=files-update-processor=_text_}",

"Ignoring commit while not ACTIVE" feels strange but maybe It's doesn't mean 
what I think it means.
Do you see something strange in there?


Kind regards,
Gaël


De : Erick Erickson 
Envoyé : jeudi 23 juillet 2020 13:52
À : solr-user@lucene.apache.org 
Objet : Re: tlog keeps growing 
 
Hmmm, this doesn't account for the tlog growth, but a 15 minute hard
commit is excessive and accounts for your down time on restart if Solr
is forcefully shut down. I’d shorten it to a minute or less. You also
shouldn’t have any replay if you shut down your Solr gracefully.


Here’s lots of background:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

It is _vaguely_ possible (and I’m really reaching here) that this happens if
you’re indexing at a rapid rate when you start Solr. Here’s the scenario:

- Solr un-gracefully shuts down
- you start Solr and it starts replaying the tlog
- more documents come in and are, guess what, written to the tlog
- the tlog grows as long as this goes on.
- you stop indexing entirely and the tlog never gets rotated.

I consider this unlikely-but-possible. One implication here is that it is _not_
a requirement that a replica be “active” to start accumulating docs in the
tlog, they certainly can grow when a replica is recovering. There should also
be messages about starting tlog replay (although I don’t remember the
exact wording).

When you shut Solr down, do you get any message on the screen about 
“forcefully killing Solr”? If so, the tlog gets replayed on startup. This 
shouldn’t
happen and if it is it’d be good to find out why...

Also note that the contract is that enough closed tlogs are kept around to
satisfy numRecordsToKeep. So in this scenario, your large tlog will remain
until enough _new_ tlogs are created to contain over 1,000 documents.

What none of this really explains is why the tlog goes away on restart.
Does it disappear immediately or does it disappear after a while (i.e.
after you index another 1,000 documents and it can be rolled off)?

BTW, numRecordsToKeep really only is useful for “peer sync”. If a follower
needs to recover and there are enough records in the tlog to catch it up
from when it started to recover, the records are sent from the tlog
rather than the full index replicating. If you’re not getting replicas going
into recovery (except of course on startup), this isn’t doing you much
good and I’d reduce it (the default is 100).

Yeah, the open searcher is the soft commit. Do you see any references to
“commit” or 

Re: tlog keeps growing

2020-07-23 Thread Erick Erickson
Hmmm, this doesn't account for the tlog growth, but a 15 minute hard
commit is excessive and accounts for your down time on restart if Solr
is forcefully shut down. I’d shorten it to a minute or less. You also
shouldn’t have any replay if you shut down your Solr gracefully.


Here’s lots of background:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

It is _vaguely_ possible (and I’m really reaching here) that this happens if
you’re indexing at a rapid rate when you start Solr. Here’s the scenario:

- Solr un-gracefully shuts down
- you start Solr and it starts replaying the tlog
- more documents come in and are, guess what, written to the tlog
- the tlog grows as long as this goes on.
- you stop indexing entirely and the tlog never gets rotated.

I consider this unlikely-but-possible. One implication here is that it is _not_
a requirement that a replica be “active” to start accumulating docs in the
tlog, they certainly can grow when a replica is recovering. There should also
be messages about starting tlog replay (although I don’t remember the
exact wording).

When you shut Solr down, do you get any message on the screen about 
“forcefully killing Solr”? If so, the tlog gets replayed on startup. This 
shouldn’t
happen and if it is it’d be good to find out why...

Also note that the contract is that enough closed tlogs are kept around to
satisfy numRecordsToKeep. So in this scenario, your large tlog will remain
until enough _new_ tlogs are created to contain over 1,000 documents.

What none of this really explains is why the tlog goes away on restart.
Does it disappear immediately or does it disappear after a while (i.e.
after you index another 1,000 documents and it can be rolled off)?

BTW, numRecordsToKeep really only is useful for “peer sync”. If a follower
needs to recover and there are enough records in the tlog to catch it up
from when it started to recover, the records are sent from the tlog
rather than the full index replicating. If you’re not getting replicas going
into recovery (except of course on startup), this isn’t doing you much
good and I’d reduce it (the default is 100).

Yeah, the open searcher is the soft commit. Do you see any references to
“commit” or “commitTracker”? Some of those will be the autocommit kicking
in, and there should be info in the message about whether a new searcher
is being opened, which will indicate whether it’s a hard or soft commit.

So I’d shorten your hard commit interval significantly. That shouldn’t matter, 
or
more correctly the large tlog should eventually clear itself out but it’ll 
provide
some data.

One thing to check, assuming you’re indexing docs: does a new tlog
get created 15+ minutes later (and after the replica is active)? If so, I’d expect
expect
the large tlog to be deleted 1,000 docs later.

Best,
Erick

> On Jul 22, 2020, at 4:14 PM, Erick Erickson  wrote:
> 
> I’m assuming you do not have CDCR configured, correct?
> 
> This is weird. Every hard commit should close the current tlog, open a new 
> one and delete old ones respecting numRecordsToKeep.
> 
> Are these NRT replicas or TLOG replicas? That shouldn’t make a lot of 
> difference, but might be a clue.
> 
> Your solr log in the one with 20G tlogs should show commits, is there 
> anything that points up?
> 
> It’s also a bit weird that the numbers are so very different. While not 
> lock-step, I’d expect that they were reasonably close. When you restart the 
> server, does Solr roll over the logs for some period or does it just start 
> accumulating the tlog?
> 
> Are both replicas in the “active” state? And is the replica with the large 
> tlogs the follower or the leader?
> 
> Mainly asking a bunch of questions because I haven’t seen this happen, the 
> answers to the above might give a clue where to look next.
> 
> Best,
> Erick
> 
>> On Jul 22, 2020, at 3:39 PM, Gael Jourdan-Weil 
>>  wrote:
>> 
>> Hello,
>> 
>> I'm facing a situation where a transaction log file keeps growing and is 
>> never deleted.
>> 
>> The setup is as follow:
>> - Solr 8.4.1
>> - SolrCloud with 2 nodes
>> - 1 collection, 1 shard
>> 
>> On one of the node I can see the tlog files having the expected behavior, 
>> that is new tlog files being created and old ones being deleted at a 
>> frequency that matches the autocommit settings.
>> For instance, there is currently two files tlog.0003226 and 
>> tlog.0003227, each of them is around 1G (size).
>> 
>> But on the other node, I see two files tlog.298 and 
>> tlog.299, the later being now 20G and has been created 10 
>> hours ago.
>> 
>> It already happened a few times, restarting the server seems to make things 
>> go right but it's obviously not a durable solution.
>> 
>> Do you have any idea what could cause this behavior?
>> 
>> solrconfig.xml:
>>   <updateHandler class="solr.DirectUpdateHandler2">
>>     <updateLog>
>>       <str name="dir">${solr.ulog.dir:}</str>
>>       <int name="numRecordsToKeep">1000</int>
>>       <int name="maxNumLogsToKeep">100</int>
>>     </updateLog>
>>     <autoCommit>
>>       <maxTime>900000</maxTime>
>>       <openSearcher>false</openSearcher>
>>     </autoCommit>
>>     <autoSoftCommit>
>>       ...

Re: Can't search formatted text in solr

2020-07-23 Thread Erick Erickson
There’s a space between “l” and “oad” in your second doc. Or perhaps it has
markup etc. If you do what I mentioned and use the /terms endpoint to examine
what’s actually in your index, I’m pretty sure you’ll see “l” and “oad” so not 
finding it is perfectly understandable.
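
A minimal SolrJ sketch of that /terms check; the base URL and collection name are placeholders, and the _text_ field comes from the debug output quoted below.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TermsCheck {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build()) {
            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/terms");
            q.setTerms(true);
            q.addTermsField("_text_");
            q.setTermsPrefix("l");     // list indexed terms starting with "l", e.g. "l" vs "load"
            q.setTermsLimit(20);
            QueryResponse rsp = client.query(q);
            for (TermsResponse.Term t : rsp.getTermsResponse().getTerms("_text_")) {
                System.out.println(t.getTerm() + " (docs: " + t.getFrequency() + ")");
            }
        }
    }
}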

What this means is that however you turn the doc into your xml format, it gets
broken up like this. I’ve seen this happen with other markups.

In other words, this has nothing to do with Solr and everything to do with 
whatever
extracts the text from the original document.

If you’re using ExtractingRequestHandler to process this, you’re getting the 
defaults
that Tika uses, which can be tweaked if you run Tika outside Solr; see the Tika
website.

And you’ll never get this 100%. Every document format does weird things, and 
docs
produced by one version don’t necessarily match another version even in the same
format (say PDF). Extracting the plain text correctly for every version of 
every
format is near impossible unless you do them one-by-one.

Best,
Erick

> On Jul 23, 2020, at 2:48 AM, Khare, Kushal (MIND) 
>  wrote:
> 
> I did this debug query thing and everything seems good but still am unable to 
> get the desired doc in my result.
> 
> "debug":{  "rawquerystring":"load",
>"querystring":"load",
>   "parsedquery":"_text_:load",
>   "parsedquery_toString":"_text_:load",
> 
> Actually , CASE 2  in  my previous mail is the same text : "Doing load test 
> for Solr" but the diff I forgot to mention was here the text is formatted to 
> BOLD & Text color is RED.
> In case 1, it was simple text.
> What I observed is while parsing, if I print the the textHandler String...I 
> get this
> 
> [Content_Types].xml
> 
> _rels/.rels
> 
> word/document.xml
>  Thi  s   docum  ent is being used for the QDMS l 
>  oad testing  .
> 
> 
> So, I don't know what goes wrong when i have same text but formatted.
> Please help me with this as it is critical and needs to be delivered very 
> soon.
> 
> Thanks !
> 
> From: Erick Erickson 
> Sent: Thursday, July 23, 2020 1:49 AM
> To: solr-user@lucene.apache.org 
> Subject: Re: Can't search formatted text in solr
> 
>  This email originated from an external source i.e. outside of the 
> organization. Please do not click on links or open any attachment unless you 
> recognize the sender and know the content is safe 
> 
> There’s not much info to go on here. Try attaching &debug=query to the 
> queries and see if the parsed query returned is what you expect. If it is, 
> the next thing I’d do is attach 
> &debugQuery=true&explainOther=id:id_of_doc_that_isnt_showing_up
> 
> This last will show you how scoring was done whether or not the doc is 
> returned in the result set.
> 
> Finally, you can use the admin UI to look at the actual tokens indexed.
> 
> My bet is that your doc format isn’t being analyzed properly, perhaps to do 
> markup and the second case doesn’t get indexed the way you think it should. 
> You can use the terms handler to examine exactly what’s in the index
> 
> Best,
> Erick
> 
>> On Jul 22, 2020, at 12:42 PM, Khare, Kushal (MIND) 
>>  wrote:
>> 
>> Hello guys,
>> I have been using solr for my java application to carry out content search 
>> from the saved docs.
>> I am facing a problem in searching for a word - 'load'
>> There are 2 cases, in 1st search is working good but in second case with the 
>> same doc and same query - 'load' am not getting the result
>> 
>> CASE 1 :
>> 
>> "Doing load test for Solr"  - Simple text in doc format.
>> Works fine
>> 
>> CASE 2 :
>> 
>> "Doing load test for Solr"  - Simple text in doc format.
>> In this case, the solr search fails. I don't get the result when I search 
>> for the term load.
>> 
>> 
>> Please help me with this as am unable to get any help with this
>> 
>> 
>> Thanks !
>> Regards,
>> Kushal Khare
>> 
> 



Re: Can't search formatted text in solr

2020-07-23 Thread Khare, Kushal (MIND)
I did this debug query thing and everything seems good, but I am still unable to
get the desired doc in my result.

"debug":{  "rawquerystring":"load",
"querystring":"load",
   "parsedquery":"_text_:load",
   "parsedquery_toString":"_text_:load",

Actually, CASE 2 in my previous mail is the same text: "Doing load test for
Solr". The difference I forgot to mention is that here the text is formatted
BOLD and the text color is RED.
In case 1, it was plain text.
What I observed is that while parsing, if I print the textHandler String... I get
this:

[Content_Types].xml

_rels/.rels

word/document.xml
  Thi  s   docum  ent is being used for the QDMS l  
oad testing  .


So, I don't know what goes wrong when I have the same text but formatted.
Please help me with this, as it is critical and needs to be delivered very soon.

Thanks !

From: Erick Erickson 
Sent: Thursday, July 23, 2020 1:49 AM
To: solr-user@lucene.apache.org 
Subject: Re: Can't search formatted text in solr

 This email originated from an external source i.e. outside of the 
organization. Please do not click on links or open any attachment unless you 
recognize the sender and know the content is safe 

There’s not much info to go on here. Try attaching &debug=query to the queries 
and see if the parsed query returned is what you expect. If it is, the next 
thing I’d do is attach 
&debugQuery=true&explainOther=id:id_of_doc_that_isnt_showing_up

This last will show you how scoring was done whether or not the doc is returned 
in the result set.
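
In SolrJ terms, a minimal sketch of the same thing; the collection name, query and id are placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DebugQueryExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build()) {
            SolrQuery q = new SolrQuery("load");
            q.set("debug", "true");
            q.set("explainOther", "id:doc_that_isnt_showing_up");
            QueryResponse rsp = client.query(q);
            System.out.println(rsp.getDebugMap());   // parsed query, explain and explainOther sections
        }
    }
}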

Finally, you can use the admin UI to look at the actual tokens indexed.

My bet is that your doc format isn’t being analyzed properly, perhaps to do 
markup and the second case doesn’t get indexed the way you think it should. You 
can use the terms handler to examine exactly what’s in the index

Best,
Erick

> On Jul 22, 2020, at 12:42 PM, Khare, Kushal (MIND) 
>  wrote:
>
> Hello guys,
> I have been using solr for my java application to carry out content search 
> from the saved docs.
> I am facing a problem in searching for a word - 'load'
> There are 2 cases, in 1st search is working good but in second case with the 
> same doc and same query - 'load' am not getting the result
>
> CASE 1 :
>
> "Doing load test for Solr"  - Simple text in doc format.
> Works fine
>
> CASE 2 :
>
> "Doing load test for Solr"  - Simple text in doc format.
> In this case, the solr search fails. I don't get the result when I search for 
> the term load.
>
>
> Please help me with this as am unable to get any help with this
>
>
> Thanks !
> Regards,
> Kushal Khare
>



Re: Question on sorting

2020-07-23 Thread Saurabh Sharma
Hi,
It is because the field is a string, so the values are sorted
lexicographically. It has nothing to do with the number of digits.
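
A minimal SolrJ sketch of one usual fix, assuming the value is (re)indexed into a numeric field such as a plong; the base URL, collection name and the track_id_l field name are hypothetical.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class NumericSortQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setSort("track_id_l", SolrQuery.ORDER.desc);   // numeric sort: 124561 comes before 84806
            for (SolrDocument d : client.query(q).getResults()) {
                System.out.println(d.getFieldValue("track_id_l"));
            }
        }
    }
}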

Thanks
Saurabh


On Thu, Jul 23, 2020, 11:24 AM Srinivas Kashyap
 wrote:

> Hello,
>
> I have schema and field definition as shown below:
>
>  omitNorms="true"/>
>
>
>   />
>
> TRACK_ID field contains "NUMERIC VALUE".
>
> When I use sort on track_id (TRACK_ID desc) it is not working properly.
>
> ->I have below values in Track_ID
>
> Doc1: "84806"
> Doc2: "124561"
>
> Ideally, when I use sort command, query result should be
>
> Doc2: "124561"
> Doc1: "84806"
>
> But I'm getting:
>
> Doc1: "84806"
> Doc2: "124561"
>
> Is this because the field type is string, and doc1 has 5 digits while doc2 has
> 6 digits?
>
> Please provide solution for this.
>
> Thanks,
> Srinivas
>
>
> 
> DISCLAIMER:
> E-mails and attachments from Bamboo Rose, LLC are confidential.
> If you are not the intended recipient, please notify the sender
> immediately by replying to the e-mail, and then delete it without making
> copies or using it in any way.
> No representation is made that this email or any attachments are free of
> viruses. Virus scanning is recommended and is the responsibility of the
> recipient.
>
> Disclaimer
>
> The information contained in this communication from the sender is
> confidential. It is intended solely for use by the recipient and others
> authorized to receive it. If you are not the recipient, you are hereby
> notified that any disclosure, copying, distribution or taking action in
> relation of the contents of this information is strictly prohibited and may
> be unlawful.
>
> This email has been scanned for viruses and malware, and may have been
> automatically archived by Mimecast Ltd, an innovator in Software as a
> Service (SaaS) for business. Providing a safer and more useful place for
> your human generated data. Specializing in; Security, archiving and
> compliance. To find out more visit the Mimecast website.
>