Re: Caffeine Cache Metrics Broken?

2021-03-02 Thread Shawn Heisey

On 3/2/2021 3:47 PM, Stephen Lewis Bianamara wrote:

I'm investigating a weird behavior I've observed in the admin page for
caffeine cache metrics. It looks to me like on the older caches, warm-up
queries were not counted toward hit/miss ratios, which of course makes
sense, but on Caffeine cache it looks like they are. I'm using solr 8.3.

Obviously this makes measuring its true impact a little tough. Is this by
any chance a known issue and already fixed in later versions?


The earlier cache implementations are entirely native to Solr -- all the 
source code is included in the Solr codebase.


Caffeine is a third-party cache implementation that has been integrated 
into Solr.  Some of the metrics might come directly from Caffeine, not 
Solr code.


I would expect warming queries to be counted on any of the cache 
implementations.  One of the reasons that the warming capability exists 
is to pre-populate the caches before actual queries begin.  If warming 
queries are somehow excluded, then the cache metrics would not be correct.


I looked into the code and did not find anything that would keep warming 
queries from affecting stats.  But it is always possible that I just 
didn't know what to look for.


In the master branch (Solr 9.0), CaffeineCache is currently the only 
implementation available.


Thanks,
Shawn


Re: Zookeeper 3.4.5 with Solr 8.8.0

2021-03-01 Thread Shawn Heisey

On 3/1/2021 9:45 PM, Subhajit Das wrote:

That is not possible at this time.

Will it be OK if I remove the zookeeper dependencies (jars) from Solr and 
replace them with 3.5.5 jars?
Thanks in advance.


Maybe.  But I cannot say for sure.

I know that when we upgraded to ZK 3.5, some fairly significant code 
changes in Solr were required.  I did not see whether more changes were 
needed when we upgraded again.


It would not surprise me to learn that a jar swap won't work.  Upgrades 
are far more likely to work than downgrades.


Thanks,
Shawn


Re: Zookeeper 3.4.5 with Solr 8.8.0

2021-03-01 Thread Shawn Heisey

On 3/1/2021 6:51 AM, Subhajit Das wrote:

I noticed, that Solr 8.8.0 uses Zookeeper 3.6.2 client, while Solr 6.3.0 uses 
Zookeeper 3.4.6 client. Is this a client bug or mismatch issue?
If so, how to fix this?


The ZK project guarantees that each minor version (X.Y.Z, where Y is the 
same) will work with the previous minor version or the next minor version.


3.4 and 3.6 are two minor versions apart, and thus compatibility cannot 
be guaranteed.


See the "backward compatibility" matrix here:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement

I think you'll need to upgrade your ZK server ensemble to fix it.

Thanks,
Shawn


Re: Dynamic starting or stoping of zookeepers in a cluster

2021-02-24 Thread Shawn Heisey

On 2/24/2021 9:04 AM, DAVID MARTIN NIETO wrote:

If I'm not mistaken the number of zookeepers must be odd. Having 3 zoos on 3 
different machines, if we temporarily lost one of the three machines, we would 
have only two running and it would be an even number. Would it be advisable in 
this case to start a third ZooKeeper on one of the 2 active machines, or with only 
two zookeepers would there be no blockages in their internal votes?


It does not HAVE to be an odd number.  But increasing the total by one 
doesn't add any additional fault tolerance, and exposes an additional 
point of failure.


If you have 3 servers, 2 of them have to be running to maintain quorum. 
 If you have 4 servers, 3 of them have to be running for the cluster to 
be fully operational.


So a 3-server cluster and a 4-server cluster can survive the failure of 
one machine.  This holds true for larger numbers as well -- with 5 
servers or with 6 servers, you can lose two and stay fully operational. 
 Having that extra server that makes the total even is just wasteful.


Thanks,
Shawn


Re: Caffeine Cache and Filter Cache in 8.3

2021-02-22 Thread Shawn Heisey

On 2/22/2021 1:50 PM, Stephen Lewis Bianamara wrote:




(a) At what version did the caffeine cache reach production stability?
(b) Is the caffeine cache, and really all implementations, able to be used
for any cache, or are there restrictions about which cache implementations may
be used for which cache? If the latter, can you provide some guidance?


The Caffeine-based cache was introduced in Solr 8.3.  It was considered 
viable for production from the time it was introduced.


https://issues.apache.org/jira/browse/SOLR-8241

Something was found and fixed in 8.5.  I do not know what the impact of 
that issue was:


https://issues.apache.org/jira/browse/SOLR-14239

The other cache implementations were deprecated at some point.  Those 
implementations have been removed from the master branch, but still 
exist in the code for 8.x versions.


If you want to use one of the older implementations like FastLRUCache, 
you still can, and will be able to for all future 8.x versions.  When 
9.0 is released at some future date, that will no longer be possible.


The Caffeine-based implementation is probably the best option, but I do 
not have any concrete data to give you.


Thanks,
Shawn


Re: defragmentation can improve performance on SATA class 10 disk ~10000 rpm ?

2021-02-22 Thread Shawn Heisey

On 2/22/2021 12:52 AM, Danilo Tomasoni wrote:

we are running a solr instance with around 41 million documents on a SATA class 10 
disk with around 10,000 rpm.
We are experiencing very slow query responses (on the order of hours...) with an 
average of 205 segments.
We made a test with a normal pc and an SSD disk, and there the same solr 
instance with the same data and the same number of segments was around 45 times 
faster.
Force optimize was also tried to improve the performance, but it was very 
slow, so we abandoned it.

Since we still don't have enterprise server ssd disks, we are now wondering if 
in the meanwhile defragmenting the solrdata folder can help.
The idea is that due to many updates, each segment file is fragmented across 
different physical blocks.
Put another way, each segment file is non-contiguous on disk, and this can 
slow down the Solr response.


The absolute best thing you can do to improve Solr performance is add 
memory.


The OS automatically uses unallocated memory to cache data on the disk. 
 Because memory is far faster than any disk, even SSD, it performs better.


I wrote a wiki page about it:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

If you have sufficient memory, the speed of your disks will have little 
effect on performance.  It's only in cases where there is not enough 
memory that disk performance will matter.


Thanks,
Shawn



Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread Shawn Heisey

On 2/21/2021 3:07 PM, cratervoid wrote:

Thanks Shawn, I copied the solrconfig.xml file from the gettingstarted
example on 7.7.3 installation to the 8.8.0 installation, restarted the
server and it now works. Comparing the two files it looks like as you said
this section was left out of the _default/solrconfig.xml file in version
8.8.0:


 
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

So those trying out the tutorial will need to add this section to get it to
work for sample.html.



This line from that config also is involved:

  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib"
       regex=".*\.jar" />


That loads the contrib jars needed for the ExtractingRequestHandler to 
work right.  There are a LOT of jars there.  Tika is a very heavyweight 
piece of software.


Thanks,
Shawn


Re: HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread Shawn Heisey

On 2/20/2021 3:58 PM, cratervoid wrote:

SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html


The problem here is that the solrconfig.xml in use by the index named 
"gettingstarted" does not define a handler at /update/extract.


Typically a handler defined at that URL path will utilize the extracting 
request handler class.  This handler uses Tika (another Apache project) 
to extract usable data from rich text formats like PDF, HTML, etc.


  
  

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

Note that using this handler will require adding some contrib jars to Solr.

Tika can become very unstable because it deals with undocumented file 
formats, so we do not recommend using that handler in production.  If 
the functionality is important, Tika should be included in a program 
that's separate from Solr, so that if it crashes, it does not take Solr 
down with it.
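
If you do decide to experiment with it anyway, here is a rough SolrJ sketch 
of posting a file to that handler (the core name, file path, and document id 
are just placeholders, not anything from your setup):

    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractExample {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/gettingstarted").build()) {
          ContentStreamUpdateRequest req =
              new ContentStreamUpdateRequest("/update/extract");
          // Post a rich-text file to the handler and let Tika extract it.
          req.addFile(new File("example/exampledocs/sample.html"), "text/html");
          // Supply a uniqueKey value and commit so the doc becomes visible.
          req.setParam("literal.id", "sample-html-1");
          req.setParam("commit", "true");
          client.request(req);
        }
      }
    }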


Thanks,
Shawn


Re: Dynamic starting or stoping of zookeepers in a cluster

2021-02-18 Thread Shawn Heisey

On 2/18/2021 8:20 AM, DAVID MARTIN NIETO wrote:

We've a solr cluster with 4 solr servers and 5 zookeepers in HA mode.
We've tested whether our cluster can maintain the service with only half of 
the cluster, in case of disaster or similar, and we have a problem with the 
zookeepers' config and its static configuration.

In the start script of the 4 Solr servers there is a list of 5 ip:port entries for the 5 
zookeepers of the cluster, so when we "lose" half of the machines (we have 2 zoos 
in one machine and 3 on another) in the worst case we lose 3 of these 5 zookeepers. We 
can start a sixth zookeeper (to have 3 with half of the cluster stopped), but to add it to 
the Solr servers we need to stop and restart them with a new ip:port list that includes it, and 
that's not an automatic or dynamic thing.


In order to have a highly available zookeeper, you must have at least 
three separate physical servers for ZK.  Running multiple zookeepers on 
one physical machine gains you nothing ... because if the whole machine 
fails, you lose all of those zookeepers.  If you have three physical 
servers, one can fail with no problems.  If you have five separate 
physical servers running ZK, then two of the machines can fail without 
taking the cluster down.



Does somebody know another configuration or workaround to have a dynamic list of 
zoos and start or stop some of them without changing the config and 
stopping/starting the Solr servers?


The Zookeeper client was upgraded to 3.5 in Solr 8.2.0.

https://issues.apache.org/jira/browse/SOLR-8346

If you're running at least Solr 8.2.0, and your ZK servers are at least 
version 3.5, then ZK should support dynamic cluster reconfiguration. 
The ZK status page in the admin UI may have some problems after ZK 
undergoes a dynamic reconfiguration, but SolrCloud's core functionality 
should work fine.


Thanks,
Shawn


Re: Cannot find Solr 7.4.1 release

2021-02-18 Thread Shawn Heisey

On 2/18/2021 1:05 AM, Olivier Tavard wrote:

I wanted to download Solr 7.4.1, but I cannot find the 7.4.1 release into
http://archive.apache.org/dist/lucene/solr/ : there are Solr 7.4 and after
directly 7.5.
Of course I can build from source code, but this is frustrating because I
can see that in the 7_4_branch there is a fix that I need (SOLR-12594) with
the status fixed into 7.4.1 and 7.5 versions. Everything seems to have
been prepared to release the 7.4.1, but I cannot find it.
Does this release exist?


That release does not exist.  There was never any discussion about it on 
the dev list.


7.4.1 was added to Jira for tracking purposes, and the code change for 
that issue was saved to branch_7_4 just in case somebody felt a 7.4.1 
release was required.  That issue deals with a problem in metrics, which 
is outside of basic Solr functionality -- not critical enough to warrant 
a point release.


The release process for 7.5.0 was underway about a month after that 
issue was committed.


If 7.5.0 (or one of the many later releases) will not work for your 
needs, then you will need to compile branch_7_4 yourself.  I have used 
custom-compiled versions before in production because we needed a bugfix 
that was not deemed severe enough for a new point release.


You can create binary packages similar to what is available for download 
by running "ant package" in the solr directory of your code checkout.  I 
think that build target only works on *NIX systems -- Windows is missing 
some of the required pieces.


Thanks,
Shawn


Re: Solr 8.0 query length limit

2021-02-18 Thread Shawn Heisey

On 2/18/2021 3:38 AM, Anuj Bhargava wrote:

Solr 8.0 query length limit

We are having an issue where queries are too big, we get no result. And if
we remove a few keywords we get the result.


The best option is to convert the request to POST, as Thomas suggested. 
 With that, the query parameters could be up to 2 megabytes in size 
with no config changes.


The limit for this is enforced by Jetty -- the servlet container that 
Solr ships with.  If you cannot switch your requests to POST, then you 
can find the following line in server/etc/jetty.xml, adjust it, and 
restart Solr:


name="solr.jetty.request.header.size" default="8192" />


A header limit of 8KB is found in nearly all web servers and related 
software, like load balancers.
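
If you are using SolrJ, switching to POST is a one-argument change; a small 
sketch (the base URL, collection name, and query string are just placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PostQueryExample {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr").build()) {
          // In practice this query string would be the very long one that
          // overflows the URI limit.
          SolrQuery query = new SolrQuery("title:dog OR title:cat");
          // METHOD.POST puts the parameters in the request body, so the
          // 8KB header/URI limit no longer applies.
          QueryResponse rsp =
              client.query("mycollection", query, SolrRequest.METHOD.POST);
          System.out.println("Found " + rsp.getResults().getNumFound() + " docs");
        }
      }
    }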


Thanks,
Shawn


Re: Is 8.8.x going be stabilized and finalized?

2021-02-16 Thread Shawn Heisey

On 2/16/2021 7:57 PM, Subhajit Das wrote:

I am planning to use 8.8 line-up for production use.

But recently, a lot of people are complaining on 8.7 and 8.8. Also, there is a 
clearly known issue on 8.8 as well.

Following the trends of earlier versions (5.x, 6.x and 7.x), will 8.8 also be 
finalized?
For 5.x, 5.5.x was last version. For 6.x, 6.6.x was last version. For 7.x, 
7.7.x was last version. It would match the pattern, it seems.
And 9.x is already planned and under development.
And it seems, we require some stability.


All released versions are considered stable.  Sometimes problems are 
uncovered after release.  Sometimes BIG problems.  We try our very best 
to avoid bugs, but achieving that kind of perfection is nearly 
impossible for any software project.


8.8.0 is the most current release.  The 8.8.1 release is underway, but 
there's no way I can give you a concrete date.  The announcement MIGHT 
come in the next few days, but it's always possible it could get pushed 
back.  At this time, the changelog for 8.8.1 has five bugfixes 
mentioned.  It should be more stable than 8.8.0, but it's impossible for 
me to tell you whether you will have any problems with it.


On the dev list, the project is discussing the start of work on the 9.0 
release, but that work has not yet begun.  Even if it started tomorrow, 
it would be several weeks, maybe even a few months, before 9.0 is 
actually released.  On top of the "normal" headaches involved in any new 
major version release, there are some other things going on that might 
further delay 9.0 and future 8.x versions:


* Solr is being promoted from a subproject of Lucene to its own 
top-level project at Apache.  This involves a LOT of work.  Much of that 
work is administrative in nature, which is going to occupy us and take 
away from time that we might spend working on the code and new releases.
* The build system for the master branch, which is currently versioned 
as 9.0.0-SNAPSHOT, was recently switched from Ant+Ivy to Gradle.  It's 
going to take some time to figure out all the fallout from that migration.
* Some of the devs have been involved in an effort to greatly simplify 
and rewrite how SolrCloud does internal management of a cluster.  The 
intent is much better stability and better performance.  You might have 
seen public messages referring to a "reference implementation."  At this 
time, it is unclear how much of that work will make it into 9.0 and how 
much will be revealed in later releases.  We would like very much to 
include at least the first phase in 9.0 if we can.


From what I have seen over the last several years as one of the 
developers on this project, it is likely that 8.9 and possibly even 8.10 
and 8.11 will be released before we see 9.0.  Releases are NOT made on a 
specific schedule, so I cannot tell you which versions you will see or 
when they might happen.


I am fully aware that, despite typing quite a lot of text here, I 
provided almost nothing in the way of concrete information that you can 
use.  Sorry about that.


Thanks,
Shawn


Re: SolrJ: SolrInputDocument.addField()

2021-02-16 Thread Shawn Heisey

On 2/15/2021 10:17 AM, Steven White wrote:

Yes, I have managed schema enabled like so:

   
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">cp-schema.xml</str>
</schemaFactory>

The reason why I enabled it is so that I can dynamically customize the
schema based on what's in the DB.  So that I can add fields to the schema
dynamically.


A managed/mutable schema is a configuration detail that's separate from 
(and required by) the update processor that guesses unknown fields.  It 
has been the default schema factory used in out-of-the-box 
configurations for quite a while.



I guess a better question, to meet my need, is this: how do I tell Solr, in
schema-less mode, to use *my* defined field-type whenever it needs to
create a new field?


The config for that is described here:

https://lucene.apache.org/solr/guide/8_6/schemaless-mode.html#enable-field-class-guessing

It is a bad idea to rely on field guessing for a production index.  Even 
the most carefully designed configuration cannot get it right every 
time.  You're very likely to run into situations where the software's 
best guess turns out to be wrong for your needs.  And then you're forced 
into what you should have done in the first place -- manually fixing the 
definition for that field, which usually also requires reindexing from 
scratch.


One counter-argument to what I stated in the last paragraph that 
frequently comes up is "my data is very well curated and consistent." 
But if that is the case, then you will know what fields and types are 
required *in advance* and you can easily construct a schema yourself 
before sending any data for indexing -- no guessing required.
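
If it helps, here is a small SolrJ sketch of defining a field up front with 
the Schema API instead of relying on guessing (the core name, field name, and 
fieldType name are placeholders -- substitute your own):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.schema.SchemaRequest;

    public class AddFieldExample {
      public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycore").build()) {
          Map<String, Object> fieldAttrs = new LinkedHashMap<>();
          fieldAttrs.put("name", "Company");
          fieldAttrs.put("type", "my_custom_type");  // your own fieldType
          fieldAttrs.put("indexed", true);
          fieldAttrs.put("stored", true);
          // Adds the field to the managed schema before any documents arrive.
          new SchemaRequest.AddField(fieldAttrs).process(client);
        }
      }
    }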


Thanks,
Shawn


Re: Meaning of "Index" flag under properties and schema

2021-02-16 Thread Shawn Heisey

On 2/16/2021 9:16 AM, ufuk yılmaz wrote:

I didn’t realise that, sorry. The table is like:

Flags       Indexed   Tokenized   Stored   UnInvertible
Properties  Yes       Yes         Yes      Yes
Schema      Yes       Yes         Yes      Yes
Index       Yes       Yes         Yes      NO

The problematic collection has an Index row under the Schema row. No other collection 
has it. I was asking what the “Index” row meant.


I am not completely sure, but I think that row means the field was found 
in the actual Lucene index.


In the original message you mentioned "weird exceptions" but didn't 
include any information about them.  Can you give us those exceptions, 
and the requests that caused them?


Thanks,
Shawn


Re: SolrJ: SolrInputDocument.addField()

2021-02-15 Thread Shawn Heisey

On 2/15/2021 6:52 AM, Steven White wrote:

It looks to me that SolrInputDocument.addField() is either misnamed or
isn't well implemented.

When it is called on a field that doesn't exist in the schema, it will
create that field and give it a type based on the data.  Not only that, it
will set default values.  For example, this call

 SolrInputDocument doc = new SolrInputDocument();
 doc.addField("Company", "ACM company");

Will create the following:

 
 


That SolrJ code does not make those changes to your schema.  At least 
not in the way you're thinking.


It sounds to me like your solrconfig.xml includes what we call 
"schemaless mode" -- an update processor that adds unknown fields when 
they are indexed.  You should disable it.  We strongly recommend never 
using it in production, because it can make the wrong guess about which 
fieldType is required.  The fieldType chosen has very little to do with 
the SolrJ code.  It is controlled by what's in solrconfig.xml.


Thanks,
Shawn


Re: SolrJ: SolrInputDocument.addField()

2021-02-14 Thread Shawn Heisey

On 2/14/2021 9:00 AM, Steven White wrote:

It looks like I'm misusing SolrJ API  SolrInputDocument.addField() thus I
need clarification.

Here is an example of what I have in my code:

 SolrInputDocument doc = new SolrInputDocument();
 doc.addField("MyFieldOne", "some data");
 doc.addField("MyFieldTwo", 100);

The above code is creating 2 fields for me (if they don't exist already)
and then indexing the data to those fields.  The data is "some data" and
the number 100  However, when the field is created, it is not using the
field type that I custom created in my schema.  My question is, how do I
tell addField() to use my custom field type?


There is no way in SolrJ code to control which fieldType is used.  That 
is controlled solely by the server-side schema definition.


How do you know that Solr is not using the correct fieldType?  If you 
are looking at the documents returned by a search and aren't seeing the 
transformations described in the schema, you're looking in the wrong place.


Solr search results always return what was originally sent in for 
indexing.  Only Update Processors (defined in solrconfig.xml, not the 
schema) can affect what gets returned in results; fieldType definitions 
NEVER affect data returned in search results.


Thanks,
Shawn


Re: CVE-2019-17558 on SOLR 6.1

2021-02-12 Thread Shawn Heisey

On 2/12/2021 11:17 AM, Rick Tham wrote:

I am trying to figure out if the following is an additional valid
mitigation step for CVE-2019-17558 on SOLR 6.1. None of our solrconfig.xml
contains the lib references to the velocity jar files as follows:


<lib dir="${solr.install.dir:../../../..}/contrib/velocity/lib" regex=".*\.jar" />

It doesn't appear that you can add these jar references using the config
API. Without these references, you are not able to flip the
params.resource.loader.enabled to true using the config API. If you are not
able to flip the flag and none of your cores have these lib references then
is the risk present?


In order to be vulnerable to that problem, all of the following things 
must be true.  If any of them is NOT true, then this vulnerability does 
not apply:


* The velocity jars must be loaded.  A common way for this is the <lib> 
configuration you mentioned, but there are other ways.  Those other ways 
require human intervention to move the actual files.

* Your config must *use* the jars, by containing a velocity config.
* The params resource loader must be enabled in the velocity config. 
Note that the "velocity.params.resource.loader.enabled" flag only 
applies if the velocity config in solrconfig.xml *references* that flag.
* Your Solr server must be reachable to unauthorized parties who would 
exploit the vulnerability.


I have no idea whether any of this config can be changed remotely.  I 
have never used the config API.  But if your Solr server is not 
reachable to bad guys, it won't matter.


Simply controlling who can reach the Solr server is the easiest way to 
avoid being vulnerable to anything.  Although there are security 
mechanisms available, Solr is not designed to be publicly reachable.  It 
should be heavily firewalled.


The velocity response writer usually requires end users to have direct 
access to the Solr server for it to be worth something.  We STRONGLY 
discourage leaving Solr exposed.


Thanks,
Shawn


Re: Extremely Small Segments

2021-02-12 Thread Shawn Heisey

On 2/12/2021 4:30 AM, yasoobhaider wrote:

Note: Nothing out of the ordinary in logs. Only /update request logs.


Can you share your logs?  The best option would be to include everything 
in the logs directory.  Hopefully you have not altered the default 
logging config, which sets the detail to INFO.


Can you also include everything that's in the ZK configuration path?

If you need to remove sensitive information, please do so in a 
consistent way, and replace it with something else rather than just 
deleting it.


Note that this mailing list has a tendency to eat attachments.  So 
you're going to need to use a file-sharing site and give us one or more 
URLs.  Dropbox is a good choice, but not the only one.


Thanks,
Shawn


Re: NRT - Indexing

2021-02-01 Thread Shawn Heisey

On 2/1/2021 12:08 AM, haris.k...@vnc.biz wrote:
Hope you're doing good. I am trying to configure NRT - Indexing in my 
project. For this reason, I have configured *autoSoftCommit* to execute 
every second and *autoCommit* to execute every 5 minutes. Everything 
works as expected on the dev and test server. But on the production 
server, there are more than 6 million documents indexed in Solr, so 
whenever a new document is indexed it takes 2-3 minutes before appearing 
in the search despite the setting I have described above. Since the 
target is to develop a real-time system, this delay of 2-3 minutes is 
not acceptable. How can I reduce this time window?


Setting autoSoftCommit with a max time of 1000 (one second) does not 
mean you will see changes within one second.  It means that one second 
after indexing begins, Solr will start a soft commit operation.  That 
commit operation must fully complete and the new searcher must come 
online before changes are visible.  Those steps may take much longer 
than one second, which seems to be happening on your system.


With the information available, I cannot tell you why your commits are 
taking so long.  One of the most common reasons for poor Solr 
performance is a lack of free memory on the system for caching purposes.


Thanks,
Shawn


Re: Solr 8.7.0 memory leak?

2021-01-28 Thread Shawn Heisey

On 1/27/2021 9:00 PM, Luke wrote:

it's killed by an OOME exception. The problem is that I just created empty
collections and the Solr JVM keeps growing and never goes down. There is no
data at all. At the beginning, I set Xmx=6G, then 10G, now 15G; Solr 8.7
always uses all of it and will be killed by oom.sh once JVM usage
reaches 100%.


We are stuck until we know what resource is running out and causing the 
OOME.  To know that we will need to see the actual exception.


Thanks,
Shawn


Re: Solr 8.7.0 memory leak?

2021-01-27 Thread Shawn Heisey

On 1/27/2021 5:08 PM, Luke Oak wrote:

I just created a few collections and no data; memory keeps growing but never goes 
down, until I get OOM and Solr is killed.

Any reason?


Was Solr killed by the operating system's oom killer or did the death 
start with a Java OutOfMemoryError exception?


If it was the OS, then the entire system doesn't have enough memory for 
the demands that are made on it.  The problem might be Solr, or it might 
be something else.  You will need to either reduce the amount of memory 
used or increase the memory in the system.


If it was a Java OOME exception that led to Solr being killed, then some 
resource (could be heap memory, but isn't always) will be too small and 
will need to be increased.  To figure out what resource, you need to see 
the exception text.  Such exceptions are not always recorded -- it may 
occur in a section of code that has no logging.


Thanks,
Shawn


Re: Cannot start solr because oom

2021-01-23 Thread Shawn Heisey

On 1/23/2021 6:41 PM, Luke wrote:

I don't see any log in solr.log, but there is OutOfMemory error in
solr-8983-console.log file.


Do you have the entire text of that exception?  Can you share it?  That 
is the real information that I am after here.


I only asked how Solr was installed and started so I would be able to 
help you figure out where the log files are, if that became necessary. 
It seems that you know where they are.


Thanks,
Shawn


Re: Cannot start solr because oom

2021-01-23 Thread Shawn Heisey

On 1/23/2021 6:29 AM, Luke Oak wrote:

I use default settings to start Solr. I set the heap to 6G and created 10 
collections with 1 node and 1 replica; however, there is not much data at all, 
just 100 documents.

My server is 32 G memory and 4 core cpu, ssd drive 300g

It was OK when I created 5 collections. It got OOM killed when 10 collections 
were created. Please note, there is no data in the new collections.


What version of Solr?  How is it installed and started?  What OS?  What 
Java version?


Do you have the actual OutOfMemoryError text?  If I remember correctly 
from my own reading, there are eight possible causes for OOME, and not 
all of them are related to memory.  The actual exception, which will be 
recorded in the main Solr logfile if it is even recorded (sometimes it's 
not), will contain the reason for the error.


A 6GB heap is definitely enough for a handful of empty cores.  So my 
best guess is that another resource, possibly thread count or open 
files, is running out.



Also I found that Solr doesn’t do garbage collection when the 6G is used (from 
the dashboard, JVM usage has reached 6 G)


Sorry to be pedantic, but Solr doesn't EVER do Garbage Collection.  Java 
does.  And it is completely normal for the entire Java heap to be 
consumed on occasion, no matter what's happening.  Solr does not expose 
any way to force a GC.


Thanks,
Shawn


Re: Queries Regarding Cold searcher

2021-01-22 Thread Shawn Heisey

On 1/21/2021 3:42 AM, Parshant Kumar wrote:

Does the value (true or false) of the cold searcher setting play any role during the
completion of replication on a slave server? If not, please tell me in which
process in Solr it is applied.


The setting to use a cold searcher applies whenever a new searcher is 
opened.  It determines what happens while the new searcher is warming. 
If it's false, queries will be answered by the old searcher until all of 
the warming work is complete on the new searcher, at which time Solr 
will switch to the new one and work on dismantling the old one.  If it's 
true, then the new searcher will be used immediately, before warming is 
finished.


In order for Solr to do queries on an index that has changed for any 
reason, including replication, a new searcher is required.  If Solr 
doesn't open a new searcher, it will still be querying the index that 
existed before the change.


Thanks,
Shawn


Re: Effects of shards and replicas on performance

2021-01-19 Thread Shawn Heisey

On 1/19/2021 4:19 PM, ufuk yılmaz wrote:

Let's say I had only 1 replica for each collection but I split it into 6 shards, 1 
for every node.
Or I had 2 shards (1 shard is too big for a single node I think) but I had 3 
replicas, 3x2=6, 1 on every node.

How would it affect the performance?


It all depends on how many queries you're expecting to occur at the same 
time -- your query rate.


More replicas will generally make your system capable of handling a 
higher query load than fewer replicas, as long as the replicas are 
running on different physical hardware.


With a low query load, more shards CAN make things faster because it 
throws more system capacity at the problem -- assuming the different 
shards are on different physical hardware.  But as the number of queries 
increases, the systems get busier, and that advantage disappears.


Don't assign your heap size as a ratio of total memory size.  Your heap 
should be as big as it needs to be, and no bigger, leaving as much 
memory as possible for disk caching.  I can't say for sure, but with 20 
indexes the size you're talking about, 50 GB of memory per node is 
probably nowhere near enough.


Thanks,
Shawn


Re: Solrcloud - Reads on specific nodes

2021-01-18 Thread Shawn Heisey

On 1/17/2021 11:12 PM, Doss wrote:

Thanks Michael Gibney , Shawn Heisey for pointing in the right direction.

1. Will there be any performance degradation if we use shards.preference?
2. How about leader election if we decided to use NRT + PULL ? TLOG has the
advantage of participating in leader election correct?
3. NRT + TLOG is there any parameter which can reduce the TLOG replication
time


I have no idea what kind of performance degradation you might expect 
from using shards.preference.  I wouldn't expect any, but I do not know 
enough details about your environment to comment.


A TLOG replica that is elected leader functions exactly like NRT.  TLOG 
replicas that are not leaders replicate the transaction log, which makes 
them capable of becoming leader.


PULL and TLOG non-leaders do not index.  They use the old replication 
feature, copying exact segment data from the leader.


If you want SolrCloud to emulate the old master/slave paradigm, my 
recommendation would be to create two TLOG replicas per shard and make 
the rest PULL.  Then use shards.preference on queries to prefer PULL 
replicas.  The PULL replicas can never become leader, so you can be sure 
that they will never do any indexing.
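
A rough SolrJ sketch of what that looks like on the query side (the ZK 
address and collection name here are placeholders):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PullPreferenceExample {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          SolrQuery query = new SolrQuery("*:*");
          // Ask that this query be served by PULL replicas when available.
          query.set("shards.preference", "replica.type:PULL");
          QueryResponse rsp = client.query("mycollection", query);
          System.out.println(rsp.getResults().getNumFound());
        }
      }
    }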


Thanks,
Shawn


Re: Solrcloud - Reads on specific nodes

2021-01-15 Thread Shawn Heisey

On 1/15/2021 7:56 AM, Doss wrote:

1. Suppose we have 10 node SOLR Cloud setup, is it possible to dedicate 4
nodes for writes and 6 nodes for selects?

2. We have a SOLR cloud setup for our customer facing applications, and we
would like to have two more SOLR nodes for some backend jobs. Is it good
idea to form these nodes as slave nodes and making one node in the cloud as
Master?


SolrCloud does not have masters or slaves.

One thing you could do is set the replica types on four of those nodes 
to one type, and on the other nodes, use a different replica type.  For 
instance, the four nodes could be TLOG and the six nodes could be PULL.


Then you can use the shards.preference parameter on your queries to only 
query the type of replica that you want.


https://lucene.apache.org/solr/guide/8_7/distributed-requests.html#shards-preference-parameter

Thanks,
Shawn


Re: Replicaton SolrCloud

2021-01-15 Thread Shawn Heisey

On 1/15/2021 7:20 AM, Jae Joo wrote:

Is non CDCR replication in SolrCloud still working in Solr 9.0?


Solr 9 doesn't exist yet.  Probably won't for at least a few months. 
The latest version is 8.7.0.


Solr's replication feature is used by SolrCloud internally for recovery 
operations, but the user doesn't configure it at all.  SolrCloud uses 
its own mechanisms to replicate indexes.  I doubt that those mechanisms 
will disappear when version 9.0 comes out.


Thanks,
Shawn


Re: Getting error "Bad Message 414 reason: URI Too Long"

2021-01-15 Thread Shawn Heisey

On 1/14/2021 2:31 AM, Abhay Kumar wrote:
I am trying to post the below query to Solr but am getting the error “Bad 
Message 414 reason: URI Too Long”.


I am sending query using SolrNet library. Please suggest how to resolve 
this issue.


*Query :* 
http://localhost:8983/solr/documents/select?q=%22Geisteswissenschaften


If your query is a POST request rather than a GET, then you won't get 
that error.  And if the request is identical to the REALLY long URL that 
you included (which seemed to be incomplete), then it's definitely not a 
POST.  If it were, everything after the ? would be in the request body, 
not on the URL itself.


There is a note on the SolrNET FAQ about this.

https://github.com/SolrNet/SolrNet/blob/master/Documentation/FAQ.md#im-getting-a-uri-too-long-error

If you want more info on that, you'll need to ask SolrNET.  It's a 
completely different project.


Thanks,
Shawn


Re: Apache Solr in High Availability Primary and Secondary node.

2021-01-11 Thread Shawn Heisey

On 1/11/2021 4:02 AM, Kaushal Shriyan wrote:

Thanks, David for the quick response. Is there any use-case to use HAProxy
or Nginx webserver or any other application to load balance both Solr
primary and secondary nodes?


I had a setup with haproxy and two copies of a Solr index.

Four of the nodes with Solr on them were running a pacemaker setup for 
high availability on the haproxy load balancer.  If any single system 
were to die, everything kept on working.


My homegrown indexing system kept both copies of the index up to date 
independently -- no replication.   I had to abandon replication because 
version 3.x and later cannot replicate from 1.x.  I kept that paradigm 
even after I was running a version with compatible replication because it 
was very flexible.


I really like haproxy, but going into further detail would be off topic 
for this list.


Thanks,
Shawn


Re: maxBooleanClauses change in solr.xml not reflecting in solr 8.4.1

2021-01-05 Thread Shawn Heisey

On 1/5/2021 8:26 AM, dinesh naik wrote:

Hi all,
I want to update the maxBooleanClauses to 2048 (from default value 1024).
Below are the steps tried:
1. updated solrconfig.xml :
<maxBooleanClauses>${solr.max.booleanClauses:2048}</maxBooleanClauses>


You need to update EVERY solrconfig.xml that the JVM is loading for this 
to actually work.


maxBooleanClauses is an odd duck.  At the Lucene level, where this 
matters, it is a global (JVM-wide) variable.  So whenever Solr sets this 
value, it applies to ALL of the Lucene indexes that are being accessed 
by that JVM.


When you have multiple Solr cores, the last core that was loaded will 
set the max clauses value for ALL cores.  If any of your solrconfig.xml 
files don't have that config, then it will be set to the default of 1024 
when that core is loaded or reloaded.  Leaving the config out is not a 
solution.


So if any of your configs don't have the setting or set it to something 
lower than you need, you run the risk of having the max value 
incorrectly set across the board.


Here are the ways that I think this could be fixed:

1) Make the value per-index in Lucene, (or maybe even per-query) instead 
of global.
2) Have Solr only change the global Lucene value if the config is 
*higher* than the current global value.
3) Eliminate the limit entirely.  Remove the config option from Solr and 
have Solr hard-set it to the maximum value.

4) Move the maxBooleanClauses config to solr.xml instead of solrconfig.xml

I think that option 1 is the best way to do it, but this problem has 
been around for many years, so it's probably not easy to do.  I don't 
think it's going to happen.  There are a number of existing issues in 
the Solr bug tracker for changing how the limit is configured.



2. updated  solr.xml :
<maxBooleanClauses>${solr.max.booleanClauses:2048}</maxBooleanClauses>


I don't think it's currently possible to set the value with solr.xml.

Thanks,
Shawn


Re: Data Import Blocker - Solr

2020-12-19 Thread Shawn Heisey

On 12/18/2020 12:03 AM, basel altameme wrote:

While trying to import & index data from a MySQL DB custom view I am facing the 
error below:
Data Config problem: The value of attribute "query" associated with an element type 
"entity" must not contain the '<' character.
Please note that in my SQL statements I am using '<>' as an operator for 
comparing only.
sample line:
         when (`v`.`live_type_id` <> 1) then 100


These configurations are written in XML.  So you must encode the 
character using XML-friendly notation.


Instead of <> it should say &lt;&gt; to be correct.  Or you could use != 
which is also correct SQL notation for "not equal to".


Thanks,
Shawn


Re: 8.6.1 configuring ssl on centos 7

2020-12-13 Thread Shawn Heisey

On 12/13/2020 7:21 AM, Bogdan C. wrote:

Solr is installed and working on http (8983). I (think I) have the keystore 
configured properly and solr.in.sh modified for the SOLR_SSL_* config settings.
Not sure how to modify the service startup to listen on 8984 for SSL. The Solr 
documentation says to start it using bin/solr -p 8984, but it's configured to start 
as a service so not sure that applies here... I modified solr.in.sh with 
SOLR_PORT=8984 but it still starts up on 8983.


If you installed Solr as a service, then you'll need to edit 
/etc/default/solr.in.sh ... the one that's in the bin directory is ignored.


If that's the one you did edit, then I do not know why it isn't working 
... unless maybe /etc/init.d/solr has also been modified.  If that has 
happened, you would need to consult with whoever modified it.


Thanks,
Shawn



Re: DIH and UUIDProcessorFactory

2020-12-12 Thread Shawn Heisey

On 12/12/2020 2:30 PM, Dmitri Maziuk wrote:
Right, ```Every update request received by Solr is run through a chain 
of plugins known as Update Request Processors, or URPs.```


The part I'm missing is whether DIH's 'name="/dataimport"' counts as an "Update Request", my reading is it 
doesn't and URP chain applies only to '

If you define an update chain as default, then it will be used for all 
updates made where a different chain is not specifically requested.


I have used this personally to have my custom update chain apply even 
when the indexing comes from DIH.  I know for sure that this works on 
4.x and 5.x versions; it should work on newer versions as well.


Thanks,
Shawn


Re: Copyfields, will there be any difference between source and dest if they are switched?

2020-12-12 Thread Shawn Heisey

On 12/11/2020 2:38 PM, ufuk yılmaz wrote:

<copyField source="place.name" dest="place.name_orig"/>

My question is, will there be any difference on the resulting indexed documents 
if I switched source and dest fields in copyField directive? My understanding 
is copyField operates on raw data arriving at Solr as is, and field 
declarations themselves decide what to do with it, so there shouldn’t be any 
difference, but I’m currently investigating an issue which,


Presumably your indexing includes place.name but does not contain 
place.name_orig in the fields that are sent to Solr for indexing.  If 
that's the case, then reversing the fields in the copyField will leave 
place.name_orig empty.


If the indexed data does contain both fields, then the target field 
would contain the data twice, and if the target field is not 
multiValued, then indexing will fail.



- The same data is indexed in two different collections; one uses a copyField 
directive like the one above
- The other one doesn't use copyField, but the same value is sent in both the place.name and 
place.name_orig fields during indexing
But I'm seeing some slight differences in the resulting documents, mainly in casing 
between i and İ.


Analysis does not affect document data in the results.  The data you see 
in results will be exactly what was originally sent.  The only way Solr 
can change stored data is through the use of Update Processors defined 
in solrconfig.xml ... analysis defined in the schema will not affect 
document data in search results.


Thanks,
Shawn


Re: DIH and UUIDProcessorFactory

2020-12-12 Thread Shawn Heisey

On 12/12/2020 12:54 PM, Dmitri Maziuk wrote:
is there an easy way to use the stock UUID generator with DIH? We have a 
hand-written one-liner class we use as DIH entity transformer but I 
wonder if there's a way to use the built-in UUID generator class instead.


 From the TFM it looks like there isn't, is that correct?


The only way I know of to use an update processor chain with DIH is to 
set 'default="true"' when defining the chain.


I did manage to find an example with the default attribute, in javadocs:

https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/update/processor/UpdateRequestProcessorChain.html

If there is another way to specify the chain to use with DIH, I do not 
know about it.  I am always learning new things, something might exist 
that I have never seen.


Thanks,
Shawn


Re: Solrj supporting term vector component ?

2020-12-03 Thread Shawn Heisey

On 12/3/2020 10:20 AM, Deepu wrote:

I am planning to use Term vector component for one of the use cases, as per
below solr documentation link solrj not supporting Term Vector Component,
do you have any other suggestions to use TVC in java application?

https://lucene.apache.org/solr/guide/8_4/the-term-vector-component.html#solrj-and-the-term-vector-component


SolrJ will support just about any query you might care to send, you just 
have to give it all the required parameters when building the request. 
All the results will be available, though you'll almost certainly have 
to provide code yourself that rips apart the NamedList into usable info.


What is being said in the documentation is that there are not any 
special objects or methods for doing term vector queries.  It's not 
saying that it can't be done.
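
As a rough sketch of what that looks like in SolrJ (the /tvrh handler name, 
field, and collection here are assumptions based on the sample configs -- 
adjust them to your own setup):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.NamedList;

    public class TermVectorExample {
      public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycollection").build()) {
          SolrQuery query = new SolrQuery("text:dog");
          // A handler that includes the TermVectorComponent, like the /tvrh
          // handler in the sample configs.
          query.setRequestHandler("/tvrh");
          query.set("tv", true);
          query.set("tv.tf", true);
          query.set("tv.df", true);
          QueryResponse rsp = client.query(query);
          // The term vector data comes back as a generic NamedList that the
          // application has to walk itself.
          NamedList<?> termVectors =
              (NamedList<?>) rsp.getResponse().get("termVectors");
          System.out.println(termVectors);
        }
      }
    }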


Thanks,
Shawn


Re: Facet to part of search results

2020-12-03 Thread Shawn Heisey

On 12/3/2020 9:55 AM, Jae Joo wrote:

Is there any way to apply facet to the partial search result?
For ex, we have 10m return by "dog" and like to apply facet to first 10K.
Possible?


The point of facets is to provide accurate numbers.

What would it mean to only apply to the first 10K?  If there are 10 
million documents in the query results that contain "dog" then the facet 
should say 10 million, not 10K.  I do not understand what you're trying 
to do.


Shawn


Re: Solr 8.4.1, NOT NULL query not working on plong & pint type fields (fieldname:* )

2020-11-26 Thread Shawn Heisey

On 11/25/2020 10:42 AM, Deepu wrote:

We are in the process of migrating from Solr 5 to Solr 8, during testing
identified that "Not null" queries on plong & pint field types are not
giving any results, it is working fine with solr 5.4 version.

could you please let me know if you have suggestions on this issue?


Here's a couple of facts:

1) Points-based fields have certain limitations that make explicit value 
lookups very slow, and make them unsuitable for use on uniqueKey fields. 
 Something about the field not having a "term" available.


2) A query of the type "fieldname:*" is a wildcard query.  These tend to 
be slow and inefficient, when they work.


It might be that the limitations of point-based fields make it so that 
wildcard queries don't work.  I have no idea here.  Points-based fields 
did not exist in Solr 5.4, chances are that you were using a Trie-based 
field at that time.  A wildcard query would have worked, but it would 
have been slow.


I may have a solution even though I am pretty clueless about what's 
going on.  When you are looking to do a NOT NULL sort of query, you 
should do it as a range query rather than a wildcard query.  This means 
the following syntax.   Note that it is case sensitive -- the "TO" must 
be uppercase:


fieldname:[* TO *]

This is how all NOT NULL queries should be constructed, regardless of 
the type of field.  Range queries tend to be very efficient.
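
In SolrJ that would look something like this (the field and collection names 
are just placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class NotNullQueryExample {
      public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycollection").build()) {
          SolrQuery query = new SolrQuery("*:*");
          // Matches every document where "fieldname" has at least one value.
          query.addFilterQuery("fieldname:[* TO *]");
          QueryResponse rsp = client.query(query);
          System.out.println(rsp.getResults().getNumFound());
        }
      }
    }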


Thanks,
Shawn


Re: Increase in Response time when solr fields are merged

2020-11-19 Thread Shawn Heisey

On 11/19/2020 2:12 AM, Ajay Sharma wrote:

Earlier we were searching in 6 fields, i.e. qf was applied on 6 fields.

We merged all these 6 fields into one field X, and now while searching we
use this single field X.

We are able to see a decrease in index size, but the response time has
increased.


I can't say for sure, but I would imagine that when querying multiple 
fields using edismax, Solr can manage to do some of that work in 
parallel.  But with only one field, any parallel processing is lost.  If 
I have the right idea, that could explain what you are seeing.


Somebody with far more intimate knowledge of edismax will need to 
confirm or refute my thoughts.


Thanks,
Shawn


Re: How to reflect changes of solrconfig.xml to all the cores without causing any conflict

2020-11-09 Thread Shawn Heisey

On 11/9/2020 5:44 AM, raj.yadav wrote:

*Question:*
Since the reload is not done, none of the replicas (including the leader) will have
the updated solrconfig. And if we restart a replica and it tries to sync up with
the leader, will it reflect the latest changes to solrconfig, or will it be the
same as the leader?





Solr Collection detail:
A single collection with 6 shards. Each VM hosts a single replica.
Collection size: 60 GB (each shard is 10 GB)
Average doc size: 1.0 KB


If you restart Solr, it is effectively the same thing as reloading all 
cores on that Solr instance.


Your description (use of the terms "collection" and "shards") suggests 
that you're running in SolrCloud mode.  If you are, then modifying 
solrconfig.xml on the disk will change nothing.  You need to modify the 
solrconfig.xml that lives in ZooKeeper, or re-upload the changes to ZK. 
 Is that what you're doing?  After that, to make any changes effective, 
you have to reload the collection or restart the correct Solr instances.
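
For reference, the reload can also be triggered from SolrJ; a minimal sketch 
(the ZK address and collection name are placeholders):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class ReloadCollectionExample {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          // Every replica of every shard reloads its core with the config
          // currently stored in ZooKeeper.
          CollectionAdminRequest.reloadCollection("mycollection").process(client);
        }
      }
    }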


I cannot tell you exactly what will happen as far as SolrCloud index 
synchronization, because I know nothing about your setup.  If the 
follower replica type is TLOG or PULL, then the index will be an exact 
copy of the leader's index.  With NRT, all replicas will independently 
index the data.


Thanks,
Shawn


Re: Solr 8.1.1 installation in Azure App service

2020-11-04 Thread Shawn Heisey

On 11/3/2020 11:49 PM, Narayanan, Bhagyasree wrote:

Steps we followed for creating Solr App service:

 1. Created a blank sitecore 9.3 solution from Azure market place and
created a Web app for Solr.
 2. Unzipped the Solr 8.1.1 package and copied all the contents to
wwwroot folder of the Web app created for Solr using WinSCP/FTP.
 3. Created a new Solr core by creating a new folder {index folder} and
copied 'conf' from the "/site/wwwroot/server/solr/configsets/_default".
 4. Created a core.properties file with numShards=2 & name={index folder}


Can you give us the precise locations of all core.properties files that 
you have and ALL of the contents of those files?  There should not be 
any sensitive information in them -- no passwords or anything like that.


It would also be helpful to see the entire solr.log file, taken shortly 
after Solr starts.  The error will have much more detail than you shared 
in your previous message.


This mailing list eats attachments.  So for the logfile, you'll need to 
post the file to a filesharing service and give us a URL.  Dropbox is an 
example of this.  For the core.properties files, which are not very 
long, it will probably be best if you paste the entire contents into 
your email reply.  If you attach files to your email, we won't be able 
to see them.


Thanks,
Shawn


Re: Solr migration related issues.

2020-11-04 Thread Shawn Heisey

On 11/4/2020 9:32 PM, Modassar Ather wrote:

Another thing: how can I control the core naming? I want the core name to
be *mycore* instead of *mycore**_shard1_replica_n1*/*mycore*
*_shard2_replica_n2*.
I tried setting it using property.name=*mycore* but it did not work.
What can I do to achieve this? I am not able to find any config option.


Why would you need to do this or even want to?  It sounds to me like an XY 
problem.


http://xyproblem.info/


I understand the core.properties file is required for core discovery but
when this file is present under a subdirectory of SOLR_HOME I see it not
getting loaded and not available in Solr dashboard.


You should not be trying to manipulate core.properties files yourself. 
This is especially discouraged when Solr is running in cloud mode.


When you're in cloud mode, the collection information in zookeeper will 
always be consulted during core discovery.  If the found core is NOT 
described in zookeeper, it will not be loaded.  And in any recent Solr 
version when running in cloud mode, a core that is not referenced in ZK 
will be entirely deleted.


Thanks,
Shawn


Re: Commits (with openSearcher = true) are too slow in solr 8

2020-11-03 Thread Shawn Heisey

On 11/3/2020 11:46 PM, raj.yadav wrote:

We have two parallel systems: one is Solr 8.5.2 and the other one is Solr 5.4.
In Solr 5.4 the commit time with openSearcher=true is 10 to 12 minutes, while in
Solr 8 it's around 25 minutes.


Commits on a properly configured and sized system should take a few 
seconds, not minutes.  10 to 12 minutes for a commit is an enormous red 
flag.



This is our current caching policy of solr_8:

<filterCache size="32768"
             autowarmCount="6000"/>


This is probably the culprit.  Do you know how many entries the 
filterCache actually ends up with?  What you've said with this config is 
"every time I open a new searcher, I'm going to execute up to 6000 
queries against the new index."  If each query takes one second, running 
6000 of them is going to take 100 minutes.  I have seen these queries 
take a lot longer than one second.


Also, each entry in the filterCache can be enormous, depending on the 
number of docs in the index.  Let's say that you have five million 
documents in your core.  With five million documents, each entry in the 
filterCache is going to be 625000 bytes.  That means you need 20GB of 
heap memory for a full filterCache of 32768 entries -- 20GB of memory 
above and beyond everything else that Solr requires.  Your message 
doesn't say how many documents you have, it only says the index is 11GB. 
 From that, it is not possible for me to figure out how many documents 
you have.



While debugging this we came across this page.
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Slowcommits


I wrote that wiki page.


Here one of the reasons for slow commits is mentioned as:
"Heap size issues. Problems from the heap being too big will tend to be
infrequent, while problems from the heap being too small will tend to happen
consistently."

Can anyone please help me understand the above point?


If your heap is a lot bigger than it needs to be, then what you'll see 
is slow garbage collections, but it won't happen very often.  If the 
heap is too small, then there will be garbage collections that happen 
REALLY often, leaving few system resources for actually running the 
program.  This applies to ANY Java program, not just Solr.



System config:
disk size: 250 GB
cpu: (8 vcpus, 64 GiB memory)
Index size: 11 GB
JVM heap size: 30 GB


That heap seems to be a lot larger than it needs to be.  I have run 
systems with over 100GB of index, with tens of millions of documents, on 
an 8GB heap.  My filterCache on each core had a max size of 64, with an 
autowarmCount of four ... and commits STILL would take 10 to 15 seconds, 
which I consider to be very slow.  Most of that time was spent executing 
those four queries in order to autowarm the filterCache.


What I would recommend you start with is reducing the size of the 
filterCache.  Try a size of 128 and an autowarmCount of 8, see what you 
get for a hit rate on the cache.  Adjust from there as necessary.  And I 
would reduce the heap size for Solr as well -- your heap requirements 
should drop dramatically with a reduced filterCache.


Thanks,
Shawn


Re: filterCache ramBytesUsed monitoring statistics go negative

2020-11-02 Thread Shawn Heisey

On 11/2/2020 4:27 AM, Dawn wrote:

filterCache ramBytesUsed monitoring statistics go negative.
Is there a special meaning, or is there a statistical problem?
When presenting the list, can it be sorted by key? Solr 7 is like this, easy to 
view.


When problems like this surface, it's usually because the code uses an 
"int" variable somewhere instead of a "long".  All numeric variables in 
Java are signed, and an "int" can only go up to a little over 2 billion 
before the numbers start going negative.


The master code branch looks like it's fine.  What is the exact version 
of Solr you're using?  With that information, I can check the relevant code.


Maybe simply upgrading to a much newer version would take care of this 
for you.


Thanks,
Shawn


Re: httpclient gives error

2020-10-31 Thread Shawn Heisey

On 10/31/2020 12:54 PM, Raivo Rebane wrote:
I am trying to use SolrJ in a web application with Eclipse Tomcat, but I get the 
following errors





Tomcat lib contains following http jars:

-rw-rw-rw- 1 hydra hydra 326724 sept   6 21:33 httpcore-4.4.4.jar
-rw-rw-rw- 1 hydra hydra 736658 sept   6 21:33 httpclient-4.5.2.jar
-rwxrwxr-x 1 hydra hydra  21544 sept   9 11:17 httpcore5-reactive-5.0.2.jar*

-rwxrwxr-x 1 hydra hydra 809733 sept   9 12:26 httpcore5-5.0.2.jar*
-rwxrwxr-x 1 hydra hydra 225863 sept   9 12:27 httpcore5-h2-5.0.2.jar*
-rwxrwxr-x 1 hydra hydra 145492 sept   9 12:30 httpcore5-testing-5.0.2.jar*
-rwxrwxr-x 1 hydra hydra 775798 okt    3 18:53 httpclient5-5.0.3.jar*
-rwxrwxr-x 1 hydra hydra  24047 okt    3 18:54 httpclient5-fluent-5.0.3.jar*

-rwxrwxr-x 1 hydra hydra 259199 okt    3 18:54 httpclient5-cache-5.0.3.jar*
-rwxrwxr-x 1 hydra hydra  15576 okt    3 18:54 httpclient5-win-5.0.3.jar*
-rwxrwxr-x 1 hydra hydra  38022 okt    3 18:55 httpclient5-testing-5.0.3.jar*

-rw-rw-r-- 1 hydra hydra  37068 okt   31 19:50 httpmime-4.3.jar


Version 5 of the apache httpclient is not used by any SolrJ version. 
Newer versions of SolrJ utilize the Jetty httpclient for http/2 support, 
not the apache httpclient.  The older client, using apache httpclient 
4.x, is still present in newer SolrJ versions.  Your message did not 
indicate which version of SolrJ you are using.  One of your previous 
emails to the list mentions version 8.6.3 of SolrJ ... the httpclient 
4.x jars that you have are different versions than the ones that version of SolrJ 
asks for.


Looking over previous emails that you have sent to the mailing list, I 
wonder why you are adding jars manually instead of letting Maven handle 
all of the dependencies.  A common problem when dependency resolution is 
not automatic is that the classpath is missing one or more of the jars 
that exist on the filesystem.


I don't think this problem is directly caused by SolrJ.  It could be 
that the httpclient 4.x jars you have are not new enough, or there might 
be some unknown interaction between the 4.x jars and the 5.x jars.  Or 
maybe your classpath is incomplete -- doesn't include something in your 
file listing above.


Problems like this can also be caused by having multiple copies of the 
same or similar versions of jars on the classpath.  That kind of issue 
could be very hard to track down.  It can easily be caused by utilizing 
a mixture of automatic and manual dependencies.  Choose either all 
automatic (maven, ivy, gradle, etc) or all manual.
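
If you go the all-automatic route, the only dependency you should need
to declare for the client is solr-solrj itself -- Maven will then pull
in the matching httpclient/httpcore jars on its own.  A sketch,
assuming the 8.6.3 version mentioned in your earlier email:

   <dependency>
     <groupId>org.apache.solr</groupId>
     <artifactId>solr-solrj</artifactId>
     <version>8.6.3</version>
   </dependency>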


Thanks,
Shawn


Re: TieredMergePolicyFactory question

2020-10-26 Thread Shawn Heisey

On 10/25/2020 11:22 PM, Moulay Hicham wrote:

I am wondering about 3 other things:

1 - You mentioned that I need free disk space. Just to make sure that we
are talking about disc space here. RAM can still remain at the same size?
My current RAM size is  Index size < RAM < 1.5 Index size


You must always have enough disk space available for your indexes to 
double in size.  We recommend having enough disk space for your indexes 
to *triple* in size, because there is a real-world scenario that will 
require that much disk space.



2 - When the merge is happening, it happens in disc and when it's
completed, then the data is sync'ed with RAM. I am just guessing here ;-).
I couldn't find a good explanation online about this.


If you have enough free memory, then the OS will make sure that the data 
is available in RAM.  All modern operating systems do this 
automatically.  Note that I am talking about memory that is not 
allocated to programs.  Any memory assigned to the Solr heap (or any 
other program) will NOT be available for caching index data.


If you want ideal performance in typical situations, you must have as 
much free memory as the space your indexes take up on disk.  For ideal 
performance in ALL situations, you'll want enough free memory to be able 
to hold both the original and optimized copies of your index data at the 
same time.  We have seen that good performance can be achieved without 
going to this extreme, but if you have little free memory, Solr 
performance will be terrible.


I wrote a wiki page that covers this in some detail:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems


3 - Also I am wondering what recommendation you have for continuously
purging deleted documents. optimize? expungeDeletes? Natural Merge?
Here are more details about the need to purge documents.


The only way to guarantee that all deleted docs are purged is to 
optimize.   You could use the expungeDeletes action ... but this might 
not get rid of all the deleted documents, and depending on how those 
documents are distributed across the whole index, expungeDeletes might 
not do anything at all.  These operations are expensive (require a lot 
of time and system resources) and will temporarily increase the size of 
your index, up to double the starting size.
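
For reference, both operations can be sent to the update handler.  A
sketch with curl -- the collection name is a placeholder:

   curl http://localhost:8983/solr/yourcollection/update \
     -H 'Content-Type: text/xml' \
     --data-binary '<commit expungeDeletes="true"/>'

An optimize is sent the same way, with '<optimize/>' as the body.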


Before you go down the road of optimizing regularly, you should 
determine whether freeing up the disk space for deleted documents 
actually makes a substantial difference in performance.  In very old 
Solr versions, optimizing the index did produce major performance 
gains... but current versions have much better performance on indexes 
that have deleted documents.  Because performance is typically 
drastically reduced while the optimize is happening, the tradeoff may 
not be worthwhile.


Thanks,
Shawn


Re: SolrDocument difference between String and text_general

2020-10-20 Thread Shawn Heisey

On 10/20/2020 1:53 AM, Cox, Owen wrote:

I've now written a Java Spring-Boot program to populate documents (snippet below) using SolrCrudRepository.  
This works when I don't index the "title" field, but when I try include title I get the following 
error "cannot change field "title" from index options=DOCS_AND_FREQS_AND_POSITIONS to 
inconsistent index options=DOCS"


I have no idea at all what a SolrCrudRepository is.  That must be part 
of Spring's repackaging of SolrJ.  It's probably not important anyway.


Some schema changes require more than a simple reindex.  For those 
changes, you must entirely delete the index directory, so that the 
Lucene index can be built from scratch.


That error message indicates that such a change has been made to the 
schema, and the existing index was NOT deleted before trying to index 
new docs.


Thanks,
Shawn


Re: SolrCloud 6.6.2 suddenly crash due to slow queries and Log4j issue

2020-10-18 Thread Shawn Heisey

On 10/18/2020 3:22 AM, Dominique Bejean wrote:

A few months ago, I reported an issue with Solr nodes crashing due to the
old generation heap growing suddenly and generating OOM. This problem
occurred again this week. I have threads dumps for each minute during the 3
minutes the problem occured. I am using fastthread.io in order to analyse
these dumps.





* The Log4j issue starts (
https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)


If the log4j bug is the root cause here, then the only way you can fix 
this is to upgrade to at least Solr 7.4.  That is the Solr version where 
we first upgraded from log4j 1.2.x to log4j2.  You cannot upgrade log4j 
in Solr 6.6.2 without changing Solr code.  The code changes required 
were extensive.  Note that I did not do anything to confirm whether the 
log4j bug is responsible here.  You seem pretty confident that this is 
the case.


Note that if you upgrade to 8.x, you will need to reindex from scratch. 
Upgrading an existing index is possible with one major version bump, but 
if your index has ever been touched by a release that's two major 
versions back, it won't work.  In 8.x, that is enforced -- 8.x will not 
even try to read an old index touched by 6.x or earlier.


In the following wiki page, I provided instructions for getting a 
screenshot of the process listing.


https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

In addition to that screenshot, I would like to know the on-disk size of 
all the cores running on the problem node, along with a document count 
from those cores.  It might be possible to work around the OOM just by 
increasing the size of the heap.  That won't do anything about problems 
with log4j.


Thanks,
Shawn


Re: converting string to solr.TextField

2020-10-17 Thread Shawn Heisey

On 10/17/2020 6:23 AM, Vinay Rajput wrote:

That said, one more time I want to come back to the same question: why
solr/lucene can not handle this when we are updating all the documents?
Let's take a couple of examples :-

*Ex 1:*
Let's say I have only 10 documents in my index and all of them are in a
single segment (Segment 1). Now, I change the schema (update field type in
this case) and reindex all of them.
This is what (according to me) should happen internally :-

1st update req : Solr will mark 1st doc as deleted and index it again
(might run the analyser chain based on config)
2nd update req : Solr will mark 2st doc as deleted and index it again
(might run the analyser chain based on config)
And so on..
based on autoSoftCommit/autoCommit configuration, all new documents will be
indexed and probably flushed to disk as part of new segment (Segment 2)





*Ex 2:*
I see that it can be an issue if we think about reindexing millions of
docs. Because in that case, merging can be triggered when indexing is half
way through, and since there are some live docs in the old segment (with
old cofig), things will blow up. Please correct me if I am wrong.


If you could guarantee a few things, you could be sure this will work. 
But it's a serious long shot.


The change in schema might be such that when Lucene tries to merge them, 
it fails because the data in the old segments is incompatible with the 
new segments.  If that happens, then you're sunk ... it won't work at all.


If the merges of old and new segments are successful, then you would 
have to optimize the index after you're done indexing to be SURE there 
were no old documents remaining.  Lucene calls that operation 
"ForceMerge".  This operation is disruptive and can take a very long time.


You would also have to be sure there was no query activity until the 
update/merge is completely done.  Which probably means that you'd want 
to work on a copy of the index in another collection.  And if you're 
going to do that, you might as well start indexing from scratch into a 
new/empty collection.  That would also allow you to continue querying 
the old collection until the new one was ready.


Thanks,
Shawn


Re: converting string to solr.TextField

2020-10-16 Thread Shawn Heisey

On 10/16/2020 2:36 PM, David Hastings wrote:

sorry, i was thinking just using the
*:*
method for clearing the index would leave them still


In theory, if you delete all documents at the Solr level, Lucene will 
delete all the segment files on the next commit, because they are empty. 
 I have not confirmed with testing whether this actually happens.


It is far safer to use a new index as Erick has said, or to delete the 
index directories completely and restart Solr ... so you KNOW the index 
has nothing in it.


Thanks,
Shawn


Re: Memory line in status output

2020-10-12 Thread Shawn Heisey

On 10/12/2020 5:11 PM, Ryan W wrote:

Thanks.  How do I activate the G1GC collector?  Do I do this by editing a
config file, or by adding a parameter when I start solr?

Oracle's docs are pointing me to a file that supposedly is at
instance-dir/OUD/config/java.properties, but I don't have that path.  I am
not sure what is meant by instance-dir here, but perhaps it means my JRE
install, which is at
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre -- but
there is no "OUD" directory in this location.


The collector is chosen by the startup options given to Java, in this 
case by the start script for Solr.  I've never heard of it being set by 
a config in the JRE.


In Solr 7, the start script defaults to the CMS collector.  We have 
updated that to G1 in the latest Solr 8.x versions, because CMS has been 
deprecated by Oracle.


Adding the following lines to the correct solr.in.sh would change the 
garbage collector to G1.  I got this from the "bin/solr" script in Solr 
8.5.1:


  GC_TUNE=('-XX:+UseG1GC' \
'-XX:+PerfDisableSharedMem' \
'-XX:+ParallelRefProcEnabled' \
'-XX:MaxGCPauseMillis=250' \
'-XX:+UseLargePages' \
'-XX:+AlwaysPreTouch')

If you used the service installer script to install Solr, then the 
correct file to add this to is usually /etc/default/solr.in.sh ... but 
if you did the install manually, it may be in the same bin directory 
that contains the solr script itself.  Your initial message says the 
solr home is /opt/solr/server/solr so I am assuming it's not running on 
Windows.


Thanks,
Shawn


Re: Help with uploading files to a core.

2020-10-11 Thread Shawn Heisey

On 10/11/2020 2:28 PM, Guilherme dos Reis Meneguello wrote:

Hello! My name is Guilherme and I'm a new user of Solr.

Basically, I'm developing a database to help a research team in my
university, but I'm having some problems uploading the files to the
database. Either using curl commands or through the admin interface, I
can't quite upload the files from my computer to Solr and set up the field
types I want that file to have while indexed. I can do that through the
document builder, but my intent was to have the research team I'm
supporting just upload them through the terminal or something like that. My
schema is all set up nicely, however the Solr's field class guessing isn't
guessing correctly.


If you're using the capability to automatically add unknown fields, then 
your schema is NOT "all set up nicely".  It's apparently not set up at all.


The "add unknown fields" update processor is not recommended for 
production, because as you have noticed, it sometimes guesses the field 
type incorrectly.  The fact that it guesses incorrectly is not a bug ... 
we can't fix it because it's not actually broken.  Getting it right in 
every case is not possible.


Your best bet will be to set up the entire schema manually in advance of 
any indexing.  To do that, you're going to have to know every field that 
the data uses, and have field definitions already in the schema.


Thanks,
Shawn


Re: Question about solr commits

2020-10-07 Thread Shawn Heisey

On 10/7/2020 4:40 PM, yaswanth kumar wrote:

I have the below in my solrconfig.xml


 
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.Data.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
  </autoSoftCommit>
</updateHandler>

Does this mean even though we are always sending data with commit=false on
update solr api, the above should do the commit every minute (60000 ms)
right?


Assuming that you have not defined the "solr.autoCommit.maxTime" and/or 
"solr.autoSoftCommit.maxTime" properties, this config has autoCommit set 
to 60 seconds without opening a searcher, and autoSoftCommit set to 5 
seconds.


So five seconds after any indexing begins, Solr will do a soft commit. 
When that commit finishes, changes to the index will be visible to 
queries.  One minute after any indexing begins, Solr will do a hard 
commit, which guarantees that data is written to disk, but it will NOT 
open a new searcher, which means that when the hard commit happens, any 
pending changes to the index will not be visible.


It's not "every five seconds" or "every 60 seconds" ... When any changes 
are made, Solr starts a timer.  When the timer expires, the commit is 
fired.  If no changes are made, no commits happen, because the timer 
isn't started.


Thanks,
Shawn


Re: Non Deterministic Results from /admin/luke

2020-10-01 Thread Shawn Heisey

On 10/1/2020 4:24 AM, Nussbaum, Ronen wrote:

We are using the Luke API in order to get all dynamic field names from our 
collection:
/solr/collection/admin/luke?wt=csv=0

This worked fine in 6.2.1 but it's non deterministic anymore (8.6.1) - looks 
like it queries a random single shard.

I've tried using /solr/collection/select?q=*:*=csv=0 but it 
behaves the same.

Can it be configured to query all shards?
Is there another way to achieve this?


The Luke handler (usually at /admin/luke) is not SolrCloud aware.  It is 
designed to operate on a single core.  So if you send the request to the 
collection and not a specific core, Solr must forward the request to a 
core in order for you to get ANY result.  The core selection will be random.


The software called Luke (which is where the Luke handler gets its name) 
operates on a Lucene index -- each Solr core is based around a Lucene 
index.  It would be a LOT of work to make the handler SolrCloud aware.


Depending on how your collection is set up, you may need to query the 
Luke handler on multiple cores in order to get a full picture of all 
fields present in the Lucene indexes.  I am not aware of any other way 
to do it.


Thanks,
Shawn


Re: Solr 7.7 - Few Questions

2020-10-01 Thread Shawn Heisey

On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:

We are using Apache Solr 7.7 on Windows platform. The data is synced to Solr 
using Solr.Net commit. The data is being synced to SOLR in batches. The 
document size is very huge (~0.5GB average) and solr indexing is taking long 
time. Total document size is ~200GB. As the solr commit is done as a part of 
API, the API calls are failing as document indexing is not completed.


A single document is five hundred megabytes?  What kind of documents do 
you have?  You can't even index something that big without tweaking 
configuration parameters that most people don't even know about. 
Assuming you can even get it working, there's no way that indexing a 
document like that is going to be fast.



   1.  What is your advise on syncing such a large volume of data to Solr KB.


What is "KB"?  I have never heard of this in relation to Solr.


   2.  Because of the search requirements, almost 8 fields are defined as Text 
fields.


I can't figure out what you are trying to say with this statement.


   3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large 
volume of data?


If just one of the documents you're sending to Solr really is five 
hundred megabytes, then 2 gigabytes would probably be just barely enough 
to index one document into an empty index ... and it would probably be 
doing garbage collection so frequently that it would make things REALLY 
slow.  I have no way to predict how much heap you will need.  That will 
require experimentation.  I can tell you that 2GB is definitely not enough.
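
When you do experiment, the heap is set in bin\solr.in.cmd on Windows.
A sketch -- the 8g figure below is only a starting point for your
testing, not a recommendation:

   REM in bin\solr.in.cmd
   set SOLR_JAVA_MEM=-Xms8g -Xmx8g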



   4.  How to set up Solr in production on Windows? Currently it's set up as a 
standalone engine and client is requested to take the backup of the drive. Is 
there any other better way to do? How to set up for the disaster recovery?


I would suggest NOT doing it on Windows.  My reasons for that come down 
to costs -- a Windows Server license isn't cheap.


That said, there's nothing wrong with running on Windows, but you're on 
your own as far as running it as a service.  We only have a service 
installer for UNIX-type systems.  Most of the testing for that is done 
on Linux.



   5.  How to benchmark the system requirements for such a huge data


I do not know what all your needs are, so I have no way to answer this. 
You're going to know a lot more about it than any of us do.


Thanks,
Shawn


Re: Solr client in JavaScript

2020-10-01 Thread Shawn Heisey

On 10/1/2020 3:55 AM, Sunil Dash wrote:

This is my javascript code ,from where I am calling solr ,which has a
loaded nutch core (index).
My java script client ( runs on TOMCAT server) and Solr
server are on the same machine (10.21.6.100) . May be due to cross
domain references issues OR something is missing I don't know.
I expected Response from Solr server (search result) as raw JASON
object. Kindly help me fix it.Thanks in advance .


As far as I can tell, your message doesn't tell us what the problem is. 
So I'm having a hard time coming up with a useful response.


If the problem is that the response isn't JSON, then either you need to 
tell Solr that you want JSON, or run a new enough version that the 
default response format *IS* JSON.  I do not recall which version we 
changed the default from XML to JSON.


One thing you should be aware of ... if the javascript is running in the 
end user's browser, then the end user has direct access to your Solr 
install.  That is a bad idea.


Thanks,
Shawn


Re: solr performance with >1 NUMAs

2020-09-28 Thread Shawn Heisey

On 9/28/2020 12:17 PM, Wei wrote:

Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do you
see any backward compatibility issue for Solr 8 with Java 11? Can we run
Solr 8 built with JDK 8 in Java 11 JRE, or need to rebuild solr with Java
11 JDK?


I do not know of any problems running the binary release of Solr 8 
(which is most likely built with the Java 8 JDK) with a newer release 
like Java 11 or higher.


I think Sun was really burned by such problems cropping up in the days 
of Java 5 and 6, and their developers have worked really hard to make 
sure that never happens again.


If you're running Java 11, you will need to pick a different garbage 
collector if you expect the NUMA flag to function.  The most recent 
releases of Solr are defaulting to G1GC, which as previously mentioned, 
did not gain NUMA optimizations until Java 14.


It is not clear to me whether the NUMA optimizations will work with any 
collector other than Parallel until Java 14.  You would need to check 
Java documentation carefully or ask someone involved with development of 
Java.


If you do see an improvement using the NUMA flag with Java 11, please 
let us know exactly what options Solr was started with.


Thanks,
Shawn


Re: Solr storage of fields <-> indexed data

2020-09-28 Thread Shawn Heisey

On 9/28/2020 8:56 AM, Edward Turner wrote:

By removing the copyfields, we've found that our index sizes have reduced
by ~40% in some cases, which is great! We're just curious now as to exactly
how this can be ...


That's not surprising.


My question is, given the following two schemas, if we index some data to
the "description" field, will the index for schema1 be twice as large as
the index of schema2? (I guess this relates to how, internally, Solr stores
field + index data)

Old way -- schema1:
===





If the only field in the indexed documents is "description", the index 
built with schema2 will be half the size of the index built with 
schema1.  Both fields referenced by "copyField" are the same type and 
have the same settings, so they would contain exactly the same data at 
the Lucene level.


Having the same type for a source and destination field is normally only 
useful if multiple sources are copied to a destination, which requires 
multiValued="true" on the destination -- NOT the case in your example.


There is one other use case for a copyField -- using the same data 
differently, with different type values.  For example you might have one 
type for faceting and one for searching.
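
For illustration, those two copyField patterns look something like the
following in schema.xml -- the field and type names here are invented:

   <!-- several sources copied into one catch-all search field -->
   <field name="all_text" type="text_general" indexed="true"
          stored="false" multiValued="true"/>
   <copyField source="title" dest="all_text"/>
   <copyField source="description" dest="all_text"/>

   <!-- same data analyzed two ways: one field to search, one to facet -->
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="category_facet" type="string" indexed="true"
          stored="false" docValues="true"/>
   <copyField source="category" dest="category_facet"/>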


Thanks,
Shawn


Re: solr performance with >1 NUMAs

2020-09-26 Thread Shawn Heisey

On 9/26/2020 1:39 PM, Wei wrote:

Thanks Shawn! Currently we are still using the CMS collector for solr with
Java 8. When last evaluated with Solr 7, CMS performs better than G1 for
our case. When using G1, is it better to upgrade from Java 8 to Java 11?
 From https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
seems Java 14 is not officially supported for Solr 8.


It has been a while since I was working with Solr every day, and when I 
was, Java 11 did not yet exist.  I have no idea whether Java 11 improves 
things beyond Java 8.  That said ... all software evolves and usually 
improves as time goes by.  It is likely that the newer version has SOME 
benefit.


Regarding whether or not Java 14 is supported:  There are automated 
tests where all the important code branches are run with all major 
versions of Java, including pre-release versions, and those tests do 
include various garbage collectors.  Somebody notices when a combination 
doesn't work, and big problems with newer Java versions are something 
that gets discussed on our mailing lists.


Java 14 has been out for a while, with no big problems being discussed 
so far.  So it is likely that it works with Solr.  Can I say for sure? 
No.  I haven't tried it myself.


I don't have any hardware available where there is more than one NUMA, 
or I would look deeper into this myself.  It would be interesting to 
find out whether the -XX:+UseNUMA option makes a big difference in 
performance.


Thanks,
Shawn


Re: solr performance with >1 NUMAs

2020-09-25 Thread Shawn Heisey

On 9/23/2020 7:42 PM, Wei wrote:

Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
noticed that query latency almost doubled compared to deployment on single
NUMA machines. Not sure what's causing the huge difference. Is there any
tuning to boost the performance on multiple NUMA machines? Any pointer is
appreciated.


If you're running with standard options, Solr 8.4.1 will start using the 
G1 garbage collector.


As of Java 14, G1 has gained the ability to use the -XX:+UseNUMA option, 
which makes better decisions about memory allocations and multiple 
NUMAs.  If you're running a new enough Java, it would probably be 
beneficial to add this to the garbage collector options.  Solr itself is 
unaware of things like NUMA -- Java must handle that.


https://openjdk.java.net/jeps/345
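
A sketch of what that might look like in solr.in.sh, assuming Java 14
or later and the stock G1 options -- keep whatever other flags you
already use:

   GC_TUNE="-XX:+UseG1GC \
     -XX:+UseNUMA \
     -XX:+ParallelRefProcEnabled \
     -XX:MaxGCPauseMillis=250 \
     -XX:+AlwaysPreTouch"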

Thanks,
Shawn


Re: Solr 8.6.2 - Solr loaded a deprecated plugin/analysis

2020-09-23 Thread Shawn Heisey

On 9/22/2020 10:22 PM, Anuj Bhargava wrote:

How to solve this issue? How to replace it?

SolrResourceLoader
Solr loaded a deprecated plugin/analysis class [solr.DataImportHandler].
Please consult documentation how to replace it accordingly.


That is a generic message about using deprecated features.  It's a 
message that means "Hey, we noticed you're using something that's going 
to disappear one day.  You might want to look into using something else, 
because one day that thing you're using is going to be gone."


DIH (the DataImportHandler) is being discontinued.  It will no longer be 
maintained by the project.  Its functionality might be continued by the 
community in a new project, that may or may not be part of the Apache 
Foundation.


If you stick with Solr 8.x, you will not need to find a replacement for 
DIH.  It should remain part of Solr in all 8.x versions, most things 
that are deprecated are not removed until the next *major* version is 
released, which in this case, will be version 9.0.0.  We do not yet know 
when that version will be released.


Thanks,
Shawn


Re: Fetched but not Added Solr 8.6.2

2020-09-18 Thread Shawn Heisey

On 9/18/2020 1:27 AM, Anuj Bhargava wrote:

In managed schema, I have 

Still getting the following error-

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id


The problem is that the document that has been fetched with DIH does NOT 
have a field named id.  Because your schema has named the id field as 
uniqueKey, that field is required -- it *must* exist in any document 
that is indexed.


Your DIH config suggests that the database has a field named posting_id 
... perhaps your Solr schema should use that field as the uniqueKey instead?


Thanks,
Shawn


Re: schema.xml version attribute

2020-09-05 Thread Shawn Heisey

On 9/5/2020 3:30 AM, Dominique Bejean wrote:
Hi, I often see a bad usage of the version attribute in shema.xml. For 
instance  The version attribute is to 
specify the schema syntax and semantics version to be used by Solr. 
The current value is 1.6 It is clearly specified in schema.xml 
comments "It should not normally be changed by applications". However, 
what happens if this attribute is not correctly set ? I tried to find 
the answer in the code but without success. If the value is not 1.0, 
1.1, ... or 1.6, does Solr default it to the last correct value so 1.6 ? 


I've checked the code.

If the version is not specified in the schema, then it defaults to 1.0.  
The code that handles this can be found in IndexSchema.java.


Currently the minimum value is 1.0 and the maximum value is 1.6. If the 
actual configured version is outside of these limits, then the effective 
value is raised to the minimum or lowered to the maximum.


Thanks,
Shawn



Re: About solr.HyphenatedWordsFilter

2020-08-26 Thread Shawn Heisey

On 8/26/2020 12:05 AM, Kayak28 wrote:
I would like to tokenize the following sentence. I do want to tokens 
that remain hyphens. So, for example, original text: This is a new 
abc-edg and xyz-abc is coming soon! desired output tokens: 
this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/! Is there any way 
that I do not omit hyphens from tokens? I though HyphenatedWordsFilter 
does have similar functionalities, but it gets rid of hyphens.


I doubt that filter is what you need.  It is fully described in Javadocs:

https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html

Your requirement is a little odd.  Are you SURE that you want to 
preserve hyphens like that?


I think that you could probably achieve it with a carefully configured 
WordDelimiterGraphFilter.  This filter can be highly customized with its 
"types" parameter.  This parameter refers to a file in the conf 
directory that can change how the filter recognizes certain characters.  
I think that if you used the whitespace tokenizer along with the word 
delimiter filter, and put the following line into the file referenced by 
the "types" parameter, it would do most of what you're after:


- => ALPHA

What that config would do is cause the word delimiter filter to treat 
the hyphen as an alpha character -- so it will not use it as a 
delimiter.  One thing about the way it works -- the exclamation point at 
the end of your sentence would NOT be emitted as a token as you have 
described.  If that is critically important, and I cannot imagine that 
it would be, you're probably going to want to write your own custom 
filter.  That would be very much an expert option.
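
A minimal sketch of such a field type -- the type name and the types
file name are placeholders, and the filter is shown with its default
splitting options:

   <fieldType name="text_hyphen" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.WordDelimiterGraphFilterFactory"
               types="wdfftypes.txt"/>
     </analyzer>
   </fieldType>

with a wdfftypes.txt in the conf directory containing the single
"- => ALPHA" line shown above.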


Thanks,
Shawn



Re: SOLR Compatibility with Oracle Enterprise Linux 7

2020-08-24 Thread Shawn Heisey

On 8/24/2020 12:46 AM, Wang, Ke wrote:
We are using Apache SOLR version 8.4.4.0. The project is planning to 
upgrade the Linux server from Oracle Enterprise Linux (Red Hat 
Enterprise Linux) 6 to OEL 7. As I was searching on the Confluence 
page and was not able to find the information, can I please confirm 
if: * Apache SOLR 8.4.4.0 is compatible with Oracle Enterprise Linux 
(Red Hat Enterprise Linux) 7? Please let me know if any further 
information is required.


There is no 8.4.4.0 version of Solr.  The closest versions to that are 
8.4.0 and 8.4.1.  If you are seeing 8.4.4.0 as the version, that must 
have come from somewhere other than this project.


The only concrete system requirement for Solr is Java. Solr 8.x has a 
requirement of Java 8 or later.  If Java is available for the OS, then 
Solr should work on that OS.  I am pretty sure that Oracle Linux has 
Java available.


Thanks,
Shawn



Re: SOLR indexing takes longer time

2020-08-17 Thread Shawn Heisey

On 8/17/2020 12:22 PM, Abhijit Pawar wrote:

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?


There's not enough information here to provide a diagnosis.

Are you running Solr in cloud mode (with zookeeper)?

3.5 hours for 200K documents sounds like slowness with the data 
source, not a problem with Solr, but it's too soon to rule anything out.


Would you be able to write a program that pulls data from your mongo 
database but doesn't send it to Solr?  Ideally it would be a Java 
program using the same JDBC driver you're using with DIH.


Thanks,
Shawn



Re: Solr ping taking 600 seconds

2020-08-15 Thread Shawn Heisey

On 8/14/2020 3:39 PM, Susheel Kumar wrote:

One of our Solr 6.6.2 DR cluster (target CDCR) which even doesn't have any
live search load seems to be taking 600000 ms many times for the ping /
health check calls. Anyone has seen this before/suggestion what could be
wrong. The collection has 8 shards/3 replicas and 64GB memory and index
seems to fit in memory. Below solr log entries.


10 minutes for an all docs query is a REALLY long time.  On a properly 
sized and tuned system, that query should complete in far less than one 
second, and when Solr has the query cached, it might have a QTime in the 
single digits.


What happens if you manually send a query to the standard handler 
(usually /select) where the whole q parameter is just the "*:*" text, 
with no other parameters?


At the following location in one of our wiki pages, there are 
instructions for getting a screenshot showing some detailed information 
about the system:


https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Askingforhelponamemory/performanceissue

The rest of that wiki page has some good info about performance 
problems.  It's worth reading.  Disclaimer:  I wrote it.


I typically see one of two problems causing such massive performance issues:

1) Huge GC pauses, usually caused by a heap that's too small.
2) Too little memory in the system available for disk caching.

The screenshot described at the link above will tell us a LOT about your 
system, usually enough to pinpoint the cause of major performance problems.


Once you have the screenshot, please be aware that sending it as an 
email attachment will not work.  The mailing list strips most 
attachments as it processes your email, so you will need to upload the 
file to some kind of hosting and include a link to it in your email.  I 
normally use dropbox for that, and there are many other services.


Thanks,
Shawn



Re: Can create collections with Drupal 8 configset

2020-08-09 Thread Shawn Heisey

On 8/9/2020 8:11 AM, Shane Brooks wrote:

Thanks Shawn. The way we have it configured presently is as follows:
icu4j.jar is located in /opt/solr/contrib/analysis-extras/lib/icu4j-62.1.jar

solrconfig.xml contains:



Which should load the jar at startup, correct?


I do not know if that path spec is right or not.  It might be.

The class that doesn't load (in your error message) is not located in 
the icu4j jar.  It is located in the lucene-analyzers-icu-X.Y.Z.jar 
file, which is found in the contrib/analysis-extras/lucene-libs 
subdirectory.  That jar also needs the icu4j jar.


If the same class is loaded more than once, it probably won't work.  I 
know for sure from experience that this is the case for the Lucene ICU 
classes.  That's the biggest reason I use the ${solr.home}/lib directory 
-- so I am sure that each extra jar is only loaded once.  That directory 
does not exist until you create it.


Thanks,
Shawn


Re: HttpSolrClient Connection Evictor

2020-08-09 Thread Shawn Heisey

On 8/9/2020 2:46 AM, Srinivas Kashyap wrote:

We are using HttpSolrClient(solr-solrj-8.4.1.jar) in our app along with 
required jar(httpClient-4.5.6.jar). Before that we upgraded these jars from 
(solr-solrj-5.2.1.jar) and (httpClient-4.4.1.jar).

After we upgraded, we are seeing lot of below connection evictor statements in 
log file.

DEBUG USER_ID - STEP 2020-08-09 13:59:33,085 [Connection evictor] - Closing 
expired connections
DEBUG USER_ID - STEP 2020-08-09 13:59:33,085 [Connection evictor] - Closing 
connections idle longer than 5 MILLISECONDS


These logs are coming from the HttpClient library, not from SolrJ. 
Those actions appear to be part of HttpClient's normal operation.


It is entirely possible that the older version of the HttpClient library 
doesn't create these debug-level log entries.  That library is managed 
by a separate project under the Apache umbrella -- we here at the Solr 
project are not involved with it.


The solution here is to change your logging level.  You can either 
change the level of the main logger to something like INFO or WARN, or 
you can reduce the logging level of the HttpClient classes without 
touching the rest.  I do not know what logging system you are using.  If 
you need help with how to configure it, the people who made the logging 
system are much better equipped to configure it than we are.


I personally would change the default logging level for the whole 
program.  Those messages are logged at the DEBUG level.  Running an 
application with all loggers set that low should only be done when 
debugging a problem ... that level is usually far too verbose for a 
production system.  I do not recommend it at all.


If you choose to only change the level of the HttpClient classes, those 
loggers all start with "org.apache.http" which you will need for your 
logging configuration.


An additional note:  Your code should *not* create and close the 
HttpSolrClient for every query as you have done.  The HttpSolrClient 
object should be created once and re-used for the life of the program.
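
A minimal sketch of that pattern -- the URL and collection name are
placeholders:

   import org.apache.solr.client.solrj.SolrQuery;
   import org.apache.solr.client.solrj.impl.HttpSolrClient;
   import org.apache.solr.client.solrj.response.QueryResponse;

   public class SolrClientHolder {
     // one client for the life of the application, shared by all threads
     private static final HttpSolrClient CLIENT =
         new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection")
             .build();

     public static QueryResponse query(String q) throws Exception {
       return CLIENT.query(new SolrQuery(q));
     }

     // call once, at application shutdown
     public static void shutdown() throws Exception {
       CLIENT.close();
     }
   }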


Thanks,
Shawn


Re: Can create collections with Drupal 8 configset

2020-08-09 Thread Shawn Heisey

On 8/8/2020 10:31 PM, Shane Brooks wrote:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error 
from server at http://192.168.xx.xx:8983/solr: Expected mime type 
application/octet-stream but got text/html. \n\nhttp-equiv=\"Content-Type\" 
content=\"text/html;charset=utf-8\"/>\nError 500 Server 
Error\n\nHTTP ERROR 500\nProblem 
accessing /solr/admin/cores. Reason:\n Server 
ErrorCaused by:java.lang.NoClassDefFoundError: 
org/apache/lucene/collation/ICUCollationKeyAnalyzer\n\tat



I haven’t found anything on Google except that ICUCollationKeyAnalyzer 
depends on the icu4j library, which I verified is part of the SOLR 
package.


The icu4j library and the related Lucene jars for ICU capability are not 
part of Solr by default.  They can be found in the "contrib" area of the Solr 
download ... but you must add the jars to Solr if you intend to use any 
contrib capability.


The best way I have found to add custom jars to Solr is to create a 
"lib" directory under the location designated as the solr home and place 
the jars there.  All jars found in that directory will be automatically 
loaded and will be available to all cores.
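
For example -- the /var/solr/data path below is an assumption based on
the service installer's default solr home, so substitute whatever your
install actually uses:

   mkdir -p /var/solr/data/lib
   cp /opt/solr/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-*.jar \
      /var/solr/data/lib/
   cp /opt/solr/contrib/analysis-extras/lib/icu4j-*.jar /var/solr/data/lib/

Then restart Solr so the jars are picked up.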


An alternate way to load jars is with the <lib> directive in 
solrconfig.xml ... but I don't recommend that approach.


Thanks,
Shawn



Re: wt=xml not defaulting the results to xml format

2020-08-07 Thread Shawn Heisey

On 8/7/2020 9:30 AM, yaswanth kumar wrote:

solr/PROXIMITY_DATA_V2/select?q=pkey:223_*=true=country_en=country_en





What ever I am trying is not working other than sending wt=xml as a
parameter while hitting the url.


I tried your solrconfig.xml addition and a URL similar to yours out on 
8.5.1, using the techproducts example.  The results were in XML.


I'm betting that you modified a copy of solrconfig.xml that is *NOT* the 
correct one for PROXIMITY_DATA_V2.  Or that after modifying it, you did 
not reload the core or restart Solr.


If your Solr server is in cloud mode, then you must modify the 
solrconfig.xml that lives in the ZooKeeper database, under the config in 
use for the collection.


If your server is not in cloud mode, then the relevant file is most 
likely to be the solrconfig.xml that is in the core's "conf" directory. 
Modifying the version of the file under the configsets directory after 
the core is created will not change anything.


I am also curious about the answer to Alexandre's question -- what is in 
the echoed parameters found in the incorrect response?  Setting 
echoParams to "all" as you have done can be very useful for this.


Thanks,
Shawn


Re: wt=xml not defaulting the results to xml format

2020-08-07 Thread Shawn Heisey
How are you sending the query request that doesn't come back as xml? I suspect 
that the request is being sent with an explicit wt parameter set to something 
other than xml. Making a query with the admin ui would do this, and it would 
probably default to json.

When you make a query, assuming you haven't changed the logging config, every 
parameter in that request can be found in the log entry for the query, 
including those that come from the solrconfig.xml.

Sorry about the top posted reply. It's the only option on this email app. My 
computer isn't available so I'm on my phone.


On Aug 6, 2020, 21:52, yaswanth kumar wrote:
>Can someone help me on this ASAP? I am using solr 8.2.0 and below is
>the
>snippet from solrconfig.xml for one of the configset, where I am trying
>to
>default the results into xml format but its giving me as a json result.
>
>
>
>
>  all
>  10
>  
> pkey
> xml
>
>
>Can some one let me know if I need to do something more to always get a
>solr /select query results as XML??
>--
>Thanks & Regards,
>Yaswanth Kumar Konathala.
>yaswanth...@gmail.com



Re: Solr 8.5.2 - Solr shards param does not work without localhost

2020-08-06 Thread Shawn Heisey

On 8/6/2020 6:03 PM, gnandre wrote:

Please ignore the space between. I have updated the calls by removing space
below:

http://my.domain.com/solr/core/select?q=*:*=0=10=
my.domain.com/solr/another_core=*

http://my.domain.com/solr/core/select?q=*:*=0=10=
localhost:8983/solr/another_core=*


Assuming that these are the actual URLs (copy/paste) and not something 
you've typed up as an example...  one of them has port 8983 and the 
other has no port, which would mean it's using port 80.


That looks like it could be a problem.  It takes special effort to get 
Solr listening on port 80.


Thanks,
Shawn


Re: Solr 8.5.2 - Solr shards param does not work without localhost

2020-08-06 Thread Shawn Heisey

On 8/6/2020 5:59 PM, gnandre wrote:

http://my.domain.com/solr/core/select?q=*:*=0=10=
my.domain.com /solr/another_core=*

Ir does not work in Solr 8.5.2 anymore unless I pass localhost instead of
my domain in shards param value as follows:
http://my.domain.com/solr/core/select?q=*:*=0=10=
localhost:8983  /solr/another_core=*

This is a master-slave setup and not a cloud setup.


I've set up sharded indexes without SolrCloud before, and I've never 
used "localhost".  Always used FQDN.


When you try it using the name, what shows up in your solr logfile? I 
would assume you're getting some kind of error.  Can you share it?  It 
is likely to be many lines long.


Thanks,
Shawn



Re: SolrCloud on PublicCloud

2020-08-03 Thread Shawn Heisey

On 8/3/2020 12:04 PM, Mathew Mathew wrote:

Have been looking for architectural guidance on correctly configuring SolrCloud 
on Public Cloud (eg Azure/AWS)
In particular the zookeeper based autoscaling seems to overlap with the auto 
scaling capabilities of cloud platforms.

I have the following questions.

   1.  Should the ZooKeeper ensable be put in a autoscaling group. This seems 
to be a no, since the SolrNodes need to register against a static list of 
Zookeeper ips.


Correct.  There are features in ZK 3.5 for dynamic server membership, 
but in general it is better to have a static list.  The client must be 
upgraded as well for that feature to work.  The ZK client was upgraded 
to a 3.5 version in Solr 8.2.0.  I don't think we have done any testing 
of the dynamic membership feature.


ZK is generally best set up with either 3 or 5 servers, depending on the 
level of redundancy desired, and left alone unless there's a problem. 
With 3 servers, the ensemble can survive the failure of 1 server.  With 
5, it can survive the failure of 2.  As far as I know, getting back to 
full redundancy is best handled as a manual process, even if running 
version 3.5.



   2.  Should the SolrNodes be put in a AutoScaling group? Or should we just 
launch/register SolrNodes using a lambda function/Azure function.


That really depends on what you're doing.  There is no "one size fits 
most" configuration.


I personally would avoid setting things up in a way that results in Solr 
nodes automatically being added or removed.  Adding a node will 
generally result in a LOT of data being copied, and that can impact 
performance in a major way, so adding nodes should be scheduled to 
minimize impact.  If it's automatic in response to high load, adding a 
node can make performance a lot worse before it gets better.  When a 
node disappears, manual action is required for SolrCloud to forget the node.



   3.  Should the SolrNodes be associated with local storage or should they be 
attached to shared storage volumes.


Lucene (which provides most of Solr's functionality) generally does not 
like to work with shared storage.  In addition to potential latency 
issues for storage connected via a network, Lucene works extremely hard 
to ensure that only one process can open an index.  Using shared storage 
will encourage attempts to share the index directory between multiple 
processes, which almost always fails to work.


Things work best with locally attached storage utilizing an extremely 
fast connection method (like SATA or SCSI), and a locally handled 
filesystem.  Lucene uses some pretty involved file locking mechanisms, 
which often do not work well on remote or shared filesystems.


---

We (the developers that build this software) generally have a very 
near-sighted view of things, not really caring about details like the 
hardware deployment.  That probably needs to change a little bit, 
particularly when it comes to documentation.


Thanks,
Shawn


Re: Cybersecurity Incident Report

2020-07-24 Thread Shawn Heisey

On 7/24/2020 2:35 PM, Man with No Name wrote:
This version of jackson is pulled in as a shadow jar. Also solr is using 
io.netty version 4.1.29.Final which has critical vulnerabilities which 
are fixed in 4.1.44.


It looks like that shaded jackson library is included in the jar for 
htrace.  I looked through the commit history and learned that htrace is 
included for the HDFS support in Solr.  Which means that if you are not 
using the HDFS capability, then htrace will not be used, so the older 
jackson library will not be used either.


If you are not using TLS connections from SolrCloud to ZooKeeper, then 
your install of Solr will not be using the netty library, and 
vulnerabilities in netty will not apply.


The older version of Guava is pulled in with a jar from carrot2.  If 
your Solr install does not use carrot2 clustering, then that version of 
Guava will never be called.


The commons-compress and tika libraries are only used if you have 
configured the extraction contrib, also known as SolrCell.  This contrib 
module is used to index rich-text documents, such as PDF and Word. 
Because it makes Solr unstable, we strongly recommend that nobody should 
use SolrCell in production.  When rich-text documents need to be 
indexed, it should be accomplished by using Tika outside of Solr... and 
if that recommendation is followed, you can control the version used so 
that the well-known vulnerabilities will not be present.


We have always recommended that Solr should be located in a network 
place that can only be reached by systems and people who are authorized. 
 If that is done, then nobody will be able to exploit any 
vulnerabilities that might exist in Solr unless they first successfully 
break into an authorized system.


We do take these reports of vulnerabilities seriously and close them as 
quickly as we can.


Thanks,
Shawn


Re: IndexSchema is not mutable error Solr Cloud 7.7.1

2020-07-23 Thread Shawn Heisey

On 7/23/2020 8:56 AM, Porritt, Ian wrote:
Note: the solrconfig has <schemaFactory class="ClassicIndexSchemaFactory"/> defined.



org.apache.solr.common.SolrException: *This IndexSchema is not mutable*.

     at 
org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:376)


Your config contains an update processor chain using the 
AddSchemaFieldsUpdateProcessorFactory.


This config requires a mutable schema, but you have changed to the 
classic schema factory, which is not mutable.


You'll either have to remove the config for the update processor, or 
change back to the mutable schema.  I would recommend the former.


Thanks,
Shawn


Re: zookeeper data and collection properties were lost

2020-07-20 Thread Shawn Heisey

On 7/20/2020 10:30 AM, yaswanth kumar wrote:

1# I did make sure that zoo.cfg got the proper data dir and its not
pointing to temp folder; do I need to set the variables in ZK_ENV.sh. as
well on top of the zoo.cfg ??


Those are questions about the ZK server, which we are not completely 
qualified to answer.  ZK and Solr are separate Apache projects, with 
separate mailing lists.  We have some familiarity with ZK because it is 
required to run Solr in cloud mode, but are not experts.  We can only 
provide minimal help with standalone ZK servers ... you would need to 
talk to the ZK project for the best information.



Here are my confusions, as I said we are in two node architecture in DEV
but maintaining only one instance of zookeeper, is that true that I need to
maintain the same folder structure that we specify on the dataDir of
zoo.cfg on both the nodes ??


Each ZK server is independent of the others and should have its own data 
directory.  ZK will handle creating the contents of that directory, it 
is likely not something you would do.  Each server could have a 
different setting for the data directory, or the same setting.  Note 
that if the setting is the same on multiple servers, that each of those 
directories should point to separate storage.  If you try to use a 
shared directory (perhaps with NFS) then I would imagine that ZK will 
not function correctly.


A fault tolerant install of ZK cannot be created with only two servers. 
It requires a minimum of three.  For the Solr part, only two servers are 
required for minimal fault tolerance.  Each Solr server must be 
configured with the addresses and ports of all 3 (or more) zookeeper 
servers.


See the Note in the following sections of the ZK documentation:

https://zookeeper.apache.org/doc/r3.5.8/zookeeperAdmin.html#sc_zkMulitServerSetup

https://zookeeper.apache.org/doc/r3.5.8/zookeeperStarted.html#sc_RunningReplicatedZooKeeper

Thanks,
Shawn


Re: AtomicUpdate on SolrCloud is not working

2020-07-20 Thread Shawn Heisey

On 7/19/2020 1:37 AM, yo tomi wrote:

I have no choice but use post-processor.
However bug of SOLR-8030 makes me not feel like using it.


Can you explain why you need the trim field and remove blank field 
processors to be post processors?  When I think about these 
functionalities, they should work fully as expected even when executed 
as "pre" processors.


Thanks,
Shawn


Re: UpdateProcessorChains -cdcr processor along with ignore commit processor

2020-07-18 Thread Shawn Heisey

On 7/15/2020 11:39 PM, Natarajan, Rajeswari wrote:

Resending this again as I still could not make this work. So would like to know 
if this is even possible to have
both solr.CdcrUpdateProcessorFactory and 
solr.IgnoreCommitOptimizeUpdateProcessorFactory  in solrconfig.xml and get both 
functionalities work.


You need to create one update chain that uses both processors.  Only one 
update chain can be applied at a time.  So create one chain with all the 
processors you need and use that.


Your config has two chains.  Only one of them can be active on each update.
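
A sketch of a single combined chain -- the chain name is arbitrary, and
you should keep any other processors your current chains already
declare:

   <updateRequestProcessorChain name="cdcr-ignore-commits">
     <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
       <int name="statusCode">200</int>
     </processor>
     <processor class="solr.CdcrUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
   </updateRequestProcessorChain>

The combined chain then has to be referenced from your update handler
(the update.chain parameter) so that it is actually applied.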

Thanks,
Shawn


Re: AtomicUpdate on SolrCloud is not working

2020-07-18 Thread Shawn Heisey

On 7/17/2020 1:32 AM, yo tomi wrote:

When I did AtomicUpdate on SolrCloud by the following setting, it does
not work properly.


As Jörn Franke already mentioned, you haven't said exactly what "does 
not work properly" actually means in your situation.  Without that 
information, it will be very difficult to provide any real help.


Atomic update functionality is currently implemented in 
DistributedUpdateProcessorFactory.



---

  
  
  
  
  

---
When changed as follows and made it work, it became as expected.
---

  
  
  
  

---


The effective result difference between these configurations is that 
atomic updates will happen first with the first config, and in the 
second, atomic updates will happen second to last -- just before 
RunUpdateProcessorFactory.


Also, with the first config, most of the update processors are going to 
be executed on the machine with the shard leader (after the update is 
distributed) and if there is more than one NRT replica, they will be 
executed multiple times.  With the second config, most of the processors 
will be executed on the machine that actually receives the update 
request.  For the purposes of that discussion, remember that when a TLOG 
replica is elected leader, it effectively behaves as an NRT replica.


Does that information help you determine why it doesn't do what you expect?


The later setting and the way of using post-processor could make the
same result, I though,
but using post-processor, bug of SOLR-8030 makes me not feel like using it.
By the latter setting even, is there any possibility of SOLR-8030 to
become?


See this part of the reference guide for a bunch of gory details about 
DistributedUpdateProcessorFactory:


https://cwiki.apache.org/confluence/display/SOLR/UpdateRequestProcessor#UpdateRequestProcessor-DistributedUpdates

In SOLR-8030, the general consensus among committers is that you should 
configure almost all update processors as "pre" processors -- placed 
before DistributedUpdateProcessorFactory in the config.  When done this 
way, updates are usually faster and less likely to yield inconsistent 
results.


There may be situations where having them as "post" processors is 
correct, but that won't happen very often.  The second config above does 
implicitly use "pre" for most of the processors.


Thanks,
Shawn


Re: In-place update vs Atomic updates

2020-07-14 Thread Shawn Heisey

On 7/14/2020 12:21 PM, raj.yadav wrote:

As per the above statement in atomic-update, it reindex the entire document
and deletes the old one.
But I was going through solr documentation regarding the solr document
update policy and found these two contradicting statements:

1. /The first is atomic updates. This approach allows changing only one or
more fields of a document without having to reindex the entire document./


Here is how I would rewrite that paragraph to make it correct.  The 
asterisks represent bold text:


1. The first is atomic updates.  This approach allows the indexing 
request to contain *only* the desired changes, instead of the entire 
document.



2./In regular atomic updates, the entire document is reindexed internally
during the application of the update. /


This is correct as written.
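
To make the first statement concrete, an atomic update request carries
only the uniqueKey plus the changed fields.  A sketch in SolrJ -- the
URL, core and field names are invented:

   import java.util.Collections;
   import org.apache.solr.client.solrj.impl.HttpSolrClient;
   import org.apache.solr.common.SolrInputDocument;

   public class AtomicUpdateExample {
     public static void main(String[] args) throws Exception {
       try (HttpSolrClient client = new HttpSolrClient.Builder(
           "http://localhost:8983/solr/mycore").build()) {
         SolrInputDocument doc = new SolrInputDocument();
         doc.addField("id", "doc-1");                                  // uniqueKey
         doc.addField("price", Collections.singletonMap("set", 42.0)); // only change
         client.add(doc);
         client.commit();
         // Internally, Solr still rebuilds and reindexes the whole document.
       }
     }
   }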

Thanks,
Shawn


Re: Query in quotes cannot find results

2020-07-11 Thread Shawn Heisey

On 6/30/2020 12:07 PM, Permakoff, Vadim wrote:

Regarding removing the stopwords, I agree, there are many cases when you don't 
want to remove the stopwords, but there is one very compelling case when you 
want them to be removed.

Imagine, you have one document with the following text:
1. "to expand the methods for mailing cancellation"
And another document with the text:
2. "to expand methods for mailing cancellation"

The user query is (without quotes): q=expand the methods for mailing 
cancellation
I don't want to bring all the documents with condition q.op=OR, it will find too many 
unrelated documents, so I want to search with q.op=AND. Unfortunately, the document 2 
will not be found as it has no stop word "the" in it.
What should I do now?


Do these users want imprecise matches to only show up when there is a 
well-known stopword involved, or do they also want imprecise matches to 
show up with ANY word missing, added, or moved?  If I were betting on 
it, I'd say they want the latter, not the former.  Erick already gave 
you the solution to that -- phrase slop.


In modern times, the only valid reason I can think of to implement a 
stopword filter is for situations where you want it to be impossible to 
search for certain words -- some might want expletives in this category, 
for example.


Tuning a Solr config for good results is an exercise in tradeoffs.  The 
core tradeoff in most situations is the standard "precision vs. recall" 
discussion.  A change that increases precision will almost always reduce 
recall, and vice versa.  I know from experience that you'll get more 
complaints about reducing recall than you will about reducing precision. 
 Implementing a hard-coded phrase slop value of 1 will reduce precision 
by an amount that's hard to determine, and GREATLY increase recall. 
Chances are good that most users will appreciate the change.  If you 
make the phrase slop setting configurable by the user, that's even better.


Thanks,
Shawn


Re: Solr heap Old generation grows and it is not recovered by G1GC

2020-07-11 Thread Shawn Heisey

On 6/25/2020 2:08 PM, Odysci wrote:

I have a solrcloud setup with 12GB heap and I've been trying to optimize it
to avoid OOM errors. My index has about 30million docs and about 80GB
total, 2 shards, 2 replicas.


Have you seen the full OutOfMemoryError exception text?  OOME can be 
caused by problems that are not actually memory-related.  Unless the 
error specifically mentions "heap space" we might be chasing the wrong 
thing here.



When the queries return a smallish number of docs (say, below 1000), the
heap behavior seems "normal". Monitoring the gc log I see that young
generation grows then when GC kicks in, it goes considerably down. And the
old generation grows just a bit.

However, at some point i have a query that returns over 300K docs (for a
total size of approximately 1GB). At this very point the OLD generation
size grows (almost by 2GB), and it remains high for all remaining time.
Even as new queries are executed, the OLD generation size does not go down,
despite multiple GC calls done afterwards.


Assuming the OOME exceptions were indeed caused by running out of heap, 
then the following paragraphs will apply:


G1 has this concept called "humongous allocations".  In order to reach 
this designation, a memory allocation must get to half of the G1 heap 
region size.  You have set this to 4 megabytes, so any allocation of 2 
megabytes or larger is humongous.  Humongous allocations bypass the new 
generation entirely and go directly into the old generation.  The max 
value that can be set for the G1 region size is 32MB.  If you increase 
the region size and the behavior changes, then humongous allocations 
could be something to investigate.
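
For reference, the change being discussed is just swapping the region
size flag in your existing GC options, for example:

   -XX:G1HeapRegionSize=32m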


In the versions of Java that I have used, humongous allocations can only 
be reclaimed as garbage by a full GC.  I do not know if Oracle has 
changed this so the smaller collections will do it or not.


Were any of those multiple GCs a Full GC?  If they were, then there is 
probably little or no garbage to collect.  You've gotten a reply from 
"Zisis T." with some possible causes for this.  I do not have anything 
to add.


I did not know about any problems with maxRamMB ... but if I were 
attempting to limit cache sizes, I would do so by the size values, not a 
specific RAM size.  The size values you have chosen (8192 and 16384) 
will most likely result in a total cache size well beyond the limits 
you've indicated with maxRamMB.  So if there are any bugs in the code 
with the maxRamMB parameter, you might end up using a LOT of memory that 
you didn't expect to be using.
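

For illustration only (the numbers are placeholders, not recommendations, 
and the cache class should match whatever your version ships with), this is 
the sort of definition I mean -- an entry-count limit with no maxRamMB:

  <filterCache class="solr.CaffeineCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>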


Thanks,
Shawn


Re: SOLR / Zookeeper Compatibility

2020-07-11 Thread Shawn Heisey

On 7/10/2020 5:14 AM, mithunseal wrote:

I am new to this SOLR-ZOOKEEPER. I am not able to understand the
compatibility thing. For example, I am using SOLR 7.5.0 which uses ZK
3.4.11. So SOLR 7.5.0 will not work with ZK 3.4.10?

Can someone please confirm this?


According to what the ZooKeeper project has published regarding 
compatibility, Solr 7.5.0 (with ZK client version 3.4.11) should work 
with ZK servers from 3.3.0 to the latest 3.5.x, which is currently 
3.5.8.  Using Solr 7.5.0 with ZK servers running version 3.6.x *might* 
work, but there is no guarantee.


https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement

And that is assuming that the newer version does not use any 
capabilities that did not exist in the older version.  For instance, if 
you want to have a dynamic server ensemble, all ZK versions must be 
3.5.x or newer, because that capability did not exist in 3.4.x.


Thanks,
Shawn


Re: Solr docker image works with image option but not with build option in docker-compose

2020-07-09 Thread Shawn Heisey

On 7/8/2020 3:36 PM, gnandre wrote:

I am using Solr docker image 8.5.2-slim from https://hub.docker.com/_/solr.
I use it as a base image and then add some more stuff to it with my custom
Dockerfile. When I build the final docker image, it is built successfully.
After that, when I try to use it in docker-compose.yml (with build option)
to start a Solr service, it complains about no permission for creating
directories under /var/solr path. I have given read/write permission to
solr user for the /var/solr path in the Dockerfile. Also, when I use image instead
of build option in docker-compose.yml file for the same image, it does not
throw any errors like that and Solr starts without any issues. Any clue why
this might be happening?


The docker images for Solr are not created by this project.  They are 
made by third parties.


We are in discussions for bringing one of the docker images into the 
project, but until that happens, support for it will have to come from 
the people that made it.  We know very little about how to deal with any 
problems that are occurring.


I would really like to help, but I do not know what might be wrong, and 
I do not know what questions to ask.


Thanks,
Shawn


Re: Query in quotes cannot find results

2020-06-29 Thread Shawn Heisey

On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:

The basic query q=expand the methods   <<< finds the document,
the query (in quotes) q="expand the methods"   <<< cannot find the document

Am I doing something wrong, or is it known bug (I saw similar issues discussed 
in the past, but not for exact match query) and if yes - what is the Jira for 
it?


The most helpful information will come from running both queries with 
debug enabled, so you can see how the query is parsed.  If you add a 
parameter "debugQuery=true" to the URL, then the response should include 
the parsed query.  Compare those, and see if you can tell what the 
differences are.
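

For example (substitute your own core name and handler), running both of 
these and comparing the "parsedquery" entries in the debug section is 
usually enough to spot the problem:

  /solr/yourcore/select?q=expand the methods&debugQuery=true
  /solr/yourcore/select?q="expand the methods"&debugQuery=true

The quotes and spaces will need URL encoding if you send these outside the 
admin UI.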


One of the most common problems for queries like this is that you're not 
searching the field that you THINK you're searching.  I don't know 
whether this is the problem, I just mention it because it is a common error.


Thanks,
Shawn


Re: Solr 8.5.2: DataImportHandler failed to instantiate org.apache.solr.request.SolrRequestHandler

2020-06-26 Thread Shawn Heisey

On 6/24/2020 1:59 PM, Peter van de Kerk wrote:

So I copied files from C:\solr-8.5.2\dist to C:\solr-8.5.2\server\lib

But then I get error


org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Error Instantiating requestHandler, 
org.apache.solr.handler.dataimport.DataImportHandler failed to instantiate 
org.apache.solr.request.SolrRequestHandler


All of the errors will be MUCH longer than what you have included here, 
and we need that detail to diagnose anything.  If you are seeing these 
in the admin UI "logging" tab, you can click on the little "i" icon to 
expand them.  But be aware that you'll have to read/copy the expanded 
data very quickly - the admin UI will quickly close the expansion.  It's 
much better to go to the actual logfile than use the admin UI.


There are better locations than server/lib for jars, but I don't think 
that's causing the problem.  You should definitely NOT copy ALL of the 
jars in the dist directory -- this places an additional copy of the main 
Solr jars on the classpath, and having the same jar accessible from two 
places is a VERY bad thing for Java software.  It causes some really 
weird problems, and I can see this issue being a result of that.  For 
most DIH uses, you only need the "solr-dataimporthandler-X.Y.Z.jar" 
file.  For some DIH use cases (but not most of them) you might also need 
the extras jar.


Thanks,
Shawn


Re: Deleting on exact match

2020-06-21 Thread Shawn Heisey

On 6/21/2020 1:04 PM, Scott Q. wrote:

The task at hand is to remove all documents indexed the old way, but
how can I do that ? user is of the form u...@domain.com and if I
search for u...@domain.com it matches all of 'user' or 'domain.com'
which has obvious unwanted consequences.

Therefore, how can I remove older documents which were indexed with
partial match ?


If it were me, I would probably set up a new core/collection with any 
config changes you want to make and reindex into the new location from 
scratch.


Then once the new index is available, you can switch your application to 
it and completely delete the old one.
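

If this is SolrCloud, one low-impact way to handle that switch (the names 
here are made up) is to have the application query an alias and then 
repoint the alias with the Collections API once the new collection is ready:

  /admin/collections?action=CREATEALIAS&name=products&collections=products_v2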


Thanks,
Shawn


Re: Gettings interestingTerms from solr.MoreLikeThisHandler using SolrJ

2020-06-19 Thread Shawn Heisey

On 6/18/2020 5:31 AM, Zander, Sebastian wrote:

In the returning QueryResponse I can't find the interestingTerms.
I would really like to grab it on this way, before calling another time.
Any advices? I'm running solr 8.5.2


If you can send the full json or XML response, I think I can show you 
how to parse it with SolrJ.  I don't have easy access to production Solr 
servers, so it's a little difficult for me to try it out myself.
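

In the meantime, here is a rough and untested sketch of what I would expect 
to work (the URL, handler path, field, and collection names are made up). 
As far as I know there is no typed getter for interestingTerms, so it has 
to be pulled out of the raw NamedList:

  SolrClient client =
      new HttpSolrClient.Builder("http://localhost:8983/solr").build();
  SolrQuery query = new SolrQuery("id:12345");
  query.setRequestHandler("/mlt");
  query.set("mlt.fl", "body");
  query.set("mlt.interestingTerms", "details");
  QueryResponse rsp = client.query("mycollection", query);
  // With "details" this should be term/boost pairs; with "list", just terms
  Object interesting = rsp.getResponse().get("interestingTerms");
  System.out.println(interesting);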


Thanks,
Shawn


Re: Solr cloud backup/restore not working

2020-06-17 Thread Shawn Heisey

On 6/17/2020 8:55 PM, yaswanth kumar wrote:

Caused by: javax.crypto.BadPaddingException: RSA private key operation
failed


Something appears to be wrong with the private key that Solr is 
attempting to use for a certificate.


Best guess, incorporating everything I can see in the stacktrace, is 
that you have enabled certificate-based authentication, and the private 
key for the client certificate is malformed in some way.  The error 
message originated in Java code, not Solr.


https://docs.oracle.com/javase/8/docs/api/javax/crypto/BadPaddingException.html

It sounds like the keystore has a problem.  You would need to consult 
with someone who is an expert at Java crypto mechanisms.


Thanks,
Shawn


Re: Solr cloud backup/restore not working

2020-06-17 Thread Shawn Heisey

On 6/16/2020 8:44 AM, yaswanth kumar wrote:

I don't see anything related in the solr.log file for the same error. Not
sure if there is anyother place where I can check for this.


The underlying request that failed might be happening on one of the 
other nodes in the cloud.  It might be necessary to check the solr.log 
file on multiple machines.


The response here does NOT contain any information about what caused the 
problem.  All it says is that an ADDREPLICA action necessary to complete 
the restore failed.  You'll need to locate the node where the ADDREPLICA 
failed, and we will need to see the FULL error message.  It is probably 
dozens of lines in length.


I see that you've opened an issue in Jira.  That is premature.  The Solr 
project does NOT use Jira as a support portal.  If we determine that 
you're running into a bug, then it would be appropriate to open an issue.


Thanks,
Shawn


Re: Log4J Logging to Http

2020-06-17 Thread Shawn Heisey

On 6/17/2020 1:33 AM, Krönert Florian wrote:
2020-06-17T07:06:55.121856339Z java.lang.NoClassDefFoundError: Failed to 
initialize Apache Solr: Could not find necessary SLF4j logging jars. If 
using Jetty, the SLF4j logging jars need to go in the jetty lib/ext 
directory. For other containers, the corresponding directory should be 
used. For more information, see: http://wiki.apache.org/solr/SolrLogging


It seems that these jars are only needed when using the http appender; 
without this appender everything works.


There must be some aspect of your log4j2.xml configuration that requires 
a jar that is not included with Solr.


Can you point me in the right direction, where I need to place the 
needed jars? Seems to be a little special since I only access the 
/var/solr mount directly, the rest is running in docker.


If there are extra jars needed for your logging config, they should go 
in the server/lib/ext directory, which should already exist and contain 
several jars related to logging.
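

Since you are already building a custom image, that probably means copying 
the extra jar in your Dockerfile, something along these lines (the jar name 
is made up, and I am assuming the official image's /opt/solr layout):

  COPY extra-logging-dependency.jar /opt/solr/server/lib/ext/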


Thanks,
Shawn


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Shawn Heisey

On 6/17/2020 2:36 PM, Trey Grainger wrote:

2) TLOG - which can only serve in the role of follower


This is inaccurate.  TLOG can become leader.  If that happens, then it 
functions exactly like an NRT leader.


I'm aware that saying the following is bikeshedding ... but I do think 
it would be as mistake to use any existing SolrCloud terminology for 
non-cloud deployments, including the word "replica".  The top contenders 
I have seen to replace master/slave in Solr are primary/secondary and 
publisher/subscriber.


It has been interesting watching this discussion play out on multiple 
open source mailing lists.  On other projects, I have seen a VERY high 
level of resistance to these changes, which I find disturbing and 
surprising.


Thanks,
Shawn


Re: How to determine why solr stops running?

2020-06-16 Thread Shawn Heisey

On 6/11/2020 11:52 AM, Ryan W wrote:

I will check "dmesg" first, to find out any hardware error message.





[1521232.781801] Out of memory: Kill process 117529 (httpd) score 9 or
sacrifice child
[1521232.782908] Killed process 117529 (httpd), UID 48, total-vm:675824kB,
anon-rss:181844kB, file-rss:0kB, shmem-rss:0kB

Is this a relevant "Out of memory" message?  Does this suggest an OOM
situation is the culprit?


Because this was in the "dmesg" output, it indicates that it is the 
operating system killing programs because the *system* doesn't have any 
memory left.  It wasn't Java that did this, and it wasn't Solr that was 
killed.  It very well could have been Solr that was killed at another 
time, though.


The process that it killed this time is named httpd ... which is most 
likely the Apache webserver.  Because the UID is 48, this is probably an 
OS derived from Redhat, where the "apache" user has UID and GID 48 by 
default.  Apache with its default config can be VERY memory hungry when 
it gets busy.



-XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912


This says that you started Solr with the default 512MB heap.  Which is 
VERY VERY small.  The default is small so that Solr will start on 
virtually any hardware.  Almost every user must increase the heap size. 
And because the OS is killing processes, it is likely that the system 
does not have enough memory installed for what you have running on it.
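

The usual way to raise the heap is the SOLR_HEAP setting in the include 
script (solr.in.sh on Linux systems).  The value below is only a 
placeholder -- the right number depends on your index size and query load:

  SOLR_HEAP="4g"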


It is generally not a good idea to share the server hardware between 
Solr and other software, unless the system has a lot of spare resources, 
memory in particular.


Thanks,
Shawn


Re: Solr cloud backup/restore not working

2020-06-16 Thread Shawn Heisey

On 6/12/2020 8:38 AM, yaswanth kumar wrote:

Using Solr 8.2.0 and set up a cloud with 2 nodes (2 replicas for each
collection).
Enabled basic authentication and gave all access to the admin user

Now trying to use solr cloud backup/restore API, backup is working great,
but when trying to invoke restore API its throwing the below error



 "msg":"ADDREPLICA failed to create replica",
 "trace":"org.apache.solr.common.SolrException: ADDREPLICA failed to
create replica\n\tat
org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:53)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:280)\n\tat


The underlying cause of this exception is not recorded here.  Are there 
other entries in the Solr log with more detailed information from the 
ADDREPLICA attempt?


Thanks,
Shawn


Re: Proxy Error when cluster went down

2020-06-16 Thread Shawn Heisey

On 6/15/2020 9:04 PM, Vishal Vaibhav wrote:

I am running on Solr 8.5. For some reason the entire cluster went down. When I
am trying to bring up the nodes, they are not coming up. My health check is
on "/solr/rules/admin/system". I tried forcing a leader election but it
didn't help.
so when i run the following commands. Why is it trying to proxy when those
nodes are down. Am i missing something?





java.net.UnknownHostException:
search-rules-solr-v1-2.search-rules-solr-v1.search-digital.svc.cluster.local:


It is trying to proxy because it's SolrCloud.  SolrCloud has an internal 
load balancer that spreads queries across multiple replicas when 
possible.  Your cluster must be aware of multiple servers where the 
"rules" collection can be queried.


The underlying problem behind this error message is that the following 
hostname is being looked up, and it doesn't exist:


search-rules-solr-v1-2.search-rules-solr-v1.search-digital.svc.cluster.local

This hostname is most likely coming from /etc/hosts on one of your 
systems when that system starts Solr and it registers with the cluster, 
and that /etc/hosts file is the ONLY place that the hostname exists, so 
when SolrCloud tries to forward the request to that server, it is failing.
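

A quick way to confirm that is to run something like this on the node that 
logged the error and on the node that registered the name, and compare the 
results:

  getent hosts search-rules-solr-v1-2.search-rules-solr-v1.search-digital.svc.cluster.local

If the name only resolves on one machine, the entry almost certainly lives 
in that machine's /etc/hosts.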


Thanks,
Shawn


Re: getting different errors from complex phrase query

2020-06-16 Thread Shawn Heisey

On 6/15/2020 2:52 PM, Deepu wrote:

sample query is
"{!complexphrase inOrder=true}(all_text_txt_enus:\"by\\ test*\") AND
(({!terms f=product_id_l}959945,959959,959960,959961,959962,959963)
AND (date_created_at_rdt:[2020-04-07T01:23:09Z TO *} AND
date_created_at_rdt:{* TO 2020-04-07T01:24:57Z]))"

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server  https://XX.XX.XX:8983/solr/problem: undefined field text


The error is "undefined field text".  How exactly that occurs with what 
you have sent, I do not know.  There is something defined somewhere that 
refers to a field named "text" and the field does not exist in that index.


Something that may be indirectly relevant:  Generally speaking, Solr 
only supports one "localparams" in a query, and it must be the first 
text in the query string.  You have two -- one starts with 
{!complexphrase and the other starts with {!terms.


There are some special circumstances where multiples are allowed, but I 
do not know which circumstances.  For the most part, more than one isn't 
allowed or supported.  I am pretty sure that you can't use multiple 
query parsers in one query string.
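

One way to avoid the second localparams block (a sketch only, reusing your 
field names, not tested) is to leave just the complexphrase parser in q and 
move the other clauses into fq parameters:

  q={!complexphrase inOrder=true}all_text_txt_enus:"by test*"
  fq={!terms f=product_id_l}959945,959959,959960,959961,959962,959963
  fq=date_created_at_rdt:[2020-04-07T01:23:09Z TO 2020-04-07T01:24:57Z]

These are shown unencoded for readability, and you may need to adjust the 
range brackets to get the exact inclusive/exclusive behavior you had before.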


Thanks,
Shawn


Re: eDismax query syntax question

2020-06-16 Thread Shawn Heisey

On 6/15/2020 8:01 AM, Webster Homer wrote:

Only the minus following the parenthesis is treated as a NOT.
Are parentheses special? They're not mentioned in the eDismax documentation.


Yes, parentheses are special to edismax.  They are used just like in 
math equations, to group and separate things or to override the default 
operator order.


https://lucene.apache.org/solr/guide/8_5/the-standard-query-parser.html#escaping-special-characters

The edismax parser supports a superset of what the standard (lucene) 
parser does, so they have the same special characters.
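

If you need a literal parenthesis in the query text rather than grouping, 
it has to be escaped with a backslash or the term placed in quotes, for 
example (the field name is made up):

  q=part_desc:valve\(brass\)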


Thanks,
Shawn

