Re: [DISCUSS] ConfigSet ZK to file system fallback

Tomás Fernández Löbbe Tue, 26 Jan 2021 10:27:33 -0800

Thanks for bringing this up, David. I thought about this same situation
before, but I never convinced myself one way or the other :p. As I
mentioned in many other emails, I think the infrastructure and the node
configuration (such as solr.xml) need to be local (or at least need to be
able to be local and not forced onto ZooKeeper) for various reasons.
The same reasons exist for configsets: safe upgrades, or possible
node-specific configuration, as you mentioned. But Configsets have another
layer of complexity in my mind, which is, you don't know where you'll need
them... because you don't (necessarily) know where replicas of a collection
are going to be created. True that this is not a problem in the Docker
image situation you are describing, or if handled with care, but how can
Solr make sure of it?


But I think it's a valuable feature to explore. Maybe the configset needs
to exist in ZooKeeper and have some sort of flag (similar to secure=true)
where it could say "local=true", and then Solr instances would fail to
start if the configset is not present locally, or something? Otherwise the
collection creation and replica addition operations may need to know where
configsets are present, etc. I'm wondering if this mix you are proposing of
some files in ZooKeeper and some files local wouldn't complicate things too
much... not sure.
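
To make the flag idea a bit more concrete, here is a rough sketch of a fail-fast startup check. It is only an illustration: the "local" property, the dict standing in for ZooKeeper metadata, the flat directory layout, and both function names are assumptions made up for this example, not an existing Solr API.

```python
from pathlib import Path


def missing_local_configsets(zk_configsets, configsets_dir):
    """Names of configsets flagged local=true in ZK that this node lacks on disk.

    zk_configsets: dict of configset name -> properties dict, a stand-in for
    whatever metadata would be stored in ZooKeeper (hypothetical shape).
    """
    base = Path(configsets_dir)
    return sorted(
        name
        for name, props in zk_configsets.items()
        if props.get("local", False) and not (base / name).is_dir()
    )


def check_node_can_start(zk_configsets, configsets_dir):
    """Raise if any local=true configset is absent, so the node fails fast."""
    missing = missing_local_configsets(zk_configsets, configsets_dir)
    if missing:
        raise RuntimeError("missing local configsets: " + ", ".join(missing))
```

A check like this would only cover the "is it present on this node" half; it says nothing about how replica placement would learn which nodes have which configsets.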

Tomás

On Mon, Jan 25, 2021 at 3:15 PM David Smiley <dsmi...@apache.org> wrote:

> I'm not entirely sure how to react to the feedback.  Maybe in listing
> multiple benefits and a follow-on proposal, I inadvertently opened doors to
> distracting points.  I know I can be guilty of scope creep.  My proposal
> has no impact on where JARs go, so let's not discuss lib directories, the
> package store, or LTR's feature store, none of which my proposal is
> related to, ok?  My proposal doesn't even add a new configuration place
> that doesn't already exist.
>
> Let me try to express this proposal through a different angle / lens that
> I think is more clear and motivating than the first:
>
> Each physical Solr node (perhaps a Docker image) is composed of Solr's
> code, perhaps some plugin code too, and some configuration files with some
> settings.  Baked into any code are settings with a default value.  There
> are trivial primitive settings like an integer for "maxMergeAtOnce" on
> TieredMergePolicy, and there are more aggregate settings, like what the
> default MergePolicy is.  Sometimes the default changes from one release to
> the next, or new settings get added or go away (albeit rarely).  Let's just
> consider SolrCloud.
>
> ... Let's say you need to make a settings change.  ...
>
> For changes specified in solrconfig.xml (generalizable to any file in the
> configSet, really), you MUST deploy this to ZooKeeper.  That sucks when the
> configuration might only make sense for some nodes.  Most likely you are
> doing an upgrade in which you can't simply change the Solr nodes in an
> instant, but perhaps some nodes are simply different (different hardware?
> -- SSDs vs HDDs).  Upgrades can be orchestrated but it's more complex when
> there is ZK resident configuration, and it will impose annoying
> restrictions on the underlying code (i.e. back-compat concerns).  By having
> a "physical layer configuration" (borrowing Eric's terminology), we can tie
> some settings to this layer while still having a higher level layer.  I
> proposed one way of doing this; I'd be happy to discuss others.
>
> I'd like to extend the same argument to solr.xml, a node level
> configuration file.  Here, at least there is already _some_ flexibility --
> you can supply solr.xml with the physical layer (the Docker image) *OR* in
> ZooKeeper.  But IMO it's not ideal because it's either-or.  Some
> configuration might make sense at the physical node level, and some at the
> cluster level.  Ideally IMO, we'd have a way to blend both such that the
> deployer chooses where the configuration makes sense based on their cluster.
>
> WDYT?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sun, Jan 24, 2021 at 6:08 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
>> An aspect that would be interesting to consider IMO is upgrade and
>> configuration changes.
>> For example, a collection in use across a Solr version upgrade might require
>> a different configuration (config set) with the old and new Solr versions.
>> Solr itself can require changes in config across updates.
>>
>> Backward compatibility is the usual answer (the new code continues
>> working with the old config that can be updated once all nodes have been
>> deployed) but this imposes constraints on new code.
>> If there was a way for the new Solr code to "magically" use a different
>> config set for the collection (and for Solr config in general) there would
>> be more freedom to add or change features, change default behavior across
>> Solr versions etc.
>>
>> Ilan
>>
>> On Sat 23 Jan 2021 at 22:22, Gus Heck <gus.h...@gmail.com> wrote:
>>
>>> I'm in agreement with Eric here that fewer ways (or at least a clearer
>>> default way) of supplying resources would be better. Additionally, it
>>> should be easy to specify that this resource that I've shared should be
>>> loaded on a per SolrCore or per node basis (or even better per collection
>>> present on the node, accessible under a standard name to replicas belonging
>>> to that collection?). There are not many cases, beyond the simplest
>>> single-collection install with few shards, where you want a 1GB resource
>>> to be duplicated in memory across N cores running on the same node, though
>>> obviously there are ample cases where the 10k stop words file is meant to
>>> differ across collections.
>>>
>>> As it stands Eric's list seems like something that should be in the
>>> documentation somewhere just so people can properly troubleshoot where
>>> something they don't expect to be loaded is getting loaded from, or why
>>> their attempts to load something new aren't working...  especially if it
>>> were ordered to show the precedence of these options.
>>>
>>> As for ease of editing configurations, I've long felt that this should
>>> be possible via the admin UI though there's been much worry about security
>>> implications there. Personally, I think that those concerns are resolvable,
>>> but have not found time to make that case. Aside from that I think we need
>>> to support tooling to enable easy management of config sets rather than
>>> expanding the possible number of places the configurations might get loaded
>>> from.
>>>
>>> Several years ago I wrote a plugin for gradle that is very very basic,
>>> but after some configuration so that it can see zookeeper, it will happily
>>> pull configs down and push them up for you which is convenient for keeping
>>> configs under version control during development. There's LOTS to improve
>>> there, most especially adding support to manage multiple configs at a time,
>>> and I had hoped that folks would use it and have suggestions or
>>> contributions, but I've got no indication that anyone but me uses it
>>> (https://github.com/nsoft/solr-gradle).
>>>
>>> -Gus
>>>
>>> On Fri, Jan 22, 2021 at 8:19 AM Eric Pugh <
>>> ep...@opensourceconnections.com> wrote:
>>>
>>>> There is a lot in here ;-).
>>>>
>>>> With the caveat that I don’t have the recent experience that many of you
>>>> do with massive Solr clusters, I think that we need to commit to fewer, not
>>>> more, ways of maintaining the supporting resources that these clusters
>>>> need.  I’d like to see ways of managing our Solr clusters that encourage
>>>> easy change and experimentation, and encourage us to separate the physical
>>>> layer (version of Solr, networking setup, packages used) from the logical
>>>> layer (individual collections and their supporting code and resources).
>>>>
>>>> I think the configSet was a huge jump forward.  My workflow is to think:
>>>> 1) What’s unusual about this Solr setup?  What does the physical layer
>>>> need to be?  Special package?  Special code?  Build a Docker image.
>>>> 2) Fire up a three node Solr cluster, wait till it’s up and responsive
>>>> via checking APIs.
>>>> 3) Now think about my specific use case.  What collections do I need?
>>>> Is it just 1, or is it 5 or 10 collections?  Are they on the same configSet
>>>> or different?  Great, zip up the configSet and pop it into Solr via APIs.
>>>>
>>>> 4) Create the collections in the shapes I need with the APIs, and now
>>>> start iterating on what I need to do.  Use the APIs to create fields, or
>>>> set up different ParamSets.
>>>>
>>>> However, with configSets we only did half the job, because we still
>>>> don’t have a single well understood way of handling Jars and other
>>>> resources.  We have many ways of doing it, which generates constant user
>>>> confusion and contributes to the perspective that “Solr is hard to use”.
>>>>
>>>> Right now, across the Solr landscape I can think of many ways of adding
>>>> “external” files to my Solr:
>>>>
>>>> 1) Classic ./lib as a place to put things.
>>>> 2) The new-to-me solr.allow.unsafe.resourceloading=true approach.
>>>> 3) The userfiles directory in Solr, accessed by the streaming expressions
>>>> load function.
>>>> 4) The “package store” for packages, located in the file store.
>>>> 5) The blob store .system concept from before the package store.
>>>> 6) The LTR feature store (which I guess is backed by ZK but could be on
>>>> the disk as well through more hoops...).
>>>> 7) Layering stuff in directly via Docker build files.
>>>>
>>>> These are each a little different, with varying levels of support.
>>>>
>>>> Let’s figure out how we can include a resource that is 10 KB, 1 MB or 1
>>>> GB and not have to think about ZooKeeper or any of the other implementation
>>>> details of backing that.    Let’s figure out where the package manager is
>>>> letting us down and keep working on it.
>>>>
>>>>
>>>>
>>>> On Jan 22, 2021, at 12:16 AM, David Smiley <dsmi...@apache.org> wrote:
>>>>
>>>> Summary:  I've been contemplating a simple enhancement to how SolrCloud
>>>> resolves files in a configSet:  when a file isn't in ZooKeeper, fall back
>>>> to resolving it from the same-named configSet on the file system (which
>>>> normally is ignored in SolrCloud today).  A further fallback to _default on
>>>> the file system could be useful as well.  The mutable space is always ZK if
>>>> you edit a schema or configOverlay.json or whatever.
>>>>
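
The lookup order in the summary above can be sketched roughly as follows. This is only an illustration under stated assumptions: the dict standing in for ZooKeeper, the flat `<root>/<configSet>/<file>` layout, and the function name are all hypothetical, and this is not how Solr's resource loading is actually implemented.

```python
from pathlib import Path


def resolve_config_file(filename, configset, zk_files, fs_root, default_name="_default"):
    """Return (source, content) for a configSet file.

    Order: ZooKeeper first, then the same-named configSet on the file system,
    then the _default configSet on the file system.

    zk_files: dict mapping (configset, filename) -> bytes, a stand-in for ZK.
    fs_root:  directory assumed to hold one sub-directory per configSet.
    """
    data = zk_files.get((configset, filename))
    if data is not None:
        return "zk", data

    # dict.fromkeys drops the duplicate when configset is already "_default".
    for name in dict.fromkeys((configset, default_name)):
        candidate = Path(fs_root) / name / filename
        if candidate.is_file():
            return "fs:" + name, candidate.read_bytes()

    raise FileNotFoundError(f"{filename} not found for configSet {configset}")
```

Writes are deliberately left out of the sketch, since the proposal keeps ZK as the only mutable space.
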
>>>> My primary motivation is allowing for upgrades to plugins, configs, or
>>>> Solr itself to be easier in some scenarios (certainly not all!).  Imagine
>>>> that you've got configOverlay.json (with some handlers defined) &
>>>> params.json & schema.xml in ZK, and solrconfig.xml on the file system, plus
>>>> some partial xml file of schema field types that is "xi:include"-ed by
>>>> schema.xml.  Assume that a custom Solr Docker image is used including
>>>> custom plugins, and with this configSet baked in.  One day you add some new
>>>> token filters, add a new Lucene merge policy, and remove some outdated
>>>> update request processor.  You do plugin code changes and xi:included
>>>> field type changes and edit solrconfig.xml, and build this into your latest
>>>> company Solr Docker image, and you get it deployed using Kubernetes.  Those
>>>> changes can be safe to deploy without touching any ZK resident configSet.
>>>> Other changes might not be (e.g. removing a field type that is referenced,
>>>> etc. or doing changes to analyzed text that are too incompatible requiring
>>>> a re-index) but my point is that some are, and this would be easier.
>>>>
>>>> An additional motivation is storing large relatively static common
>>>> resources on the file system.  Where I work, I've got over a gig of them
>>>> :-). This can be worked around with solr.allow.unsafe.resourceloading=true
>>>> but... it'd be nice to not have to resort to that.
>>>>
>>>> Another benefit would be to make it easier to separate one's own
>>>> configuration from that of the _default configSet you took from Solr when
>>>> starting a new project.  Resolving differences and then doing Solr upgrades
>>>> was a common task I had to do as a consultant and for my own Solr upgrades.
>>>> Granted this is possible today but perhaps if this overlay was
>>>> emphasized/embraced more, it would lead to this outcome.  It's still a
>>>> problem that a bare-bones solrconfig.xml & schema.xml are either too
>>>> bare-bones or say too much, and it's a separate issue for Solr to improve
>>>> that.
>>>>
>>>> A probably secondary, related issue: If the SolrCloud configSet ZK node
>>>> were to be optional instead of required (thus assume the configSet is
>>>> entirely on the file system), it would bring other benefits.  It would
>>>> allow users to use the "file store" or some network mounted storage (NFS)
>>>> as the configSet location.  It would accelerate experimentation with
>>>> SolrCloud in Docker locally. The biggest PITA anyone notices when first
>>>> exploring SolrCloud is that configs are fundamentally not on the file
>>>> system despite you seeing them there; it's all in ZK.  And there's no super
>>>> convenient way to edit the configuration, not even a web UI.  Using the
>>>> file system for configSets would be especially nice when doing local
>>>> SolrCloud experimentation in Docker, eliminating an annoying configSet
>>>> deployment step.
>>>>
>>>> I plan to file an issue of course but I think this deserved a dev list
>>>> discussion.
>>>>
>>>> I know the new package manager could help with my primary motivating
>>>> use-case, but I think there are too many obstacles there, at least at
>>>> present.  A file system fallback is a simple thing by comparison.
>>>>
>>>> Question:  Does the k8s Solr Operator do anything to make configSet &
>>>> plugin upgrades better?
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>>
>>>> _______________________
>>>> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
>>>> | http://www.opensourceconnections.com | My Free/Busy
>>>> <http://tinyurl.com/eric-cal>
>>>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>>>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>>>
>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>
