Re: Deprecate Schemaless Mode?

David Smiley Tue, 04 Aug 2020 22:02:07 -0700

Thanks for starting this thread Marcus!  For a historical note, the current
_default configSet being "data driven" (aka "schemaless", a worse name) is
largely because of SOLR-10272
<https://issues.apache.org/jira/browse/SOLR-10272>.  Maybe I should have
fought harder against it then.  I threatened to veto but I was placated by
it being easily disabled.  And it's true; you can disable it, and there are
some loud warnings on the CLI so... yeah.


I think my views align most closely with Gus's.  The name "default" suggests
good settings that you ought to change only if you know what you are doing.
Perhaps there simply can be no reasonable "default" for a search platform.
There might be a "basic minimal blah blah" etc. that _is_ the default choice
if you don't specify one, but naming the configSet itself "default" gives it
too much blessing.  I've seen too many configs with tons of stuff that was
there only because it was inherited, and then it's hard to guess what's
_actually_ being used.  Alexandre Rafalovitch has done some great work in
figuring out how to minimize configs.  There's more to do there.

I'd be happy to see basically any change though; even simply switching the
"data driven" URPs from opt-out to opt-in.  I don't like the status quo.
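
For reference, a minimal sketch of the opt-out as it exists today.  It assumes
the stock _default configSet on the 7.x/8.x line, where the field-guessing
chain is gated by the update.autoCreateFields user property.  The host, port
and collection name are placeholders and the helper function is hypothetical;
it only builds the Config API request and does not send it.

```python
import json
import urllib.request


def build_disable_autocreate_request(base_url, collection):
    """Build (but do not send) a Config API request that turns off field guessing.

    Assumes the collection uses a configSet whose field-guessing chain is
    controlled by the update.autoCreateFields user property.
    """
    url = f"{base_url.rstrip('/')}/solr/{collection}/config"
    payload = {"set-user-property": {"update.autoCreateFields": "false"}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host and collection name; adjust for a real cluster.
    req = build_disable_autocreate_request("http://localhost:8983", "mycollection")
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To apply it against a running Solr: urllib.request.urlopen(req)
```

Making the URPs opt-in would amount to flipping the default of that property,
so that a request like this one (with "true") is what turns guessing on.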

BTW I've also seen people try to take "bin/solr -e cloud" to production :-(
  "Hey look, this is how a tutorial told me to run SolrCloud" (so the logic
goes).

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Aug 4, 2020 at 2:24 PM Jan Høydahl <[email protected]> wrote:

> Learning mode won’t work if you have 10 existing collections and want to
> create #11. We could instead have a SchemaLearningUpdateHandler so people
> could explicitly post documents to, say, /schema-guess to modify the schema.
> We could even make this implicit. Then the _default config would have just
> _root_, id and a few more, and if you want guessing you first send a number
> of docs to the /schema-guess endpoint and then inspect in the schema browser
> what you got. That handler could support a param &reset=true which would
> wipe the schema to start guessing from scratch.
>
> Jan Høydahl
>
> On 4 Aug 2020, at 15:30, Gus Heck <[email protected]> wrote:
>
> 
> Interesting read. Might have changed now that we have authentication
> capabilities... but let's not thread jack :)
>
> On Tue, Aug 4, 2020 at 8:28 AM Erick Erickson <[email protected]>
> wrote:
>
>> Having the admin UI allow uploads may not be secure. When I had a similar
>> idea a long time ago it got shot down, see the discussion at:
>> https://issues.apache.org/jira/browse/SOLR-5287.
>>
>> I _think_ this is a different issue if the configs have to be residing on
>> the system, not coming in from outside, just FYI...
>>
>> > On Aug 3, 2020, at 7:03 PM, Gus Heck <[email protected]> wrote:
>> >
>> >
>> >
>> > On Mon, Aug 3, 2020 at 5:03 PM Erick Erickson <[email protected]>
>> wrote:
>> > Gus’s point about implementing something before removing it is well
>> taken, but we can deprecate it immediately without removing it. Gus’s point
>> about dynamic fields not being found until later in the cycle is well
>> taken, but not enough to persuade me.
>> >
>> > Fair enough :)
>> >
>> > I’m not enthusiastic about multiple getting started schemas. The whole
>> motivation behind schemaless is that the user doesn’t need to know about
>> schemas to get started. By providing multiple “getting started” schemas we
>> require them to become aware of schemas again.
>> >
>> > Here's my theory (which may or may not be persuasive :) )
>> >
>> > My thinking in that suggestion is that the majority of the problem is
>> due to the fact that people new to a technology will tend to latch onto the
>> defaults that come with something as being something that should be held
>> onto until you have a good reason to change it. This is reasonable because
>> changing things you don't understand willy nilly is often a road to pain.
>> And people DO want a safe starting point and we should give it to them
>> because it makes their life easier once they get a little further down the
>> road, but this is not compatible with the easy-start schemaless mode.
>> Looking at https://lucene.apache.org/solr/guide/8_5/solr-tutorial.html I
>> see that the initial tutorial experience is fully scripted, and the user
>> won't likely notice if they are told to ignore _default or guessing-proto
>> in favor of the tech products config set... BUT when they do get to the
>> point of looking at the config name they'll see the more descriptive name.
>> So rather than seeing "_default" and thinking "Ah ha! Here's something I
>> can take as gospel and not change until I have a reason!" they'll see
>> "guessing-proto" or "dynamic-proto" and say "Hunh, I wonder what that
>> means?" which is a good question for them to ask I think.
>> >
>> > The concept of a default carries a strong bias toward not touching it
>> (IMHO), which will be wrong most of the time no matter what we give them as
>> a default. If something must be a default I'd favor a non-managed,
>> non-dynamic, non-guessing minimal schema with the required fields, and an
>> id field, maybe a _text_ field, and a comment pointing to the section of
>> the ref guide where they can copy and paste in all the stuff that's
>> currently in our base schema as example (things like the text_ga type), IF
>> they want it. I get really tired of seeing mile long schemas that have a
>> ton of unused stuff that is retained because people didn't know if they
>> needed it or not...
>> >
>> > Note that not having some default would break back compat on bin/solr,
>> but changing the default is also a break of sorts.
>> >
>> >
>> > All that said, maybe we could rethink the approach. My two objections
>> are:
>> > 1> schemaless, by updating the schema based on a very small sample set,
>> is very susceptible to failing early and often
>> > 2> Constantly updating the config in ZK and reloading the collections
>> seems very hard to get right.
>> >
>> > I have for some time thought the inability to upload and download a
>> config (or files within a config) via the web UI was a gap. But I found it
>> easier to write
>> https://plugins.gradle.org/plugin/com.needhamsoftware.solr-gradle than
>> add that feature to the UI :)
>> >
>> > So I can imagine a “getting started” mode that indexed to the glob
>> field while creating a schema. Ideally, it would be necessary to enable it
>> specifically rather than have it be the default. I’d imagine this being
>> coupled with some kind of “export schema” button. So the process would be
>> > > start Solr with -Dsolr.learningmode.config=some_config_name.
>> > > index a bunch of documents, perhaps prototyping the search app on the
>> dynamic glob field.
>> > > The admin UI should have a big, intrusive banner saying “RUNNING IN
>> LEARNING MODE” with instructions on what to do next.
>> > > In that mode there’d need to be a “save schema” button or something.
>> What I’d like that to do would be examine the index and write a new schema
>> somewhere. If this was the mode, then you’d be able to run it any time.
>> >
>> > +1 for anything that makes a round-trip of working with the schema
>> easier, but not really a fan of learning mode.
>> >
>> >
>> >
>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
>
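
To make the first objection quoted above concrete (a schema inferred from a
very small sample set fails early and often), here is a toy sketch.  It is not
Solr's actual update processor chain: the function names and the three type
names are assumptions chosen for illustration.  Each field's type is fixed
from the first value seen, and later values that do not fit are rejected.

```python
def guess_type(value):
    """Guess a field type from one string value, narrowest type first."""
    for name, parse in (("long", int), ("double", float)):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    return "string"


def accepts(field_type, value):
    """Return True if a value fits a field type that was already fixed."""
    if field_type == "string":
        return True
    parse = int if field_type == "long" else float
    try:
        parse(value)
        return True
    except ValueError:
        return False


def index_with_guessing(docs):
    """Fix each field's type from the first value seen; collect later rejects."""
    schema, rejected = {}, []
    for position, doc in enumerate(docs):
        for field, value in doc.items():
            field_type = schema.setdefault(field, guess_type(value))
            if not accepts(field_type, value):
                rejected.append((position, field, value, field_type))
    return schema, rejected


if __name__ == "__main__":
    docs = [
        {"title": "Solr in Action", "price": "10"},
        {"title": "Lucene in Action", "price": "12.50"},
        {"title": "Out of print", "price": "N/A"},
    ]
    schema, rejected = index_with_guessing(docs)
    print(schema)    # price was fixed as "long" from the first document
    print(rejected)  # the second and third documents no longer fit
```

The first document makes "price" look like an integer, so the perfectly
reasonable values in the next two documents fail.  A hand-written schema, or a
guess taken over a larger sample, would not have this problem.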
