I almost never use schemaless mode (better named "schema guessing mode"),
and I would never recommend it for anything beyond prototyping. The primary
use I see for it is to throw a bunch of data at it to get a starting point
for a schema. Say, for example, you want to see what Tika is going to
produce for metadata before solidifying what you will and will not rely on.
I think the ability to suggest a schema is valuable and shouldn't go away.
I'm all for not having it be the default configuration, however, and I
really like the suggestions linked in the ticket for features that consider
a number of documents before trying to guess the schema. If we implement
one of those, I'd be for deprecating and eventually removing the current
schemaless mode, but not before.
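
To make the "consider a number of documents before guessing" idea concrete,
here is a rough sketch in plain Python. It is not Solr code and not what the
ticket proposes in detail; the function names, the type names (long, double,
string) and the widening order are all assumptions I made up for
illustration. The point is only that looking at every sample value before
choosing a type makes the result independent of document order.

```python
from typing import Dict, Iterable

# Hypothetical type names, ordered from narrowest to widest (an assumption,
# not Solr's actual field type list).
_WIDENING = ["long", "double", "string"]


def guess_type(raw: str) -> str:
    """Return the narrowest hypothetical type that can hold this one value."""
    for name, parse in (("long", int), ("double", float)):
        try:
            parse(raw)
            return name
        except ValueError:
            pass
    return "string"


def suggest_schema(docs: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Suggest one type per field after looking at every sample document."""
    schema: Dict[str, str] = {}
    for doc in docs:
        for field, raw in doc.items():
            guessed = guess_type(raw)
            current = schema.get(field)
            # Keep the widest type seen so far, so a later "12.5" widens an
            # earlier "12" instead of being rejected or mis-typed.
            if current is None or _WIDENING.index(guessed) > _WIDENING.index(current):
                schema[field] = guessed
    return schema


sample_docs = [
    {"pages": "12", "title": "a report"},
    {"pages": "12.5"},
    {"pages": "3"},
]
suggested = suggest_schema(sample_docs)
print(suggested)
```

With the sample above, "pages" ends up as double no matter which document
arrives first, whereas guessing from the first document alone would give a
different answer depending on order.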

The ticket contains a suggestion of adding a catch-all '*' dynamic field,
but we should make sure to indicate that this is ALSO not typically good
for production use. One garbage (or malicious) document can explode the
number of fields in the index. Forgetting to add a properly typed field can
go unnoticed until much further down the development cycle (e.g. not caught
until a user tries to sort on it and gets 1, 10, 11, 2, ...). And data
silently indexed into typo variants of a field name causes dev churn, etc.
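
For anyone who hasn't been bitten by the sort problem, here is a tiny
illustration in plain Python (not Solr; the variable names are just for this
example). Values that land in a text/string field sort character by
character, while the same values in a numeric field sort by magnitude.

```python
# The same values, once stored as strings and once as integers.
ids_as_strings = ["1", "2", "10", "11"]

# String comparison goes character by character, so "10" sorts before "2".
string_order = sorted(ids_as_strings)

# Numeric comparison gives the order a user actually expects.
numeric_order = sorted(int(v) for v in ids_as_strings)

print(string_order)   # ['1', '10', '11', '2']
print(numeric_order)  # [1, 2, 10, 11]
```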

Perhaps we should distribute more than one pre-baked config set and
label none of them as "default"? I'd suggest maybe:

   - guessing-proto --> our current _default, possibly refined, for
     prototyping
   - dynamic-proto --> a schema based on dynamic fields, with a '*' default
     to text_general, as an alternative prototyping tool less dependent on
     data order but requiring more editing
   - managed-min --> a base on which to build a production-quality managed
     schema
   - static-min --> a base on which to build a production-quality classic
     (non-managed) schema

Also +1 to renaming the feature away from "Schemaless" to "Schema Guessing"

-Gus

On Mon, Aug 3, 2020 at 11:33 AM Marcus Eagan <marcusea...@gmail.com> wrote:

> Community,
>
> There are many of us that have had to deal with the pain of managing the
> schemaless mode of operation in Solr. I'm curious to get others' thoughts
> about how well it is working for them and if they would like to continue to
> use it.
>
> I for one don't think Schemaless works as intended and favor deprecating
> it and replacing it with something more usable, but I am sure others have
> thoughts here.
>
> Is anyone on this list using schemaless mode in production or have you
> tried to?
>
> A preliminary discussion has occurred in this Jira ticket:
> https://issues.apache.org/jira/browse/SOLR-14701
>
> Thank you all,
>
> Marcus Eagan
>
>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
