[jira] [Commented] (SOLR-14701) Deprecate Schemaless Mode (Discussion)

Erick Erickson (Jira) Mon, 03 Aug 2020 05:28:25 -0700


    [ https://issues.apache.org/jira/browse/SOLR-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169984#comment-17169984 ]


Erick Erickson commented on SOLR-14701:
---------------------------------------

I don't agree that we can rescue this code for the following reasons:

1> When we guess wrong, you can't index some documents. For instance, the first 
time a field is indexed that contains "1", an integer field is created. A doc 
that has, say, "1.0" in that field fails because it's not an integer. And don't 
even get me started on dates.

2> The mechanism for updating the schema is fragile. You can have many shards 
trying to update ZK's configset at the same time, leading to instability even if 
it does "do the right thing".

3> It's another instance of complex code that we have to maintain. Actually, I 
don't think we are maintaining it. And there are consistent failures in that 
code lately that aren't getting attention. 

4> We don't really deliver "schemaless". What we deliver is something that 
doesn't (and can't) work correctly. There have been proposals to, say, have a 
"learning mode" that doesn't really index docs, just assembles a schema based 
on N documents that'll index all of them, then use that schema. That would 
reduce the problem, but it would still fail in some cases.

5> We could improve it around the edges forever trying to make it not fail so 
regularly, and the users _still_ have to go in and tweak the schema.

6> The whole point of schemaless mode is that a user can just start indexing 
docs without having to deal with managing a schema. They'll have to get into 
the schema anyway eventually for anything except the most trivial corpus. So 
the suggestion to index every new field as a text field by using a 
dynamicField lets them do that without all the baggage.

7> Version control is another hidden gotcha. The schema is changing willy-nilly 
on ZooKeeper and users have to take periodic snapshots and store them away 
somewhere if they want to preserve it. So now you have a case where, say, they 
need to re-index the corpus. If they do it to a new collection, the resulting 
schema may well be different, if it works at all. How could it fail? Well, the 
first doc originally indexed has a field with 1.0 and becomes a float that 
indexes 1 fine in subsequent docs. Next time 'round the order is reversed for 
some reason, so the field is created as an integer and the doc with 1.0 fails.

8> Big fat warning or not, it doesn't necessarily even work for non-production 
code.
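
To make the order dependence in 1> and 7> concrete, here is a toy model in 
Python. It is only a sketch: guess_type, accepts and index_all are 
hypothetical names, and the rules are far simpler than whatever Solr's 
field-guessing code actually does. The only thing it is meant to show is that 
locking the type on the first value makes the outcome depend on document order.

```python
def guess_type(value: str) -> str:
    """Guess a field type from one raw string value (toy rules, an assumption)."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"


def accepts(field_type: str, value: str) -> bool:
    """Return True if a field of the given type can hold the value."""
    if field_type == "int":
        return guess_type(value) == "int"
    if field_type == "float":
        return guess_type(value) in ("int", "float")
    return True  # a text field accepts anything


def index_all(values):
    """Lock the field type on the first value, then collect rejected values."""
    field_type = guess_type(values[0])
    rejected = [v for v in values if not accepts(field_type, v)]
    return field_type, rejected


print(index_all(["1", "1.0"]))  # type locked as int, "1.0" is rejected
print(index_all(["1.0", "1"]))  # type locked as float, nothing is rejected
```

Same two documents, opposite order, different schema and different failures.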

Hmmm, though if we wanted to help them make a real schema, we could write 
something that processed an existing index and produced an example schema they 
could tweak, or even use as-is although I'd rather not have it be automatic.
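
A rough sketch of what such a tool might do, again in Python and again with 
hypothetical names (WIDTH, guess_type, infer_schema). It works on plain dicts 
rather than a real index, and the three-type ordering is an assumption made 
for illustration. The idea is to scan every document first and keep the widest 
type seen per field, so the result does not depend on document order.

```python
# Hypothetical widening order: a wider type can hold every narrower value.
WIDTH = {"int": 0, "float": 1, "text": 2}


def guess_type(value: str) -> str:
    """Guess a field type from one raw string value (toy rules, an assumption)."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"


def infer_schema(docs):
    """Return {field: widest type seen} across all sample documents."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            guessed = guess_type(str(value))
            current = schema.get(field)
            if current is None or WIDTH[guessed] > WIDTH[current]:
                schema[field] = guessed
    return schema


sample = [{"price": "1"}, {"price": "1.0"}, {"title": "some words"}]
print(infer_schema(sample))
```

The output is a suggestion for the user to review, not something to apply 
automatically, and a document outside the sample can still carry a value the 
inferred type rejects.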

So if we focus on "let the user index and search documents OOB without schema 
definition being a barrier to entry", I claim we can create a much simpler 
solution with minimal effort and not carry this albatross going forward. Of 
course we're still supporting managed-schema, that's a whole different kettle 
of fish.
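
For illustration, the "index every new field as text" idea from 6> could be 
expressed as a single catch-all dynamicField added through the Schema API. The 
Python below only builds the request body and URL and sends nothing. The 
field type name text_general, the multiValued setting, the host and port, and 
the collection name "demo" are all assumptions, as are the helper names.

```python
import json


def catch_all_text_field(field_type="text_general"):
    """Build a Schema API body that maps any unknown field to a text type."""
    return {
        "add-dynamic-field": {
            "name": "*",          # matches every field not declared explicitly
            "type": field_type,   # assumed to exist in the configset
            "indexed": True,
            "stored": True,
            "multiValued": True,  # assumption: tolerate repeated values
        }
    }


def schema_url(collection, base="http://localhost:8983/solr"):
    """Build the schema endpoint URL for a collection (host is an assumption)."""
    return f"{base}/{collection}/schema"


print(schema_url("demo"))
print(json.dumps(catch_all_text_field(), indent=2))
```

With something like this in place no type guessing happens at index time, so 
document order no longer matters; users refine the schema by hand once they 
know what their fields really are.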

> Deprecate Schemaless Mode (Discussion)
> --------------------------------------
>
>                 Key: SOLR-14701
>                 URL: https://issues.apache.org/jira/browse/SOLR-14701
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>            Reporter: Marcus Eagan
>            Priority: Major
>
> I know this won't be the most popular ticket out there, but I am growing more 
> and more sympathetic to the idea that we should rip many of the freedoms out 
> that cause users more harm than not. One of the freedoms I saw time and time 
> again to cause issues was schemaless mode. It doesn't work as named or 
> documented, so I think it should be deprecated. 
> If you use it in production reliably and in a way that cannot be accomplished 
> another way, I am happy to hear from more knowledgeable folks as to why 
> deprecation is a bad idea. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
