Summary and discussion request for Solr 9 blocker issue SOLR-14701 (schemaless mode)

Alexandre Rafalovitch Sat, 20 Mar 2021 19:15:52 -0700

Hello,

I was asked to draw attention to, and summarize, the crossroads we
have reached with the current Solr 9 blocker issue about the
"schemaless mode".

Basically, the current schemaless mode is broken in two main ways, one
familiar and one that is a bit more obscure:
1) Familiar: The schemaless mode updates the schema the first time it
sees a value for a field, which means it can fail the second time it
sees that field if the new value does not fit the guessed type. The
solution requires pre-registering troublesome fields (e.g. in our
films example) and/or removing errant fields from the updated schema.
2) Obscure: Even if schemaless mode works, we currently say to disable
it when it goes into production and provide instructions for that.
But schemaless mode does not only update the schema, it also does
custom parsing. This is especially obvious for Date parsing, as we
allow 7 different formats rather than just one. So, when schemaless
mode is disabled as we recommend, the indexing that worked before will
start failing for those cases in very non-obvious ways (see
https://github.com/apache/solr/blob/9eaefdf6b6178042bd26420fede08c0db65c45f3/solr/server/solr/configsets/_default/conf/solrconfig.xml#L1001-L1011).
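To make the second point concrete, here is a minimal Python sketch of
the effect, not Solr code. The format lists and the function names
(parse_lenient, parse_strict) are assumptions made up for
illustration; the real patterns live in the solrconfig.xml lines
linked above. The lenient parser stands in for indexing with the
schemaless chain enabled, the strict one for indexing after it has
been disabled.

```python
from datetime import datetime, timezone

# Hypothetical lenient patterns for illustration only; this is NOT the
# actual list shipped in Solr's _default configset.
LENIENT_FORMATS = ["%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]

# Stand-in for the single canonical date format accepted without the
# extra parsing step.
STRICT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_lenient(value):
    """Try each pattern in turn, as a multi-format date parser would."""
    for fmt in LENIENT_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"no lenient format matched: {value!r}")


def parse_strict(value):
    """Accept only the canonical format; anything else raises ValueError."""
    return datetime.strptime(value, STRICT_FORMAT).replace(tzinfo=timezone.utc)
```

With these definitions, parse_lenient("2021-03-20") succeeds while
parse_strict("2021-03-20") raises ValueError: the same document that
indexed cleanly before now fails, even though the schema itself did
not change.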

This has been discussed in SOLR-14701, SOLR-11741 and even SOLR-6939,
with most suggestions focusing on alternatives to 1) above, but
ignoring 2). The most popular idea is to generate schema-altering
commands, JSON, or similar.

I have tried to build a solution that deals with both of the above points
in https://github.com/apache/lucene-solr/pull/1863 . I felt I was
implementing Hoss' proposal from SOLR-6939 by batching the documents'
mappings and then doing the type widening and final schema creation on
commit. I do not have a specific client that needs it, so I was trying
to do something that fixes the pain points instead of completely
reimagining it. My solution is still not fantastic, but - I feel - it
does address both issues above.
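As a rough illustration of the difference between guessing on first
sight and widening over a batch, here is a toy Python sketch. It is
not the code from the PR: the three type names, the widening order
and all function names are assumptions chosen to keep the example
small.

```python
# Toy widening order, narrowest to widest. The names are illustrative,
# not Solr field types.
ORDER = ["long", "double", "string"]


def guess_type(value):
    """Guess the narrowest toy type that can hold a raw string value."""
    try:
        int(value)
        return "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def widen(a, b):
    """Return the wider of two toy types."""
    return ORDER[max(ORDER.index(a), ORDER.index(b))]


def first_seen_schema(docs):
    """Fix each field's type from the first value seen."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            schema.setdefault(field, guess_type(value))
    return schema


def widened_schema(docs):
    """Collect guesses over the whole batch and widen before deciding."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            guessed = guess_type(value)
            schema[field] = widen(schema[field], guessed) if field in schema else guessed
    return schema
```

For a batch such as [{"rating": "4"}, {"rating": "4.5"}], the
first-seen approach settles on "long" and the second document no
longer fits, while the batch approach settles on "double" before any
schema is created.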

However, the PR discussion, which boomeranged back into SOLR-14701, has
gotten stuck, as multiple people were not able to get onto the same
page about the specific implementation points.

So, I marked the issue as a blocker, to help us recognize that we are
at a crossroads. The options I see are:
1) We can unmark the issue and just keep shipping the broken
implementation with advice we know to be wrong; probably for a very
long time as (just as my data point) it was a lot of effort to even
understand the logic of the current AddSchemaFieldsUpdateProcessorFactory.
2) We can rip out this mode altogether and/or move it into a plugin.
3) We can adopt my solution, possibly with some minor adjustments (and
deprecate/remove from config the other one).
4) Somebody other than me can do something completely different. I
failed to understand the alternative proposals, despite trying very
hard, and my own 'alternative' proposal is very, very different.

I, myself, no longer have a preferred position on this issue. But I
was asked to bring it back to the community anyway, just in case the
time and summary will help to move this forward.

Regards,
Alex.
