[jira] [Updated] (SOLR-10229) See what it would take to shift many of our one-off schemas used for testing to managed schema and construct them as part of the tests

Erick Erickson (JIRA) Mon, 03 Jul 2017 22:23:42 -0700

     [ https://issues.apache.org/jira/browse/SOLR-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Erick Erickson updated SOLR-10229:
----------------------------------
    Attachment: SOLR-10229-straw-man.patch

Here's a straw-man re-working of Amrit's patch. It's got nocommits, lacks 
javadocs etc. I wanted to see what people thought.

In order to be used, I think it's imperative that this be easy. It must not be 
(much) harder to use than just adding a bunch of new stuff to a new schema file.

So the basic pattern here to specify the fields that you need is:

{code}
    SchemaFrameworkFactory.SchemaFramework fac = SchemaFrameworkFactory.getFramework();
    // Start from the minimal managed schema (id and version fields only).
    initCore("solrconfig-id-and-version-fields-only.xml", "schema-id-and-version-fields-only.xml");

    // Pull the named fieldTypes from schema-mother.xml into this core's schema.
    fac.addFieldTypesFromMother(h.getCore(), "whitespace", "tint", "tints", "tfloat", "tfloats",
        "tlong", "tlongs", "tdouble", "tdoubles",
        "text_mock");
    // Each row is a flat list of attribute name/value pairs for one field.
    // "string" is already defined by the id-and-version-only schema.
    fac.addFields(h.getCore(), new String[][] {
        {"name", "f_ti", "type", "tint", "indexed", "true"},
        {"name", "f_tis", "type", "tints", "indexed", "true"},
        {"name", "f_tf", "type", "tfloat", "indexed", "true"},
        {"name", "f_tfs", "type", "tfloats", "indexed", "true"},
        {"name", "f_tl", "type", "tlong", "indexed", "true"},
        {"name", "f_tls", "type", "tlongs", "indexed", "true"},
        {"name", "f_td", "type", "tdouble", "indexed", "true",
            "docValues", Boolean.toString(Boolean.getBoolean(NUMERIC_DOCVALUES_SYSPROP))},
        {"name", "f_tds", "type", "tdoubles", "indexed", "true"},
        {"name", "*_s", "type", "string", "indexed", "true"},
        {"name", "*_ws", "type", "whitespace", "indexed", "true"},
        {"name", "*_ss", "type", "string", "indexed", "true", "multiValued", "true"},
        {"name", "*_t", "type", "text_mock", "indexed", "true"},
        {"name", "val_i", "type", "tint", "indexed", "true"}
    });
{code}
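
The rows passed to addFields above are flat name/value lists, so a typo such as a missing value is easy to make. Below is a minimal sketch of how such a row could be checked and turned into an attribute map before being handed to the managed schema code. The FieldSpec class and its toAttributes method are hypothetical names used for illustration only; they are not part of the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch: validates one addFields row.
final class FieldSpec {

  /** Turns {"name", "f_ti", "type", "tint", ...} into an ordered attribute map. */
  static Map<String, String> toAttributes(String... pairs) {
    if (pairs.length % 2 != 0) {
      throw new IllegalArgumentException(
          "Expected name/value pairs but got " + pairs.length + " entries");
    }
    Map<String, String> attrs = new LinkedHashMap<>();
    for (int i = 0; i < pairs.length; i += 2) {
      if (attrs.containsKey(pairs[i])) {
        throw new IllegalArgumentException("Duplicate attribute: " + pairs[i]);
      }
      attrs.put(pairs[i], pairs[i + 1]);
    }
    // Every field definition needs at least a name and a type.
    if (!attrs.containsKey("name") || !attrs.containsKey("type")) {
      throw new IllegalArgumentException("Both 'name' and 'type' are required: " + attrs);
    }
    return attrs;
  }

  public static void main(String[] args) {
    // prints {name=f_ti, type=tint, indexed=true}
    System.out.println(toAttributes("name", "f_ti", "type", "tint", "indexed", "true"));
  }
}
```

Failing fast with a message that names the offending row would keep the error close to the test that defined the field, rather than surfacing later as an "undefined field" failure.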

At this point I'm more interested in what people think of the two tests I've 
re-worked to use this mechanism. From the perspective of writing tests, does 
this seem easy to grasp? How could it be easier? And the 
NUMERIC_DOCVALUES_SYSPROP took me a while to figure out. I picked that test at 
random and it turned out to be trickier than I thought.

Some notes (and this is a work in progress):

- Siiiigggghhhh. To reduce the number of config files I had to add some. 
Eventually others will go away but for the nonce... How much do we want to cut 
down on the new solrconfig file? The spellcheck stuff is still in there for 
instance. It's a copy of solrconfig.xml with managed schema in it.

- We could do something very similar with solrconfig I should think, but one 
thing at a time.

- Use the pre-existing fieldTypes in the schema-mother.xml file if at all 
possible. If you think the particular ones you're adding will be generally 
useful, just add them to schema-mother.xml. Eventually the schema-mother.xml 
will be quite big. However, it's only loaded once.
-- the SchemaFrameworkFactory has a pass-through for defining new field types 
to the managed schema code. I haven't tried it yet though. I don't see a way to 
make adding new fieldTypes easy, so I don't see a benefit in adding another 
layer. Suggestions for how to make adding a field type simple are very welcome.

- You _can_ redefine the id field as above if you want to change it to a 
different type for some specific reason. Currently it's a string.

- There are the beginnings of randomizing field options that a test leaves 
unspecified, for a few options at present. This could be made more sophisticated.
-- Make stored/dv (when unspecified) be one or the other or both?
-- This means that if you _don't_ specify, say, docValues but rely on them, your 
tests will pass sometimes and fail other times. So far I've gotten pretty good 
error messages when that happens, and the fix is just to add the option you 
need when you add the field.

- When making changes to the schema and config files, I sometimes have to 
rebuild IntelliJ (ant clean-idea idea). Haven't tracked down why yet.

- Haven't approached making this work with SolrCloud tests. One step at a time. 
DistributedQueryElevationComponentTest fails for instance with "undefined 
field".

- There are id-and-version-only configs (solrconfig and schema) that are meant 
to be used in tandem. The critical bit is that you need to have the schema be 
defined as a managed resource.
-- It's a little awkward that the id-and-version-only schema has a "string" and 
"long" type. But we need to load the file to change the file and I don't see a 
way to add a <uniqueKey> via the managed schema code. I'm not sure we want to 
force every test to define these anyway. It does take a little getting used to 
before you realize you do _not_ have to add "string" and "long" as types in 
your test. Live with it.

- We'd better not persist configs. I did have a couple of instances where my 
schema-mother.xml was renamed to schema-mother.xml.bak but that went away once 
I got the loading straightened out.

- I'm not sure what I think about the "mother" bits. Anyone have a better name?

- I do like how this approach couples the field definitions with the test. The 
randomizing will also help catch hidden assumptions. So far it isn't onerous to 
add the fields one needs for a test.

- Perhaps we should do some variant of "hungarian notation" for the fieldTypes 
in schema-mother.xml? There should be a way to quickly find out what field 
types are available without having to look at every single one. A field of 
"name" for instance doesn't tell us anything about it. We already use suffixes 
like _s and _ss for string and multivalued string fields, for instance.
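
For the randomizing of unspecified options noted in the list above, here is a minimal sketch of one possible answer to the "one or the other or both" question. It assumes field attributes are held in a simple map; the FieldOptionRandomizer class and the fillUnspecified method are hypothetical names, not code from the patch. Explicitly specified values are never overridden, and when neither stored nor docValues is given, at least one of them ends up true.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch, not part of the patch.
final class FieldOptionRandomizer {

  /**
   * Returns a copy of attrs in which "stored" and "docValues" are filled in
   * at random when the test did not specify them. Explicit values are kept.
   */
  static Map<String, String> fillUnspecified(Map<String, String> attrs, Random random) {
    Map<String, String> out = new LinkedHashMap<>(attrs);
    boolean storedGiven = out.containsKey("stored");
    boolean docValuesGiven = out.containsKey("docValues");
    if (!storedGiven && !docValuesGiven) {
      // 0 = stored only, 1 = docValues only, 2 = both.
      int choice = random.nextInt(3);
      out.put("stored", Boolean.toString(choice != 1));
      out.put("docValues", Boolean.toString(choice != 0));
    } else {
      // Only one of the two was left out: randomize just that one.
      if (!storedGiven) {
        out.put("stored", Boolean.toString(random.nextBoolean()));
      }
      if (!docValuesGiven) {
        out.put("docValues", Boolean.toString(random.nextBoolean()));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> field = new LinkedHashMap<>();
    field.put("name", "f_ti");
    field.put("type", "tint");
    // A fixed seed makes the chosen combination reproducible.
    System.out.println(fillUnspecified(field, new Random(42L)));
  }
}
```

In a Lucene/Solr test the Random would presumably come from the test framework's seeded random() so that a failure caused by a particular combination can be reproduced. A real version would also need to account for field types that cannot take every option.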

I want to emphasize that this is a WIP. We wanted to get this to the point of 
being able to see what it looked like in a couple of tests and invite comments, 
so fire away. I'm particularly looking at whether people think the _approach_ 
makes sense or if there are alternate ways we could approach it that are better.




> See what it would take to shift many of our one-off schemas used for testing 
> to managed schema and construct them as part of the tests
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10229
>                 URL: https://issues.apache.org/jira/browse/SOLR-10229
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: SOLR-10229.patch, SOLR-10229.patch, SOLR-10229.patch, 
> SOLR-10229.patch, SOLR-10229.patch, SOLR-10229.patch, 
> SOLR-10229-straw-man.patch
>
>
> The test schema files are intimidating. There are about a zillion of them, 
> and making a change in any of them risks breaking some _other_ test. That 
> leaves people three choices:
> 1> add what they need to some existing schema. Which makes schemas bigger and 
> bigger and bigger.
> 2> create a new schema file, adding to the proliferation thereof.
> 3> Look through all the existing tests to see if they have something that 
> works.
> The recent work on LUCENE-7705 is a case in point. We're adding a maxLen 
> parameter to some tokenizers. Putting those parameters into any of the 
> existing schemas, especially to test < 255 char tokens is virtually 
> guaranteed to break other tests, so the only safe thing to do is make another 
> schema file. Adding to the multiplication of files.
> As part of SOLR-5260 I tried creating the schema on the fly rather than 
> creating a new static schema file and it's not hard. WDYT about making this 
> into some better thought-out utility? 
> At present, this is pretty fuzzy, I wanted to get some reactions before 
> putting much effort into it. I expect that the utility methods would 
> eventually get a bunch of canned types. It's reasonably straightforward for 
> primitive types, if lengthy. But when you get into solr.TextField-based types 
> it gets less straightforward.
> We could manage to just move the "intimidation" from the plethora of schema 
> files to a zillion fieldTypes in the utility to choose from...
> Also, forcing every test to define the fields up-front is arguably less 
> convenient than just having _some_ canned schemas we can use. And erroneous 
> schemas to test failure modes are probably not very good fits for any such 
> framework.
> [~steve_rowe] and [[email protected]] in particular might have 
> something to say.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
