[jira] [Commented] (SOLR-3250) Dynamic Field capabilities based on value not name
[ https://issues.apache.org/jira/browse/SOLR-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673197#comment-13673197 ]

Steve Rowe commented on SOLR-3250:
----------------------------------

Now that we have the ability to dynamically add schema fields (SOLR-3251), I want to push forward on this issue.

Value-based dynamic field capabilities for document updates, which I'll sometimes refer to as "schemaless mode", will a) determine the type for field names that don't match explicit or dynamic fields in the schema; b) add these field names to the schema with their determined types; and c) complete the document update request as normal. This process should apply equally to new doc additions, atomic updates, and regular updates.

In a conversation with [~hossman_luc...@fucit.org] about this feature, he suggested that configuration for parsing/converting {{String}}-typed field values into the appropriate Java objects could be separated from configuration of mappings from Java object types to schema field types. In this way, components built for schemaless mode could be reused for other purposes.

JSON and Javabin content streams already carry some type information for their field values. The {{ContentStreamLoader}}-s corresponding to these, {{JsonLoader}} and {{JavabinLoader}}, should set field value object types in the {{SolrInputDocument}} according to the content stream's data types. (Currently {{JavabinLoader}} does this correctly, but {{JsonLoader}} stores everything as {{String}}-s; this will need to be fixed.) As a result, for the Java object types supported by these content streams and their loaders (as well as by other update processors, etc. that set field values' Java object types), {{String}} parsing/conversion won't be required, and only the mappings from Java object type to schema field type will be necessary to determine the schema field type for new fields.
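To make the second half of that split concrete, here is a minimal sketch of a mapping from Java object types to schema field types. It is an illustration only, not Solr code: the class name {{TypeMappingSketch}}, the method {{fieldTypeFor}}, and the field type names used here ({{tlong}}, {{tdouble}}, {{tdate}}, {{boolean}}, {{text_general}}) are all assumptions picked for the example.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps the Java type of a field value to a schema field type name.
final class TypeMappingSketch {
    // Assumed mappings; a real configuration would supply these.
    private static final Map<Class<?>, String> MAPPINGS = new LinkedHashMap<>();
    static {
        MAPPINGS.put(Boolean.class, "boolean");
        MAPPINGS.put(Long.class, "tlong");
        MAPPINGS.put(Integer.class, "tlong");
        MAPPINGS.put(Double.class, "tdouble");
        MAPPINGS.put(Float.class, "tdouble");
        MAPPINGS.put(Date.class, "tdate");
    }

    // Assumed fallback for values with no mapped type (for example, plain Strings).
    static final String DEFAULT_FIELD_TYPE = "text_general";

    // Returns the mapped field type name, or the default when no mapping applies.
    static String fieldTypeFor(Object value) {
        for (Map.Entry<Class<?>, String> entry : MAPPINGS.entrySet()) {
            if (entry.getKey().isInstance(value)) {
                return entry.getValue();
            }
        }
        return DEFAULT_FIELD_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(fieldTypeFor(42L));
        System.out.println(fieldTypeFor(1.5));
        System.out.println(fieldTypeFor("plain text"));
    }
}
```

Because the lookup depends only on the value's Java type, it behaves the same whether that type came from a typed content stream or from an earlier {{String}} parsing step.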
[On SOLR-2802|https://issues.apache.org/jira/browse/SOLR-2802?focusedCommentId=13117911&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13117911], Hoss wrote that {{FieldMutatingUpdateProcessor}}-s that parsed dates, numbers and booleans would be generally useful. I plan on going that route to implement {{String}}-typed field value parsing. These field value parsing update processors should operate on {{String}}-valued fields that either a) are not in the schema, or b) have a schema field type with an appropriate {{typeClass}}.

After the new parsing update processors detect and convert field values to the appropriate Java object types, an update processor that adds fields to the schema as needed can be configured with a mapping from Java object type to schema field type.

Here is the list of things I think need to happen - I plan on making JIRA issues for each of these:

# Fix {{JsonLoader}} to create field values using the JSON-supplied type, rather than making everything a {{String}}.
# Add a new field update processor selector that will configure the processor to select fields that match any schema field, or that match no schema field, depending on its boolean parameter: {{bool name=fieldNameMatchesSchemaField}}
# Add new {{FieldMutatingUpdateProcessorFactory}} subclasses {{ParseFooUpdateProcessorFactory}}, where {{Foo}} includes {{Date}}, {{Double}}, {{Long}}, and {{Boolean}}. If they see a field value that is not {{String}}-valued, or can't parse the value, they will ignore it and leave it as is. For multi-valued fields, they should be all-or-nothing.
# Add a new {{AddSchemaFieldsUpdateProcessorFactory}}, with configurable mappings from Java object type to schema field type, that will dynamically add fields to the schema, as needed.
# Add a new example config set for schemaless mode.
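The all-or-nothing rule in item 3 of the list above can be illustrated with a small standalone sketch for the {{Long}} case. It does not use or extend Solr's {{FieldMutatingUpdateProcessorFactory}}; the class {{ParseLongSketch}} and its methods are hypothetical names, and a plain {{List}} stands in for a field's values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of all-or-nothing parsing for a (possibly multi-valued) field.
final class ParseLongSketch {
    // Returns the parsed values only if every value is a String that parses as a long.
    static Optional<List<Object>> parseAllOrNothing(List<Object> values) {
        List<Object> parsed = new ArrayList<>(values.size());
        for (Object value : values) {
            if (!(value instanceof String)) {
                return Optional.empty(); // non-String value: leave the field alone
            }
            try {
                parsed.add(Long.valueOf(((String) value).trim()));
            } catch (NumberFormatException e) {
                return Optional.empty(); // unparseable value: leave the field alone
            }
        }
        return Optional.of(parsed);
    }

    // Replaces the values with parsed Longs, or returns them untouched on any failure.
    static List<Object> mutate(List<Object> values) {
        return parseAllOrNothing(values).orElse(values);
    }

    public static void main(String[] args) {
        System.out.println(mutate(Arrays.<Object>asList("1", "2", "3")));
        System.out.println(mutate(Arrays.<Object>asList("1", "two", "3")));
    }
}
```

Returning the original list untouched on any failure means a later processor (for example, a hypothetical double or date parser in the same chain) still sees the unmodified {{String}} values.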
Dynamic Field capabilities based on value not name
--------------------------------------------------

Key: SOLR-3250
URL: https://issues.apache.org/jira/browse/SOLR-3250
Project: Solr
Issue Type: Improvement
Reporter: Grant Ingersoll

In some situations, one already knows the schema of their content, so having to declare a schema in Solr becomes cumbersome. For instance, if you have all your content in JSON (or can easily generate it) or other typed serializations, then you already have a schema defined. It would be nice if we could have support for dynamic fields that used whatever name was passed in, but then picked the appropriate FieldType for that field based on the value of the content. So, for instance, if the input is a number, it would select the appropriate numeric type. If it is a plain text string, it would pick the appropriate text field (you could even add in language detection here). If it is comma separated, it would treat them as keywords, etc. Also, we could likely send in a hint as
[jira] [Commented] (SOLR-3250) Dynamic Field capabilities based on value not name
[ https://issues.apache.org/jira/browse/SOLR-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230231#comment-13230231 ]

Grant Ingersoll commented on SOLR-3250:
---------------------------------------

Note, a core reload is not something I would want to do.

Dynamic Field capabilities based on value not name
--------------------------------------------------

Key: SOLR-3250
URL: https://issues.apache.org/jira/browse/SOLR-3250
Project: Solr
Issue Type: Improvement
Reporter: Grant Ingersoll

In some situations, one already knows the schema of their content, so having to declare a schema in Solr becomes cumbersome. For instance, if you have all your content in JSON (or can easily generate it) or other typed serializations, then you already have a schema defined. It would be nice if we could have support for dynamic fields that used whatever name was passed in, but then picked the appropriate FieldType for that field based on the value of the content. So, for instance, if the input is a number, it would select the appropriate numeric type. If it is a plain text string, it would pick the appropriate text field (you could even add in language detection here). If it is comma separated, it would treat them as keywords, etc. Also, we could likely send in a hint as to the type too.

With this approach, you of course have a first-in-wins situation, but assuming you have this schema defined elsewhere, it is likely fine.

Supporting such cases would allow us to be schemaless when appropriate, while offering the benefits of schemas when appropriate. Naturally, one could mix and match these too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3250) Dynamic Field capabilities based on value not name
[ https://issues.apache.org/jira/browse/SOLR-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230243#comment-13230243 ]

Yonik Seeley commented on SOLR-3250:
------------------------------------

Of course, hopefully everyone knows schemaless is mostly marketing b.s. - when people do this, there is still a schema, but it's guessed on first use (and hence generally a horrible idea for production systems).

It would be easy enough on a single node... but how does one handle a cluster? Say you index price=0 on nodeA, and price=100.0 on nodeB?

A quick thought on how it might work:
- have a separate file auto_fields.json that keeps track of the mappings that would be the same for all cores using that schema
- when we run across a field we haven't seen before, we must guess a type for it, then grab a lock
- update the auto_fields.json
- we can update our in-memory schema with any new fields we find in auto_fields.json
- works the same in ZK mode... it's just the auto_fields.json is in ZK, and we would use something like optimistic locking to update it
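
The optimistic-locking update that Yonik Seeley outlines for auto_fields.json could look roughly like the sketch below. It is a single-process illustration under stated assumptions: an {{AtomicReference}} holding a versioned snapshot stands in for the shared file (or ZooKeeper node), and the names {{AutoFieldsSketch}}, {{Snapshot}}, and {{resolve}} are hypothetical. No ZooKeeper API is used.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: first-in-wins field type registration with optimistic retries.
final class AutoFieldsSketch {
    // Immutable, versioned snapshot standing in for the contents of auto_fields.json.
    static final class Snapshot {
        final int version;
        final Map<String, String> fieldTypes;

        Snapshot(int version, Map<String, String> fieldTypes) {
            this.version = version;
            this.fieldTypes = Collections.unmodifiableMap(new HashMap<>(fieldTypes));
        }
    }

    private final AtomicReference<Snapshot> store =
            new AtomicReference<>(new Snapshot(0, new HashMap<>()));

    // Registers guessedType for the field unless another writer already registered one.
    // Returns the type that ended up in the shared mappings.
    String resolve(String field, String guessedType) {
        while (true) {
            Snapshot current = store.get();
            String existing = current.fieldTypes.get(field);
            if (existing != null) {
                return existing; // someone else guessed first: adopt their type
            }
            Map<String, String> updated = new HashMap<>(current.fieldTypes);
            updated.put(field, guessedType);
            Snapshot next = new Snapshot(current.version + 1, updated);
            if (store.compareAndSet(current, next)) {
                return guessedType; // our write was based on the latest snapshot
            }
            // Another writer changed the snapshot in the meantime: re-read and retry.
        }
    }

    int version() {
        return store.get().version;
    }

    public static void main(String[] args) {
        AutoFieldsSketch shared = new AutoFieldsSketch();
        System.out.println(shared.resolve("price", "long"));   // nodeA sees price=0
        System.out.println(shared.resolve("price", "double")); // nodeB sees price=100.0
    }
}
```

In the price=0 versus price=100.0 scenario, whichever node writes first fixes the type for everyone. The sketch only shows how the nodes converge on one answer; whether that answer is the right type for later values is a separate question it does not address.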