[jira] [Commented] (SOLR-11741) Offline training mode for schema guessing

Abhishek Kumar Singh (JIRA) Fri, 05 Jan 2018 23:27:32 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314428#comment-16314428
 ]


Abhishek Kumar Singh commented on SOLR-11741:
---------------------------------------------

What I was thinking of is something similar to the above implementation, except 
that instead of recording every *value* that ever appeared for a field, I would 
record the distinct *fieldTypes of the values* that appeared for each 
field. This gives a mapping of *field -> supported types*, which needs 
very little storage.
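
To make the idea concrete, here is a minimal sketch of such a recorder. Everything in it is an assumption made for illustration: the class name {{FieldTypeRecorder}}, the three-value {{ValueType}} enum, and the parse-based classification are hypothetical and are not existing Solr APIs. A real version would map to the schema's actual field types.

```java
import java.util.Comparator;
import java.util.EnumSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: remember which value types were seen per field, not the values. */
class FieldTypeRecorder {
    /** Illustrative types only, declared from narrowest to widest. */
    enum ValueType { LONG, DOUBLE, STRING }

    // field name -> distinct types observed so far (bounded by the enum size)
    private final Map<String, EnumSet<ValueType>> seen = new TreeMap<>();

    /** Classify a non-null raw value by trying the narrowest parse first. */
    static ValueType classify(String raw) {
        try {
            Long.parseLong(raw);
            return ValueType.LONG;
        } catch (NumberFormatException notLong) {
            // not a long; try double next
        }
        try {
            // Note: this also accepts forms such as "NaN" or "1e3".
            Double.parseDouble(raw);
            return ValueType.DOUBLE;
        } catch (NumberFormatException notDouble) {
            return ValueType.STRING;
        }
    }

    /** Record the type of one value; the value itself is discarded. */
    void record(String field, String raw) {
        seen.computeIfAbsent(field, f -> EnumSet.noneOf(ValueType.class))
            .add(classify(raw));
    }

    /** Copy of the distinct types seen for a field (empty if never seen). */
    Set<ValueType> typesFor(String field) {
        EnumSet<ValueType> types = seen.get(field);
        return types == null ? EnumSet.noneOf(ValueType.class) : EnumSet.copyOf(types);
    }

    /** Widest type seen for a field, relying on the enum's declaration order. */
    Optional<ValueType> widest(String field) {
        return typesFor(field).stream().max(Comparator.naturalOrder());
    }
}
```

With this sketch, a field that sees "0" and then "0.0" ends up with the set {LONG, DOUBLE}, so a guess made after training can pick the wider type instead of rejecting the second document. Storage per field is bounded by the number of types, no matter how many documents are seen.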

And instead of being held in memory, this data can be stored externally (say, in 
_zookeeper_, or in some _temporary index_ inside Solr). I think that would get rid 
of the following problem.

bq. It doesn't play very nicely with distributed updates (you'd either have to 
ensure all training data was sent to the same node where you send the "commit" 
or add special custom logic to ensure it all got forwarded to a special node) 
and there are probably a lot more sophisticated / smarter ways to do it
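
One reason a shared store could help here: a per-field set of types can be merged by plain set union, which does not depend on which node saw which document or in what order. The sketch below only illustrates that property. The class {{FieldTypeMerge}} and the use of type names as strings are assumptions for the example, and it does not show how the data would actually be written to ZooKeeper or an index.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: combine field -> types mappings collected on different nodes. */
class FieldTypeMerge {
    /** Union the type sets of both inputs, field by field; inputs are not modified. */
    static Map<String, Set<String>> merge(Map<String, Set<String>> a,
                                          Map<String, Set<String>> b) {
        Map<String, Set<String>> out = new TreeMap<>();
        addAll(out, a);
        addAll(out, b);
        return out;
    }

    private static void addAll(Map<String, Set<String>> out,
                               Map<String, Set<String>> part) {
        for (Map.Entry<String, Set<String>> e : part.entrySet()) {
            out.computeIfAbsent(e.getKey(), f -> new TreeSet<>()).addAll(e.getValue());
        }
    }
}
```

Because union is commutative and idempotent, merging node A's mapping into node B's gives the same result as the reverse, and re-sending the same mapping changes nothing.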



> Offline training mode for schema guessing
> -----------------------------------------
>
>                 Key: SOLR-11741
>                 URL: https://issues.apache.org/jira/browse/SOLR-11741
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>
> Our data-driven schema guessing doesn't work in many situations. For 
> example, if the first document has a field with value "0", it is guessed as 
> Long and subsequent documents with "0.0" in that field are rejected. Similarly, 
> if the same field has alphanumeric contents in a later document, those 
> documents are rejected. Also, single- vs. multi-valued field guessing is not ideal.
> Proposing an offline training mode where Solr accepts a bunch of documents and 
> returns a guessed schema (without indexing). This schema can then be used for 
> actual indexing. I think the original idea is from Hoss.
> I think the initial implementation can be based on an UpdateRequestProcessor. We 
> can hash out the API soon, as we go along.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

