[jira] [Comment Edited] (SOLR-11741) Offline training mode for schema guessing

Abhishek Kumar Singh (JIRA) Sun, 07 Jan 2018 13:03:12 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315465#comment-16315465
 ]


Abhishek Kumar Singh edited comment on SOLR-11741 at 1/7/18 9:02 PM:
---------------------------------------------------------------------

The above approach can be optimised by replacing the *Supported FieldTypes* with
*_BitSets_*, as shown in the following table:
!screenshot-1.png!

We can map every FieldType to a BitSet. For example, *String will be 10000*, *Long
will be 00100*, and so on.

1. For every product, get the BitSet of the FieldType supported by each
field.
2. For every field, compute the *_BITWISE OR_* of the current BitSet with the
BitSet value already recorded, and store the result in its place.
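
The two steps above can be sketched roughly as follows. This is only an illustration, not the patch attached to this issue: the class name {{FieldTypeAccumulator}}, its methods, and the parse-based type detection are all hypothetical. The bit values for String and Long are the ones given above, and Double uses 01000 as in the *price* example in this comment. A plain int stands in for the BitSet.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keeps one bit mask per field and ORs in every new observation.
class FieldTypeAccumulator {
    // Bit values as quoted in this comment; all other types are left out here.
    static final int STRING = 0b10000;
    static final int DOUBLE = 0b01000;
    static final int LONG   = 0b00100;

    private final Map<String, Integer> masks = new HashMap<>();

    // Step 1 (assumed detection logic): narrowest type that parses the non-null value.
    static int maskFor(String value) {
        try {
            Long.parseLong(value);
            return LONG;
        } catch (NumberFormatException e) {
            // not a Long, try Double next
        }
        try {
            Double.parseDouble(value);
            return DOUBLE;
        } catch (NumberFormatException e) {
            // not a Double either, fall back to String
        }
        return STRING;
    }

    // Step 2: bitwise OR of the new mask with the recorded one, stored in place.
    void observe(String field, String value) {
        masks.merge(field, maskFor(value), (a, b) -> a | b);
    }

    // Returns 0 for a field that has not been seen yet.
    int maskOf(String field) {
        return masks.getOrDefault(field, 0);
    }
}
```

Note that {{Double.parseDouble}} also accepts inputs such as "NaN" or "1e5", so a real implementation would probably need stricter detection.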

Use the following rule to decide the final FieldType that the field should 
have. 
!RuleForMostAccomodatingField.png!

Say a field called *price* has the following values:
In Product1 -> *12321  (Long, i.e. 00100)*
In Product2 -> *77261.66  (Double, i.e. 01000)*
The supported BitSet for *price* will have a final value of *[ 00100 OR 01000 =
01100 ]*, i.e. it should be assigned a Double.
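
One way to read the *price* example is that the highest set bit decides the type. The sketch below assumes exactly that; the actual rule is the one in the attached image, which may differ. The class name {{MostAccommodatingType}} and the "unknown"/"unmapped" results are hypothetical, and only the three bit values quoted in this comment are mapped.

```java
// Hypothetical sketch: resolves a field's mask to a type, assuming the highest set bit wins.
class MostAccommodatingType {
    static final int STRING = 0b10000;
    static final int DOUBLE = 0b01000;
    static final int LONG   = 0b00100;

    static String resolve(int mask) {
        if (mask == 0) {
            return "unknown"; // no value seen for this field yet
        }
        // Isolate the most significant set bit of the accumulated mask.
        int top = Integer.highestOneBit(mask);
        switch (top) {
            case STRING: return "String";
            case DOUBLE: return "Double";
            case LONG:   return "Long";
            default:     return "unmapped"; // bits whose types are not quoted in this comment
        }
    }
}
```

Under this assumption, {{resolve(0b00100 | 0b01000)}} gives Double, which matches the *price* example.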

The above rule can be extended to any number of types; only the number of bits
will increase accordingly.

Using BitSets as above will decrease the storage space to 1 byte per field
(enough for up to 8 types), will make the computation easier and faster, and
will also remove the overhead of computing the trained schema separately, as
the BitSets will be updated in place with every Product.

Every API call asking for the *Trained Schema* will get the schema calculated
up to that point using the above rule.


was (Author: abhidemon):
The above approach can be optimised by replacing the *Supported FieldTypes* by  
*_BitSets_* , 
As shown in the following table:-
!screenshot-1.png!

We can map every FieldType to a BitSet. For eg. *String will be 10000* , *Long 
will be 00100* and so on..  

1. Now For every product, get the BitSet of the fieldType supported by each 
field 
2.  For every field, Find the *_BITWISE OR_* of the current BitSet with the 
BitSet value already recorded, and replace it.

Use the following rule to decide the final FieldType that the field should 
have. 
!screenshot-3.png!

Say if a field called *price* has values as following values: 
In Product1 -> *12321  (Long, i.e. 00100)*
In Product2 -> *77261.66  (Double, i.e. 01000)* 
The supported BitSet for *price* will have a final value of *[ 00100 OR 01000 = 
01100 ]* , i.e. It should be assigned a Double. 

The above rule can be extended to any number of types, just the number of bits 
will increase accordingly. 

Using BitSets like above will decrease the storage space to 1 byte per field, 
will make the computation easier and faster, and will also remove the overhead 
of computing the trained schema separately, as they will be updated in-place 
with every Product.

Every api call to ask for *Trained Schema*,  will get the schema calculated 
till that point using the above rule. 

> Offline training mode for schema guessing
> -----------------------------------------
>
>                 Key: SOLR-11741
>                 URL: https://issues.apache.org/jira/browse/SOLR-11741
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>         Attachments: RuleForMostAccomodatingField.png, SOLR-11741-temp.patch, 
> screenshot-1.png, screenshot-3.png
>
>
> Our data driven schema guessing doesn't work under many situations. For 
> example, if the first document has a field with value "0", it is guessed as 
> Long and subsequent fields with "0.0" are rejected. Similarly, if the same 
> field had alphanumeric contents for a later document, those documents are
> rejected. Also, single vs. multi valued field guessing is not ideal.
> Proposing an offline training mode where Solr accepts a bunch of documents and
> returns a guessed schema (without indexing). This schema can then be used for 
> actual indexing. I think the original idea is from Hoss.
> I think initial implementation can be based on an UpdateRequestProcessor. We 
> can hash out the API soon, as we go along.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

