Manish Gupta created CARBONDATA-542:
---------------------------------------

             Summary: Parsing values for measures and dimensions during data 
load should adopt a strict check
                 Key: CARBONDATA-542
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-542
             Project: CarbonData
          Issue Type: Improvement
            Reporter: Manish Gupta
            Assignee: Manish Gupta
            Priority: Minor
             Fix For: 1.0.0-incubating


Currently CarbonData treats Short and Int values as long. When the values are 
stored in carbon data files, delta compression is used, which compresses the 
data based on the min and max values of the column.
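
To make the dependence on min and max concrete, here is a minimal sketch of 
how many bits a delta-encoded value needs when each value is stored as its 
offset from the column minimum. The class and method names are hypothetical, 
and this is only an illustration of the general idea, not CarbonData's actual 
compression code.

```java
// Illustrative only: a generic min/max delta-width calculation,
// not the codec CarbonData actually uses.
final class DeltaWidthSketch {

    /**
     * Bits needed to store (value - min) for any value in [min, max].
     * Assumes max >= min and that max - min does not overflow a long.
     */
    static int bitsForRange(long min, long max) {
        long range = max - min;
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }

    public static void main(String[] args) {
        // A column whose values really fit in a Short.
        System.out.println(bitsForRange(0, Short.MAX_VALUE));
        // The same column after one out-of-range value was loaded.
        System.out.println(bitsForRange(0, 3_000_000_000L));
    }
}
```

Under these assumptions, a single out-of-range value widens the range and so 
increases the number of bits needed for every value in the column.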

While parsing values of these data types, we use the Double parser and 
extract the long value from the result, as in the snippet below:
Double.valueOf(msrValue).longValue()

This has the following problems.

1. Measure values beyond the range of Int and Short are parsed successfully. 
This conflicts with the behavior when the same column is listed in 
dictionary_include and becomes a dimension. At query time each dimension value 
is parsed according to its data type for result conversion; an out-of-range 
value then throws a NumberFormatException and null is displayed in the result, 
whereas for a measure the loaded value is displayed. This also impacts 
aggregate queries. That is why a strict check is adopted when parsing 
dimension values.

2. Data inconsistency for measures: for decimal values, only the part before 
the decimal point is kept for Int and Short data types.

3. For measures, if values beyond the data type range are allowed, the 
min/max range of the column widens and compression becomes less effective.
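
The first two problems can be reproduced with plain JDK calls. The sketch 
below wraps the expression quoted in this issue in a hypothetical helper, 
lenientParse, and compares it with Short.parseShort, which stands in here for 
a type-specific parser; the class and helper names are assumptions made for 
illustration.

```java
// Illustrative only: contrasts the lenient Double-based parse from this
// issue with the JDK's strict Short parser.
final class LenientParseDemo {

    /** The expression quoted in the issue, wrapped in a helper. */
    static long lenientParse(String msrValue) {
        return Double.valueOf(msrValue).longValue();
    }

    /** True if the JDK's strict Short parser accepts the value. */
    static boolean strictShortParses(String value) {
        try {
            Short.parseShort(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Out of range for Short, yet the lenient parse accepts it.
        System.out.println(lenientParse("40000"));
        // The fractional part is silently dropped.
        System.out.println(lenientParse("12.7"));
        // The strict parser rejects both inputs.
        System.out.println(strictShortParses("40000"));
        System.out.println(strictShortParses("12.7"));
    }
}
```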

Therefore we need to adopt strict parsing for both dimensions and measures.
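
One possible shape for such a strict check is sketched below. The class, enum, 
and method names are hypothetical, and returning null for a rejected value is 
an assumption made for illustration; this issue does not specify how the 
actual fix handles bad values.

```java
// Illustrative only: a strict integral parser with a per-type range check.
final class StrictIntegralParser {

    /** Hypothetical type tags carrying the legal range of each data type. */
    enum IntegralType {
        SHORT(Short.MIN_VALUE, Short.MAX_VALUE),
        INT(Integer.MIN_VALUE, Integer.MAX_VALUE),
        LONG(Long.MIN_VALUE, Long.MAX_VALUE);

        final long min;
        final long max;

        IntegralType(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Parses raw as a whole number and checks it against the type's range.
     * Returns null (an assumed convention) for decimals, non-numeric text,
     * and out-of-range values.
     */
    static Long parseStrict(String raw, IntegralType type) {
        if (raw == null) {
            return null;
        }
        long value;
        try {
            // Long.parseLong rejects "12.7" and "abc" outright.
            value = Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return null;
        }
        if (value < type.min || value > type.max) {
            return null;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("123", IntegralType.SHORT));
        System.out.println(parseStrict("40000", IntegralType.SHORT));
        System.out.println(parseStrict("12.7", IntegralType.INT));
    }
}
```

Using the same routine for measures and dimensions would keep load-time and 
query-time parsing consistent for a column regardless of how it is declared.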



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
