Manish Gupta created CARBONDATA-542:
---------------------------------------
Summary: Parsing values for measures and dimensions during data
load should adopt a strict check
Key: CARBONDATA-542
URL: https://issues.apache.org/jira/browse/CARBONDATA-542
Project: CarbonData
Issue Type: Improvement
Reporter: Manish Gupta
Assignee: Manish Gupta
Priority: Minor
Fix For: 1.0.0-incubating
Currently CarbonData treats Short and Int as long. When the values are stored
in carbon data files, delta compression is applied, which compresses the data
based on the min and max values of the column.
To parse values of these datatypes, we use the Double parser and extract the
long value from the result, as in the snippet below.
Double.valueOf(msrValue).longValue()
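For illustration, the behavior of this expression can be reproduced in plain
Java. This is only a sketch: the class and method names (LenientMeasureParse,
parseLenient) are hypothetical and are not taken from the CarbonData code
base; only the expression itself comes from the snippet above.

```java
// Hypothetical wrapper around the expression quoted in this issue.
final class LenientMeasureParse {

    // Parses the text as a double, then narrows it to long.
    // No range check against Short or Int is performed.
    static long parseLenient(String msrValue) {
        return Double.valueOf(msrValue).longValue();
    }

    public static void main(String[] args) {
        // Beyond Short.MAX_VALUE (32767), still accepted.
        System.out.println(parseLenient("70000"));
        // Beyond Integer.MAX_VALUE (2147483647), still accepted.
        System.out.println(parseLenient("3000000000"));
        // The fractional part is silently dropped.
        System.out.println(parseLenient("12.75"));
    }
}
```

The three calls return 70000, 3000000000 and 12 respectively, so none of
these inputs is rejected for a Short or Int column.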
This has the following problems.
1. Measure values beyond the range of Int and Short are parsed successfully.
This conflicts with the behavior seen when the same measure is listed in
dictionary_include and becomes a dimension. At query time each dimension value
is parsed according to its datatype for result conversion; at that point a
NumberFormatException is thrown and null is displayed in the result, whereas
for a measure the loaded value is displayed. This also impacts aggregate
queries. That is why a strict check mechanism is adopted for parsing dimension
values.
2. Data inconsistency for measures: when the input is a decimal value, only the
part before the decimal point is kept for Int and Short datatypes.
3. For measures, if values beyond the datatype range are allowed, the
compression achieved decreases, because the delta compression depends on the
min and max values of the column.
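The effect in point 3 can be sketched with a simplified model of delta
encoding. This is an assumption made for illustration, not CarbonData's actual
compressor: each value is assumed to be stored as (value - min), using as many
bits as the largest delta (max - min) needs. The names DeltaWidthSketch and
bitsPerValue are hypothetical.

```java
// Simplified model of min/max based delta encoding (illustrative only).
final class DeltaWidthSketch {

    // Returns the number of bits needed to store (value - min) for every
    // value in a non-empty column. Assumes max - min does not overflow long.
    static int bitsPerValue(long[] column) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long range = max - min;
        // A range of 0 still needs one bit per value in this model.
        return range == 0 ? 1 : 64 - Long.numberOfLeadingZeros(range);
    }

    public static void main(String[] args) {
        // All values fit in a Short: the largest delta is 31900.
        System.out.println(bitsPerValue(new long[] {100L, 250L, 32000L}));
        // One value beyond the Int range widens every stored delta.
        System.out.println(bitsPerValue(new long[] {100L, 250L, 32000L, 3000000000L}));
    }
}
```

Under this model the first column needs 15 bits per value and the second needs
32, so a single out-of-range value roughly doubles the storage for the whole
column.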
Therefore we will have to adopt a strict behavior for both dimensions and
measures.
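One possible shape of such a strict check is sketched below using the
standard Java parsers, which reject both out-of-range and decimal input. The
class and method names are hypothetical, and returning null for an invalid
value is an assumption made here for illustration; this issue does not state
how a rejected value should be handled.

```java
// Hypothetical strict parsers for Short and Int columns (illustrative only).
final class StrictParseSketch {

    // Returns the value widened to long, or null if the text is not a
    // valid Short (out of range, decimal, non-numeric or missing).
    static Long parseStrictShort(String value) {
        if (value == null) {
            return null;
        }
        try {
            return (long) Short.parseShort(value.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: invalid input is treated as null
        }
    }

    // Same contract as parseStrictShort, for the Int range.
    static Long parseStrictInt(String value) {
        if (value == null) {
            return null;
        }
        try {
            return (long) Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return null; // assumption: invalid input is treated as null
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStrictShort("123"));      // valid Short
        System.out.println(parseStrictShort("70000"));    // out of Short range
        System.out.println(parseStrictInt("12.75"));      // decimal input
        System.out.println(parseStrictInt("3000000000")); // out of Int range
    }
}
```

With parsers of this kind, a value that fails for a dimension also fails for a
measure, so both paths would produce the same result for the same input.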
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)