[ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454698#comment-17454698
 ] 

Charles Givre commented on DRILL-8071:
--------------------------------------

[~pj.fanning] One thing to think about is inconsistent data.  For Drill to work 
well with data, it needs to be able to identify a data type per column that 
doesn't change.  When I wrote the Excel reader, my goal was to use the closest 
possible data type to the data, but also provide fallbacks in the case of 
inconsistent data.  

For instance, what do you do if the first row of `foo` is a string and the 
second is a number of sorts?  The way I set it up was to only recognize 
strings, doubles and timestamps.  But... I also provided the `allTextMode` 
which reads everything as text.  That way if the user has really bad data, they 
can still get Drill to read it.

Another option to throw out there is to use Drill's VARDECIMAL data type for 
numbers if scale and precision are available in the Excel reader.  That would 
ensure that the numbers are represented correctly. 

I do think this is a good idea, with the caveat that it may be more work than 
you realize. 

> format-excel should use POI DataFormatter
> -----------------------------------------
>
>                 Key: DRILL-8071
>                 URL: https://issues.apache.org/jira/browse/DRILL-8071
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Data Types
>    Affects Versions: 1.19.0
>            Reporter: PJ Fanning
>            Priority: Major
>
> The existing ExcelBatchReader uses the raw data values from the cells. This 
> raw data ignores formatting set on the cells. As an example, numbers and 
> dates are stored as doubles. With the POI DataFormatter, you can get the cell 
> style applied so that the data will appear as it does when you view the data 
> in Excel itself.
> [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]
>  
> A big number like 123456789.987654 could be stored as double that is more 
> like 123456789.9876539999999 when represented in decimal format (because this 
> might be the closest match that double can represent). The cell format could 
> say that cell has 6 decimal places after the decimal point so the formatter 
> would round the number back to the value that it displayed in Excel as.
> Even if you choose not to use the DataFormatter, you have unprotected calls 
> to `cell.getNumericCellValue()` and that could easily throw an exception (if 
> the data is not stored a number). Even `cell.getStringCellValue()` can throw 
> an exception - for similar reasons.
>  
> There is also custom code for handling the conversion of the raw numbers 
> representing dates/timestamps but this also seems like a bad idea. The Cell 
> class has getLocalDateTimeCellValue and this has the right logic for 
> converting 1904 and 1900 based dates - yes, Excel uses 2 different formats.
> Code that processes excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to