[jira] [Updated] (DRILL-8071) format-excel should use POI DataFormatter

PJ Fanning (Jira) Tue, 07 Dec 2021 04:03:05 -0800


     [ 
https://issues.apache.org/jira/browse/DRILL-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


PJ Fanning updated DRILL-8071:
------------------------------
    Description: 
The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as double that is more like 
123456789.9876539999999 when represented in decimal format (because this might 
be the closest match that double can represent). The cell format could say that 
cell has 6 decimal places after the decimal point so the formatter would round 
the number back to the value that it displayed in Excel as.

Even if you choose not to use the DataFormatter, you have unprotected calls to 
`cell.getNumericCellValue()` and that could easily throw an exception (if the 
data is not stored a number). Even `cell.getStringCellValue()` can throw an 
exception - for similar reasons.

 

There is also custom code for handling the conversion of the raw numbers 
representing dates/timestamps but this also seems like a bad idea. The Cell 
class has getLocalDateTimeCellValue and this has the right logic for converting 
1904 and 1900 based dates - yes, Excel uses 2 different formats.

Code that processes excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 

  was:
The existing ExcelBatchReader uses the raw data values from the cells. This raw 
data ignores formatting set on the cells. As an example, numbers and dates are 
stored as doubles. With the POI DataFormatter, you can get the cell style 
applied so that the data will appear as it does when you view the data in Excel 
itself.

[https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]

 

A big number like 123456789.987654 could be stored as double that is more like 
123456789.9876539999999 when represented in decimal format (because this might 
be the closest match that double can represent). The cell format could say that 
cell has 6 decimal places after the decimal point so the formatter would round 
the number back to the value that it displayed in Excel as.

Even if you choose not to use the DataFormatter, you have unprotected calls to 
`cell.getNumericCellValue()` and that could easily throw an exception (if the 
data is not stored a number). Even `cell.getStringCellValue()` can throw an 
exception - for similar reasons.

 

Code that processes excel files is a real pain to get right because the 
Microsoft storage format is really bad.

 


> format-excel should use POI DataFormatter
> -----------------------------------------
>
>                 Key: DRILL-8071
>                 URL: https://issues.apache.org/jira/browse/DRILL-8071
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Data Types
>            Reporter: PJ Fanning
>            Priority: Major
>
> The existing ExcelBatchReader uses the raw data values from the cells. This 
> raw data ignores formatting set on the cells. As an example, numbers and 
> dates are stored as doubles. With the POI DataFormatter, you can get the cell 
> style applied so that the data will appear as it does when you view the data 
> in Excel itself.
> [https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/DataFormatter.html#formatCellValue-org.apache.poi.ss.usermodel.Cell-]
>  
> A big number like 123456789.987654 could be stored as double that is more 
> like 123456789.9876539999999 when represented in decimal format (because this 
> might be the closest match that double can represent). The cell format could 
> say that cell has 6 decimal places after the decimal point so the formatter 
> would round the number back to the value that it displayed in Excel as.
> Even if you choose not to use the DataFormatter, you have unprotected calls 
> to `cell.getNumericCellValue()` and that could easily throw an exception (if 
> the data is not stored a number). Even `cell.getStringCellValue()` can throw 
> an exception - for similar reasons.
>  
> There is also custom code for handling the conversion of the raw numbers 
> representing dates/timestamps but this also seems like a bad idea. The Cell 
> class has getLocalDateTimeCellValue and this has the right logic for 
> converting 1904 and 1900 based dates - yes, Excel uses 2 different formats.
> Code that processes excel files is a real pain to get right because the 
> Microsoft storage format is really bad.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Updated] (DRILL-8071) format-excel should use POI DataFormatter

Reply via email to