[ 
https://issues.apache.org/jira/browse/DRILL-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496357#comment-17496357
 ] 

ASF GitHub Bot commented on DRILL-8146:
---------------------------------------

pseudomo opened a new pull request #2472:
URL: https://github.com/apache/drill/pull/2472


   # [DRILL-8146](https://issues.apache.org/jira/browse/DRILL-8146): SAS reader 
fails to read the majority of sas files
   
   ## Description
   
   Inferring the schema from the types of the values in the first row does not 
work here, because a field in the first row can be null or the file can have no 
rows at all. Moreover, I see no point in using MinorType.BIGINT (Long), since 
SAS stores all numbers as Double anyway. In fact, as it turned out, SAS stores 
all data as either VARCHAR or DOUBLE.
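
A minimal sketch of why a single DOUBLE type is enough, assuming the reader 
receives each numeric cell as a `java.lang.Number` (Long or Double) or as null. 
`SasNumbers` and `toDouble` are hypothetical names used only for illustration; 
they are not part of Drill or parso.

```java
final class SasNumbers {
    private SasNumbers() {}

    // Widens any numeric cell to Double; a missing (null) cell stays null.
    // Assumption: numeric cells arrive as Long or Double, never as String.
    static Double toDouble(Object cell) {
        if (cell == null) {
            return null;
        }
        if (cell instanceof Number) {
            return ((Number) cell).doubleValue();
        }
        throw new IllegalArgumentException(
            "Not a numeric cell: " + cell.getClass().getName());
    }
}
```

With a helper like this, a column that mixes Long and Double values, or that 
contains nulls, can be written to one nullable DOUBLE vector without inspecting 
the first row.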
   
   My proposal is to derive the MinorType from the SAS column type 
(VARCHAR/DOUBLE) together with the column format. That way the first row is not 
needed at all. 
   As the snippet below shows, the dictionaries of all possible Date/Time 
formats that the parso library already defines can be used to distinguish 
between TIME, DATE and TIMESTAMP. 
   ```
           if (DateTimeConstants.TIME_FORMAT_STRINGS.contains(columnFormat.getName())) {
             type = MinorType.TIME;
           } else if (DateTimeConstants.DATE_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
             type = MinorType.DATE;
           } else if (DateTimeConstants.DATETIME_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
             type = MinorType.TIMESTAMP;
           }
   ```
   All other fields, i.e. those without a Date/Time format, are recognized as 
either String or Double.
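
The whole decision can be sketched as one self-contained function. This is an 
illustration under stated assumptions, not the code in the pull request: 
`SasTypeSketch`, `SasKind`, `DrillType` and `infer` are hypothetical names, and 
the three format sets are small illustrative subsets of SAS format names that 
stand in for the parso dictionaries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SasTypeSketch {
    private SasTypeSketch() {}

    // Hypothetical stand-ins for the SAS storage type and Drill's MinorType.
    enum SasKind { CHARACTER, NUMERIC }
    enum DrillType { VARCHAR, FLOAT8, DATE, TIME, TIMESTAMP }

    // Illustrative subsets only; the real lists are much longer.
    static final Set<String> TIME_FORMATS =
        new HashSet<>(Arrays.asList("TIME", "HHMM"));
    static final Set<String> DATE_FORMATS =
        new HashSet<>(Arrays.asList("DATE", "MMDDYY", "YYMMDD"));
    static final Set<String> DATETIME_FORMATS =
        new HashSet<>(Arrays.asList("DATETIME"));

    // Picks the type from column metadata only; no row data is read.
    // Assumption: Date/Time formats are only meaningful on numeric columns.
    static DrillType infer(SasKind kind, String formatName) {
        if (kind == SasKind.CHARACTER) {
            return DrillType.VARCHAR;
        }
        if (TIME_FORMATS.contains(formatName)) {
            return DrillType.TIME;
        }
        if (DATE_FORMATS.contains(formatName)) {
            return DrillType.DATE;
        }
        if (DATETIME_FORMATS.contains(formatName)) {
            return DrillType.TIMESTAMP;
        }
        // Numeric column with no (or a non Date/Time) format.
        return DrillType.FLOAT8;
    }
}
```

Because the result depends only on column metadata, the same type is produced 
for a file with 0 rows and for a file whose first row contains nulls.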
   
   ## Documentation
   
   ## Testing
   I tested it on 160 real-world SAS files and 90 synthetic SAS files. In all 
cases the result was OK.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



> SAS reader fails to read the majority of sas files
> --------------------------------------------------
>
>                 Key: DRILL-8146
>                 URL: https://issues.apache.org/jira/browse/DRILL-8146
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>            Reporter: pseudomo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.20.0
>
>
> SAS reader fails to read the majority of real-world SAS files.
> The reader throws NPEs if:
>  * SAS file has 0 rows
>  * Date column value is null
>  * The type of value is Number
>  * Long and Double are mixed together in one column (for some reason, if the 
> fractional part of a number is zero, the parso library converts it to Long)
> Schema inference issue:
>  * All Date values are converted to LocalDate, but SAS also supports DateTime 
> (timestamps), so the time part is dropped
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
