pseudomo opened a new pull request #2472:
URL: https://github.com/apache/drill/pull/2472


   # [DRILL-8146](https://issues.apache.org/jira/browse/DRILL-8146): SAS reader fails to read the majority of SAS files
   
   ## Description
   
   Inferring the schema from the types of the values in the first row is not reliable in this case: a field in the first row can be null, or the file can contain no rows at all. Moreover, I see no point in using MinorType.BIGINT (Long), since SAS stores all numbers as Double anyway. In fact, as it turned out, SAS stores every value as either VARCHAR or DOUBLE.
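
To illustrate the two failure modes, here is a minimal sketch of first-row inference. It is not the actual reader code: `FirstRowInferenceSketch` and `inferFromFirstRow` are hypothetical names, and rows are modeled as plain `Object[]` arrays purely for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for illustration only; not Drill's SAS reader.
final class FirstRowInferenceSketch {

  // Guesses a column's Java type by looking at the first row only.
  static Optional<Class<?>> inferFromFirstRow(List<Object[]> rows, int column) {
    if (rows.isEmpty()) {
      return Optional.empty();   // 0 rows: there is nothing to inspect
    }
    Object value = rows.get(0)[column];
    if (value == null) {
      return Optional.empty();   // null cell: the type stays unknown
    }
    return Optional.of(value.getClass());
  }

  public static void main(String[] args) {
    List<Object[]> noRows = Collections.emptyList();
    List<Object[]> nullFirst =
        Collections.singletonList(new Object[] {null, "abc"});

    // Both of these leave the column without a type.
    System.out.println(inferFromFirstRow(noRows, 0).isPresent());
    System.out.println(inferFromFirstRow(nullFirst, 0).isPresent());
    // Only a non-null first value yields a type.
    System.out.println(inferFromFirstRow(nullFirst, 1).isPresent());
  }
}
```

In both problematic cases the schema for that column cannot be built from the data alone, which is why the column metadata is a better source.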
   
   My proposal is to determine the MinorType from the SAS column type (VARCHAR/DOUBLE) together with the column format. This way we don't need the first row at all.
   As you can see below, we can take advantage of the dictionaries of all possible date/time formats that are already defined in the parso library to distinguish between TIME, DATE and TIMESTAMP.
   ```
           if (DateTimeConstants.TIME_FORMAT_STRINGS.contains(columnFormat.getName())) {
             type = MinorType.TIME;
           } else if (DateTimeConstants.DATE_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
             type = MinorType.DATE;
           } else if (DateTimeConstants.DATETIME_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
             type = MinorType.TIMESTAMP;
           }
   ```
   All other fields that are not date/time values can be recognized as either String or Double.
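
The whole decision can be sketched as a single function of column type and format name. This is a self-contained illustration under stated assumptions, not the code in this PR: the `MinorType` and `SasStorage` enums are local stand-ins (the real code uses Drill's `MinorType` and parso's column metadata), and the three format sets are small hypothetical subsets standing in for parso's `DateTimeConstants` dictionaries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; all types below are local stand-ins.
final class SasTypeSketch {

  // Stand-in mirroring a few of Drill's MinorType names.
  enum MinorType { VARCHAR, FLOAT8, DATE, TIME, TIMESTAMP }

  // Stand-in for the two storage kinds SAS uses.
  enum SasStorage { CHARACTER, NUMERIC }

  // Hypothetical subsets of SAS format names; the real lists live in parso.
  private static final Set<String> TIME_FORMATS =
      new HashSet<>(Arrays.asList("TIME", "HHMM"));
  private static final Set<String> DATE_FORMATS =
      new HashSet<>(Arrays.asList("DATE", "MMDDYY", "YYMMDD"));
  private static final Set<String> DATETIME_FORMATS =
      new HashSet<>(Arrays.asList("DATETIME"));

  // Resolves the column type from metadata only; no data row is read.
  static MinorType resolveType(SasStorage storage, String formatName) {
    if (storage == SasStorage.CHARACTER) {
      return MinorType.VARCHAR;
    }
    if (formatName != null) {
      if (TIME_FORMATS.contains(formatName)) {
        return MinorType.TIME;
      } else if (DATE_FORMATS.contains(formatName)) {
        return MinorType.DATE;
      } else if (DATETIME_FORMATS.contains(formatName)) {
        return MinorType.TIMESTAMP;
      }
    }
    // Any other numeric column is a plain double.
    return MinorType.FLOAT8;
  }

  public static void main(String[] args) {
    System.out.println(resolveType(SasStorage.CHARACTER, ""));
    System.out.println(resolveType(SasStorage.NUMERIC, "DATETIME"));
    System.out.println(resolveType(SasStorage.NUMERIC, ""));
  }
}
```

Because the result depends only on column metadata, an empty file or a null in the first row no longer affects the schema.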
   
   ## Documentation
   
   ## Testing
   I tested it on 160 real-world SAS files and 90 synthetic SAS files. In all cases the result was OK.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


