[
https://issues.apache.org/jira/browse/DRILL-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496357#comment-17496357
]
ASF GitHub Bot commented on DRILL-8146:
---------------------------------------
pseudomo opened a new pull request #2472:
URL: https://github.com/apache/drill/pull/2472
# [DRILL-8146](https://issues.apache.org/jira/browse/DRILL-8146): SAS reader fails to read the majority of sas files
## Description
Inferring the schema from the types of the values in the first row does not
work well here, because a field in the first row can be null or the file can
contain no rows at all. Moreover, I see no point in using MinorType.BIGINT
(Long) at all, since SAS stores all numbers as Double anyway. In fact, as it
turned out, SAS stores all data as either VARCHAR or DOUBLE.
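To illustrate the numeric point, here is a minimal sketch of normalizing whatever boxed number the reader hands back to a double. It assumes values arrive as `Object` and may be `Long`, `Double` or null, as described in the issue. The class and method names (`SasNumbers`, `toDouble`) are hypothetical and are not taken from the Drill or parso code.

```java
// Hypothetical helper, not part of Drill or parso.
final class SasNumbers {
  private SasNumbers() {}

  // Widens any numeric value to Double; a missing (null) value stays null.
  static Double toDouble(Object value) {
    if (value == null) {
      return null; // missing value in the SAS file
    }
    if (value instanceof Number) {
      // Covers both Long and Double, so mixed columns end up with one type.
      return ((Number) value).doubleValue();
    }
    throw new IllegalArgumentException("Not a numeric value: " + value.getClass());
  }

  public static void main(String[] args) {
    System.out.println(toDouble(3L));   // a Long with no fractional part
    System.out.println(toDouble(2.5));  // a regular Double
    System.out.println(toDouble(null)); // a missing value
  }
}
```

With this kind of widening, a column that mixes Long and Double values can be written as a single DOUBLE column.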
My proposal is to determine the MinorType from the SAS column type
(VARCHAR/DOUBLE) together with the column format. That way we don't need to
read the first row at all.
As you can see below, we can use the dictionaries of all possible Date/Time
formats that are already defined in the parso library to distinguish between
TIME, DATE and TIMESTAMP.
```
if (DateTimeConstants.TIME_FORMAT_STRINGS.contains(columnFormat.getName())) {
  type = MinorType.TIME;
} else if (DateTimeConstants.DATE_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
  type = MinorType.DATE;
} else if (DateTimeConstants.DATETIME_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
  type = MinorType.TIMESTAMP;
}
```
All other fields that are not Date/Time can be recognized either as String
or Double.
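The whole mapping can be sketched in a self-contained form as follows. This is only an illustration under several assumptions: the nested `MinorType` enum is a stand-in for Drill's type enum, the three format sets are small illustrative subsets rather than parso's actual `DateTimeConstants` dictionaries, and the `infer` signature is hypothetical. It also assumes that character columns are never treated as Date/Time, which may differ from the code in the pull request.

```java
import java.util.Set;

// Hypothetical sketch; compiles without Drill or parso on the classpath.
final class SasTypeInference {
  // Stand-in for Drill's MinorType (FLOAT8 is Drill's name for a double).
  enum MinorType { VARCHAR, FLOAT8, DATE, TIME, TIMESTAMP }

  // Illustrative subsets only; the full lists are defined in parso.
  static final Set<String> TIME_FORMATS = Set.of("TIME", "HHMM");
  static final Set<String> DATE_FORMATS = Set.of("DATE", "MMDDYY", "YYMMDD");
  static final Set<String> DATETIME_FORMATS = Set.of("DATETIME");

  private SasTypeInference() {}

  // Decides the type from column metadata alone; no row data is read.
  static MinorType infer(boolean numericColumn, String formatName) {
    if (!numericColumn) {
      return MinorType.VARCHAR; // assumption: character columns are always strings
    }
    if (formatName == null) {
      return MinorType.FLOAT8; // Set.of(...).contains(null) would throw
    }
    if (TIME_FORMATS.contains(formatName)) {
      return MinorType.TIME;
    } else if (DATE_FORMATS.contains(formatName)) {
      return MinorType.DATE;
    } else if (DATETIME_FORMATS.contains(formatName)) {
      return MinorType.TIMESTAMP;
    }
    return MinorType.FLOAT8; // any other numeric format is a plain number
  }

  public static void main(String[] args) {
    System.out.println(infer(true, "DATETIME")); // TIMESTAMP
    System.out.println(infer(true, ""));         // FLOAT8
    System.out.println(infer(false, "DATE"));    // VARCHAR
  }
}
```

Because the decision uses only column metadata, it gives the same answer for a file with 0 rows or with nulls in the first row.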
## Documentation
## Testing
I tested it on 160 real-world SAS files and 90 synthetic SAS files. In all
cases the result was OK.
> SAS reader fails to read the majority of sas files
> --------------------------------------------------
>
> Key: DRILL-8146
> URL: https://issues.apache.org/jira/browse/DRILL-8146
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Reporter: pseudomo
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.20.0
>
>
> SAS reader fails to read the majority of real world sas files.
> The reader throws NPEs if:
> * SAS file has 0 rows
> * Date column value is null
> * The type of value is Number
> * Long and Double are mixed together in one column (for some reason, if the
> fractional part of a number is zero, the parso library converts it to Long)
> Schema inference issue:
> * All Date values are converted to LocalDate, but SAS also supports DateTime
> (timestamps), so the time part is dropped
>