[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset

Nathan Beyer (JIRA) Fri, 16 Sep 2016 08:00:50 -0700

    [ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496505#comment-15496505 ]


Nathan Beyer commented on SPARK-17545:
--------------------------------------

I agree, the data is quirky and it's not how I'd personally serialize it, 
but it is valid ISO 8601, even if that form is discouraged. Also, the 
code already has precedent for dealing with "quirks":

{code}
    val indexOfGMT = s.indexOf("GMT")
    if (indexOfGMT != -1) {
      // ISO8601 with a weird time zone specifier (2000-01-01T00:00GMT+01:00)
      val s0 = s.substring(0, indexOfGMT)
      val s1 = s.substring(indexOfGMT + 3)
      // Mapped to 2000-01-01T00:00+01:00
      stringToTime(s0 + s1)
    }
{code}
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L126

If the converters are going to handle some quirks, why wouldn't they handle 
this one? If no quirks are going to be handled, then I would suggest that the 
documentation explicitly state that only the W3C note's profile of ISO 8601 
is supported.
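
To illustrate, a colon-less offset could be normalized in the same spirit as the GMT case above before the string reaches the strict parser. The following is only a minimal sketch, not Spark code: the names `OffsetQuirk` and `normalizeOffset` are hypothetical, and it uses `java.time.OffsetDateTime` for the final parse as an assumption, rather than the `DatatypeConverter` call that `stringToTime` uses.

```scala
import java.time.OffsetDateTime

// Hypothetical helper, not part of Spark: rewrites a trailing basic-format
// offset (-0500) into the extended form (-05:00) that strict parsers expect.
object OffsetQuirk {
  // Extended-format date/time followed by a four-digit offset without a colon.
  private val BasicOffset =
    """^(.*T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?)([+-]\d{2})(\d{2})$""".r

  def normalizeOffset(s: String): String = s match {
    // Mapped from 2015-07-20T15:09:23.736-0500 to 2015-07-20T15:09:23.736-05:00
    case BasicOffset(dateTime, hours, minutes) => s"$dateTime$hours:$minutes"
    // Anything else (offset already has a colon, no offset, "Z") is left untouched.
    case _ => s
  }

  // Assumption: java.time is used here purely to show the normalized string parses.
  def parse(s: String): OffsetDateTime = OffsetDateTime.parse(normalizeOffset(s))
}
```

With this, `OffsetQuirk.parse("2015-07-20T15:09:23.736-0500")` would go through the same path as a value written with `-05:00`, while strings that already carry a colon in the offset are passed through unchanged.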

FWIW - I'm getting this data via a CSV export from Splunk.

BTW - Thanks for the link to the PR, that's actually another issue that I was 
wondering about.

> Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17545
>                 URL: https://issues.apache.org/jira/browse/SPARK-17545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nathan Beyer
>
> When parsing a CSV with a date/time column that contains an ISO 8601 variant 
> that doesn't include a colon in the offset, casting to Timestamp fails.
> Here's a simple example of the CSV content.
> {quote}
> time
> "2015-07-20T15:09:23.736-0500"
> "2015-07-20T15:10:51.687-0500"
> "2015-11-21T23:15:01.499-0600"
> {quote}
> Here's the stack trace that results from processing this data.
> {quote}
> 16/09/14 15:22:59 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
>       at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source)
>       at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source)
>       at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
>       at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source)
>       at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>       at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>       at javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>       at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>       at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
> {quote}
> Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python.
> https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

