[ 
https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18407:
-------------------------------------
    Description: 
[This assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
fails when you run a stream against JSON data that is stored in partitioned 
folders, if you manually specify the schema and that schema omits the 
partition columns.

My hunch is that we still infer those columns, even though the schema is 
passed in manually, and append them to the end.

While we are fixing this bug, it would be nice to make the assertion message 
more useful.  The current message is truncated and, at least in my case, the 
truncation cut off the most interesting part.  I changed it to this while 
debugging:

{code}
          s"""
             |Batch does not have expected schema
             |Expected: ${output.mkString(",")}
             |Actual: ${newPlan.output.mkString(",")}
             |
             |== Original ==
             |$logicalPlan
             |
             |== Batch ==
             |$newPlan
           """.stripMargin
{code}
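
To see what the un-truncated message looks like, the template above can be exercised outside Spark. This is only a sketch: the object name, the {{render}} helper, and the sample strings are hypothetical stand-ins for {{output}}, {{newPlan.output}}, {{logicalPlan}}, and {{newPlan}}, not code from StreamExecution.

```scala
object SchemaMismatchMessage {
  // Hypothetical helper: the parameters stand in for `output`,
  // `newPlan.output`, `logicalPlan` and `newPlan` in the snippet above.
  def render(
      expected: Seq[String],
      actual: Seq[String],
      originalPlan: String,
      batchPlan: String): String =
    s"""
       |Batch does not have expected schema
       |Expected: ${expected.mkString(",")}
       |Actual: ${actual.mkString(",")}
       |
       |== Original ==
       |$originalPlan
       |
       |== Batch ==
       |$batchPlan
     """.stripMargin

  def main(args: Array[String]): Unit = {
    // Made-up column names and plan strings, for illustration only.
    val message = render(
      expected = Seq("id"),
      actual = Seq("id", "part"),
      originalPlan = "<original plan>",
      batchPlan = "<batch plan>")
    println(message)
  }
}
```

Because nothing is truncated, both column lists and both plans appear in full, so an extra trailing column in the batch output is visible directly.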

I also tried specifying the partition columns in the schema and now it appears 
that they are filled with corrupted data.
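
A minimal sketch of the scenario described above, assuming a local Spark build with spark-sql on the classpath. The object name, the temporary directory, the column names {{id}} and {{part}}, and the query name are all hypothetical; this is not the reporter's actual job. On an affected version the stream is reported to fail with the assertion, so no expected output is shown here.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructType}

object PartitionedJsonStreamRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-18407-repro-sketch")
      .getOrCreate()

    // Hypothetical scratch location; the "data" subdirectory must not exist yet.
    val dir = Files.createTempDirectory("spark-18407").resolve("data").toString

    // Write JSON into partitioned folders: <dir>/part=0/..., <dir>/part=1/...
    spark.range(10)
      .selectExpr("id", "id % 2 AS part")
      .write
      .partitionBy("part")
      .json(dir)

    // User-specified schema that deliberately omits the partition column `part`.
    val schemaWithoutPartition = new StructType().add("id", LongType)

    val query = spark.readStream
      .schema(schemaWithoutPartition)
      .json(dir)
      .writeStream
      .format("memory")
      .queryName("repro")
      .start()

    try {
      // Blocks until the available files are processed, or rethrows the
      // stream's failure as a StreamingQueryException.
      query.processAllAvailable()
      spark.table("repro").printSchema()
    } finally {
      query.stop()
      spark.stop()
    }
  }
}
```

To try the second observation, add the partition column to the schema (for example {{.add("part", LongType)}}, again an assumption about its type) and inspect the values in the resulting table.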

  was:
[This assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
fails when you run a stream against JSON data that is stored in partitioned 
folders, if you manually specify the schema and that schema omits the 
partition columns.

My hunch is that we still infer those columns, even though the schema is 
passed in manually, and append them to the end.

While we are fixing this bug, it would be nice to make the assertion message 
more useful.  The current message is truncated and, at least in my case, the 
truncation cut off the most interesting part.  I changed it to this while 
debugging:

{code}
          s"""
             |Batch does not have expected schema
             |Expected: ${output.mkString(",")}
             |Actual: ${newPlan.output.mkString(",")}
             |
             |== Original ==
             |$logicalPlan
             |
             |== Batch ==
             |$newPlan
           """.stripMargin
{code}


> Inferred partition columns cause assertion error
> ------------------------------------------------
>
>                 Key: SPARK-18407
>                 URL: https://issues.apache.org/jira/browse/SPARK-18407
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.0.2
>            Reporter: Michael Armbrust
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
