[ https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000840#comment-15000840 ]

Virgil Palanciuc commented on SPARK-11657:
------------------------------------------

The same behaviour reproduces on a much simpler example:
{code}
> val df = sc.makeRDD(Seq("20206E25479B5C67992015A9")).toDF()
df: org.apache.spark.sql.DataFrame = [_1: string]
> df.show()
+--------------------+
|                  _1|
+--------------------+
|479B5C67992015A9...|
+--------------------+

> df.take(2)
res1: Array[org.apache.spark.sql.Row] = Array([479B5C67992015A9????????]) 
{code}
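One pattern worth noting (my observation, not an established root cause): the garbled value is the original string with its first 8 characters dropped and 8 unreadable characters appended, i.e. {{20206E25479B5C67992015A9}} comes back as {{479B5C67992015A9????????}}, as if the string were read at an 8-byte offset into its backing buffer. A quick check, sketched in plain Python since it is just string arithmetic:

```python
# Value fed into the DataFrame, and the readable part of what
# take()/show() returned for it (the trailing "????????" is unreadable).
original = "20206E25479B5C67992015A9"
garbled_readable = "479B5C67992015A9"

# The garbled output is the original shifted left by 8 characters.
print(original[8:] == garbled_readable)       # True
print(len(original) - len(garbled_readable))  # 8
```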

Also, for reference, the query plan:
{code}
== Parsed Logical Plan ==
Limit 21
 LogicalRDD [_1#1], MapPartitionsRDD[5] at stringRddToDataFrameHolder at <console>:21

== Analyzed Logical Plan ==
_1: string
Limit 21
 LogicalRDD [_1#1], MapPartitionsRDD[5] at stringRddToDataFrameHolder at <console>:21

== Optimized Logical Plan ==
Limit 21
 LogicalRDD [_1#1], MapPartitionsRDD[5] at stringRddToDataFrameHolder at <console>:21

== Physical Plan ==
Limit 21
 Scan PhysicalRDD[_1#1]

Code Generation: true
{code}
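The values in the original report (quoted below) fit the same shift: {{6A01CACD56169A947F000101}} came back from take() as {{56169A947F000101????????}}, and the 38-character array element likewise lost its first 8 characters and gained 8 unreadable ones. Sketched as a check (plain Python, just string comparison of the reported values):

```python
cases = [
    # (value returned by collect(), readable part of the take() result)
    ("6A01CACD56169A947F000101", "56169A947F000101"),
    ("77512098164594606101815510825479776971",
     "164594606101815510825479776971"),
]

# Each garbled value is the correct one minus its first 8 characters
# (the report's output then shows 8 unreadable characters appended).
for good, bad in cases:
    print(good[8:] == bad)  # True for both
```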


> Bad Dataframe data read from parquet
> ------------------------------------
>
>                 Key: SPARK-11657
>                 URL: https://issues.apache.org/jira/browse/SPARK-11657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.5.1, 1.5.2
>         Environment: EMR (yarn)
>            Reporter: Virgil Palanciuc
>            Priority: Critical
>         Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, clusterData: array<string>, dpid: int]
> scala> data.take(1)    /// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = Array([1,56169A947F000101????????,WrappedArray(164594606101815510825479776971????????),813])
>
> scala> data.collect()    /// this works
> res1: Array[org.apache.spark.sql.Row] = Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
