[jira] [Commented] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

Apache Spark (Jira) Wed, 12 Aug 2020 09:32:09 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176458#comment-17176458
 ]


Apache Spark commented on SPARK-31703:
--------------------------------------

User 'tinhto-000' has created a pull request for this issue:
https://github.com/apache/spark/pull/29419

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31703
>                 URL: https://issues.apache.org/jira/browse/SPARK-31703
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5, 3.0.0
>         Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>            Reporter: Michail Giannakopoulos
>            Assignee: Tin Hang To
>            Priority: Blocker
>              Labels: BigEndian, correctness
>             Fix For: 3.0.1, 3.1.0
>
>         Attachments: Data_problem_Spark.gif
>
>
> While upgrading to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC) 
> in order to read data stored in Parquet format, we noticed that values 
> of the DOUBLE and DECIMAL types are parsed incorrectly.
> According to the Parquet documentation, the plain encoding always stores 
> values in little-endian representation:
>  [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}
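
To illustrate the encoding rule quoted above, here is a minimal Java sketch of decoding a plain-encoded DOUBLE. It is not the code from the pull request; the class name PlainDoubleDecoder and its methods are hypothetical and exist only for this example. The point it shows is that the byte order must be fixed to little endian explicitly, because a reader that decodes with the platform's byte order only happens to work on little-endian hardware.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of Spark or Parquet.
final class PlainDoubleDecoder {

    // Decodes one plain-encoded DOUBLE (8 bytes, IEEE 754, little endian)
    // starting at the given offset, regardless of the host's byte order.
    static double decodeLittleEndian(byte[] page, int offset) {
        return ByteBuffer.wrap(page, offset, Double.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getDouble();
    }

    // Decodes the same 8 bytes with a caller-supplied byte order. Passing
    // ByteOrder.nativeOrder() here mimics a reader that trusts the host
    // architecture, which is wrong on big-endian machines.
    static double decodeWithOrder(byte[] page, int offset, ByteOrder order) {
        return ByteBuffer.wrap(page, offset, Double.BYTES)
                .order(order)
                .getDouble();
    }

    public static void main(String[] args) {
        // 1.5 is 0x3FF8000000000000 in IEEE 754; stored little endian,
        // the least significant byte comes first.
        byte[] page = {0, 0, 0, 0, 0, 0, (byte) 0xF8, 0x3F};

        System.out.println(decodeLittleEndian(page, 0)); // prints 1.5

        // Simulates what a big-endian host would see if it decoded the
        // bytes in its native order: a different, incorrect value.
        System.out.println(decodeWithOrder(page, 0, ByteOrder.BIG_ENDIAN));
    }
}
```

On a little-endian host, decodeWithOrder(page, 0, ByteOrder.nativeOrder()) returns the same value as decodeLittleEndian, which is why a mistake of this kind can go unnoticed until the code runs on AIX or big-endian LinuxPPC64.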



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
