[ 
https://issues.apache.org/jira/browse/SPARK-48460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-48460.
----------------------------------------
    Resolution: Not A Bug

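It is a clear design decision in ORC: string minimum/maximum values longer than 
1024 bytes are not recorded as exact values. Only a truncated bound is kept, and 
the minimum/maximum getters return null in that case.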

See the ORC implementation:

0) The limit, defined as a constant: 
https://github.com/apache/orc/blob/4cbe9db7b76/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L655
1) The getMinimum() method, which returns null if isLowerBoundSet is true: 
https://github.com/apache/orc/blob/4cbe9db7b76/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L801
2) The place where isLowerBoundSet is set to true:
https://github.com/apache/orc/blob/4cbe9db7b76/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L705-L710
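
To illustrate the three points above, here is a minimal, self-contained sketch 
of the behaviour. It is not the ORC code: the class name StringStatsSketch and 
the helper ofLength are made up for this example, only the minimum is modelled, 
and the input is assumed to be ASCII so that characters and bytes coincide. The 
names MAX_BYTES_RECORDED, isLowerBoundSet and getMinimum follow the linked 
source.

```java
import java.util.Arrays;

/** Simplified model of the truncation behaviour; NOT the real ORC class. */
final class StringStatsSketch {
    // Same value as the limit referenced in item 0.
    static final int MAX_BYTES_RECORDED = 1024;

    private String minimum;          // stored value, possibly a truncated prefix
    private boolean isLowerBoundSet; // true when 'minimum' is only a truncated bound

    /** Records a value. Assumes ASCII input, so chars and bytes coincide. */
    void update(String value) {
        if (minimum != null && value.compareTo(minimum) >= 0) {
            return; // not smaller than what is already stored
        }
        if (value.length() > MAX_BYTES_RECORDED) {
            // Too long: keep only a truncated prefix and flag it as a bound.
            minimum = value.substring(0, MAX_BYTES_RECORDED);
            isLowerBoundSet = true;
        } else {
            minimum = value;
            isLowerBoundSet = false;
        }
    }

    /** Exact minimum, or null when only a truncated bound is known (item 1). */
    String getMinimum() {
        return isLowerBoundSet ? null : minimum;
    }

    /** The stored value, whether exact or truncated. */
    String getLowerBound() {
        return minimum;
    }

    /** Hypothetical helper: a string of n 'A' characters. */
    static String ofLength(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'A');
        return new String(chars);
    }

    public static void main(String[] args) {
        for (int len : new int[] {1024, 1025}) {
            StringStatsSketch stats = new StringStatsSketch();
            stats.update(ofLength(len));
            String min = stats.getMinimum();
            System.out.println(len + " chars -> min "
                + (min == null ? "is null (truncated)" : "has length " + min.length()));
        }
    }
}
```

With a 1024-character value the sketch reports an exact minimum, while a 
1025-character value yields a null minimum, which corresponds to the 
"min: null max: null" lines in the orcfiledump output quoted below.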

> Spark ORC writer generates incorrect meta information(min, max)
> ---------------------------------------------------------------
>
>                 Key: SPARK-48460
>                 URL: https://issues.apache.org/jira/browse/SPARK-48460
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.3, 3.4.2, 
> 3.3.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1, 3.3.4, 3.4.3
>            Reporter: Volodymyr T
>            Priority: Major
>
> We found that Hive cannot concatenate some ORC files generated by Spark 3.2.1 
> and higher versions which contain long strings.
> Steps to reproduce the issue:
> 1) Create DF with a string longer than 1024
>  
> {code:java}
> val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;"){code}
> {code:java}
> val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;"){code}
> {code:java}
> valid.withColumn("len", length($"string")).show()
> +---+----+--------------------+----+
> | id|null|              string| len|
> +---+----+--------------------+----+
> |  1|null|AAAAAAAAAAAAAAAAA...|1024|
> +---+----+--------------------+----+{code}
> {code:java}
> invalid.withColumn("len", length($"string")).show()
> +---+----+--------------------+----+
> | id|null|              string| len|
> +---+----+--------------------+----+
> |  1|null|AAAAAAAAAAAAAAAAA...|1025|
> +---+----+--------------------+----+{code}
> 2) Write the DF in ORC format to S3
> {code:java}
> valid.write.format("orc")
>       .option("path", "s3://bucket/test/test_orc/")
>       .option("compression", "zlib")
>       .mode("overwrite")
>       .save(){code}
> 3) Check the ORC metadata with the *hive --orcfiledump* command
> {code:java}
> [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/{code}
> We can see incorrect statistics for column string
> {code:java}
> Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025{code}
> {code:java}
> Processing data file s3://bucket-dev/tets/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]
> Structure for s3://timmedia-dev/volodymyr/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc
> File Version: 0.12 with FUTURE
> Rows: 1
> Compression: ZLIB
> Compression size: 262144
> Type: struct<id:int,null:string,string:string>
>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 1 hasNull: false
>     Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
>     Column 2: count: 0 hasNull: true bytesOnDisk: 5
>     Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025
>
> File Statistics:
>   Column 0: count: 1 hasNull: false
>   Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
>   Column 2: count: 0 hasNull: true bytesOnDisk: 5
>   Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025
>
> Stripes:
>   Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108
>     Stream: column 0 section ROW_INDEX start: 3 length 11
>     Stream: column 1 section ROW_INDEX start: 14 length 24
>     Stream: column 2 section ROW_INDEX start: 38 length 19
>     Stream: column 3 section ROW_INDEX start: 57 length 54
>     Stream: column 1 section DATA start: 111 length 6
>     Stream: column 2 section PRESENT start: 117 length 5
>     Stream: column 2 section DATA start: 122 length 0
>     Stream: column 2 section LENGTH start: 122 length 0
>     Stream: column 2 section DICTIONARY_DATA start: 122 length 0
>     Stream: column 3 section DATA start: 122 length 16
>     Stream: column 3 section LENGTH start: 138 length 7
>     Encoding column 0: DIRECT
>     Encoding column 1: DIRECT_V2
>     Encoding column 2: DICTIONARY_V2[0]
>     Encoding column 3: DIRECT_V2
>
> File length: 488 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
>
> User Metadata:
>   org.apache.spark.version=3.4.1{code}
> For a DF whose string value is not longer than 1024 characters, we can see valid statistics
> {code:java}
> hive --orcfiledump s3://bucket/test/test_orc
> Processing data file s3://bucket/test/test_orc/part-00000-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc [length: 485]
> Structure for s3://timmedia-dev/volodymyr/test_orc/part-00000-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc
> File Version: 0.12 with FUTURE
> Rows: 1
> Compression: ZLIB
> Compression size: 262144
> Type: struct<id:int,null:string,string:string>
>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 1 hasNull: false
>     Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
>     Column 2: count: 0 hasNull: true bytesOnDisk: 5
>     Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: 
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>  max: 
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>  sum: 1024
> File Statistics:
>   Column 0: count: 1 hasNull: false
>   Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
>   Column 2: count: 0 hasNull: true bytesOnDisk: 5
>   Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: 
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>  max: 
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>  sum: 1024
> Stripes:
>   Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 107
>     Stream: column 0 section ROW_INDEX start: 3 length 11
>     Stream: column 1 section ROW_INDEX start: 14 length 24
>     Stream: column 2 section ROW_INDEX start: 38 length 19
>     Stream: column 3 section ROW_INDEX start: 57 length 53
>     Stream: column 1 section DATA start: 110 length 6
>     Stream: column 2 section PRESENT start: 116 length 5
>     Stream: column 2 section DATA start: 121 length 0
>     Stream: column 2 section LENGTH start: 121 length 0
>     Stream: column 2 section DICTIONARY_DATA start: 121 length 0
>     Stream: column 3 section DATA start: 121 length 16
>     Stream: column 3 section LENGTH start: 137 length 7
>     Encoding column 0: DIRECT
>     Encoding column 1: DIRECT_V2
>     Encoding column 2: DICTIONARY_V2[0]
>     Encoding column 3: DIRECT_V2
>
> File length: 485 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
>
> User Metadata:
>   org.apache.spark.version=3.4.1
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
