[ 
https://issues.apache.org/jira/browse/ARROW-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581543#comment-17581543
 ] 

Jorge Leitão edited comment on ARROW-17465 at 8/19/22 4:49 AM:
---------------------------------------------------------------

Agreed. The issue was not very clear: we currently only support reading 
DELTA_BINARY_PACKED, and it is the read path that currently bails.

Note that pyspark can read it. Attached to this comment is a minimal parquet 
file reproducing the issue. The code example below results in:

 {code:python}
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
22/08/18 21:14:28 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
spark reads it                                                                  
Traceback (most recent call last):
  File "bla.py", line 16, in <module>
    table = pyarrow.parquet.read_table("test.parquet")
  File 
"/home/azureuser/projects/arrow2/arrow-parquet-integration-testing/venv/lib/python3.8/site-packages/pyarrow/parquet/__init__.py",
 line 2827, in read_table
    ...
OSError: delta bit width larger than integer bit width
{code}

which is how I concluded that this is something specific to Arrow :)

{code:python}
import pyarrow.parquet
import pyspark.sql


spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

result = spark.read.parquet("test.parquet").collect()
assert [r["c1"] for r in result] == [863490391, -816295192, 1613070492, 
-1166045478, 1856530847]
print("spark reads it")

table = pyarrow.parquet.read_table("test.parquet")
print("pyarrow reads it")
{code}


 [^test.parquet] 
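To make the 33-bit claim concrete, here is a minimal sketch (plain Python, not the actual C++ encoder) of the frame-of-reference step DELTA_BINARY_PACKED uses: deltas between consecutive int32 values, shifted by the minimum delta, can span up to 33 bits even though the values themselves fit in 32.

{code:python}
# Sketch: why this int32 sequence needs a 33-bit miniblock width.
values = [863490391, -816295192, 1613070492, -1166045478, 1856530847]

# Deltas between consecutive values (arbitrary-precision here;
# the encoder wraps in two's complement, which is equivalent modulo 2^32).
deltas = [b - a for a, b in zip(values, values[1:])]
min_delta = min(deltas)

# DELTA_BINARY_PACKED stores each delta as (delta - min_delta),
# a non-negative quantity packed with the miniblock's bit_width.
adjusted = [d - min_delta for d in deltas]
bit_width = max(adjusted).bit_length()

print(bit_width)  # 33: one bit wider than the 32-bit values being encoded
{code}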



> [Parquet] DELTA_BINARY_PACKED constraint on num_bits is too restrict?
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17465
>                 URL: https://issues.apache.org/jira/browse/ARROW-17465
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>         Attachments: test.parquet
>
>
> Consider the sequence of (int32) values
> [863490391,-816295192,1613070492,-1166045478,1856530847]
> This sequence can be encoded as a single block, single miniblock with a 
> bit_width of 33.
> However, we currently require [1] the bit_width of each miniblock to be 
> smaller than the bitwidth of the type it encodes.
> We could consider lifting this constraint, as, as shown in the example above, 
> the values representation's `bit_width` can be smaller than the delta's 
> representation's `bit_width`.
> [1] 
> https://github.com/apache/arrow/blob/a376968089d7310f4a88d054822fa1eaf96c46f5/cpp/src/parquet/encoding.cc#L2173



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
