[ https://issues.apache.org/jira/browse/PARQUET-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123728#comment-17123728 ]

Joris Van den Bossche commented on PARQUET-1869:
------------------------------------------------

Additional observation: if you add decimals to the large value, this actually 
works fine:

{code}
In [16]: arr = pa.array([decimal.Decimal('9223372036854775808.111'), decimal.Decimal('1.111')])

In [17]: arr
Out[17]:
<pyarrow.lib.Decimal128Array object at 0x7f020b3bdb88>
[
  9223372036854775808.111,
  1.111
]

In [18]: arr.type
Out[18]: Decimal128Type(decimal(22, 3))
{code}

(So I only added ".111" to the large value, which I would think should not make 
much of a difference: the second value already ensures that the resulting 
decimals need 3 fractional digits.)

Now the type is "decimal(22, 3)", and roundtripping to/from Parquet works 
correctly. And it indeed uses more bytes now than with the original example:

{code}
# original example, where the arrow type is "decimal(19, 3)"
>>> pq.read_metadata("test_decimal.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f01ef7805c0>
required group field_id=0 schema {
  optional fixed_len_byte_array(9) field_id=1 a (Decimal(precision=19, scale=3));
}

# new example, where the arrow type is "decimal(22, 3)"
In [15]: pq.read_metadata("test_decimal2.parquet").schema
Out[15]:
<pyarrow._parquet.ParquetSchema object at 0x7f01ef797860>
required group field_id=0 schema {
  optional fixed_len_byte_array(10) field_id=1 a (Decimal(precision=22, scale=3));
}
{code}
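Those 9- and 10-byte widths match the usual minimum storage for a signed unscaled integer of the given precision, and notably, wrapping the unscaled value into 9 bytes (72 bits, two's complement) reproduces the exact corrupted value from the original report. The following is a sketch of that arithmetic in plain Python; it is a hypothesis about the mechanism, not a confirmed trace of the C++ writer:

```python
import math
from decimal import Decimal

def min_bytes(precision):
    # Smallest byte count whose signed range covers any `precision`-digit
    # unscaled decimal integer: ceil((precision * log2(10) + 1) / 8).
    return math.ceil((precision * math.log2(10) + 1) / 8)

print(min_bytes(19), min_bytes(22))  # → 9 10, matching the widths above

# Reduce the unscaled value modulo 2^72 (9 bytes, two's complement), the
# width implied by decimal(19, 3) in the original file.
unscaled = 9223372036854775808 * 10**3      # value * 10^scale, scale = 3
bits = 9 * 8
wrapped = unscaled % (1 << bits)
if wrapped >= 1 << (bits - 1):
    wrapped -= 1 << bits                    # interpret as signed

print(Decimal(wrapped).scaleb(-3))          # → -221360928884514619.392
```

The wrapped result is exactly the value read back in the original reproducer, which supports the idea that the writer truncates the unscaled value to a byte width that is too narrow for it.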




> [C++] Large decimal values don't roundtrip correctly
> ----------------------------------------------------
>
>                 Key: PARQUET-1869
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1869
>             Project: Parquet
>          Issue Type: Test
>          Components: parquet-cpp
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Reproducer with python:
> {code}
> import decimal
> import pyarrow as pa
> import pyarrow.parquet as pq
> arr = pa.array([decimal.Decimal('9223372036854775808'), decimal.Decimal('1.111')])
> print(arr)
> pq.write_table(pa.table({'a': arr}), "test_decimal.parquet") 
> result = pq.read_table("test_decimal.parquet")
> print(result.column('a'))
> {code}
> gives
> {code}
> # before writing
> <pyarrow.lib.Decimal128Array object at 0x7fd07d79a468>
> [
>   9223372036854775808.000,
>   1.111
> ]
> # after reading
> <pyarrow.lib.ChunkedArray object at 0x7fd0711e9f98>
> [
>   [
>     -221360928884514619.392,
>     1.111
>   ]
> ]
> {code}
> I tried reading the file with a different Parquet implementation (the 
> fastparquet Python package), and it gives the same corrupted values on read, 
> so the issue is likely on the write side rather than the read side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
