curioustien commented on code in PR #45351:
URL: https://github.com/apache/arrow/pull/45351#discussion_r1930922813
##########
python/pyarrow/tests/parquet/test_basic.py:
##########
@@ -364,9 +364,9 @@ def test_byte_stream_split():
def test_store_decimal_as_integer(tempdir):
arr_decimal_1_9 = pa.array(list(map(Decimal, range(100))),
- type=pa.decimal128(5, 2))
+ type=pa.decimal32(5, 2))
Review Comment:
It's a little tricky to keep this test the same if we don't cast the type to
a wider decimal. On the writer side, we can keep the same Arrow-to-Parquet
behavior while adding support for decimal32/64.
However, on the reader side (Parquet to Arrow), the Parquet decimal format
only stores `precision` and `scale`, with no knowledge of the different Arrow
types (which is the correct behavior here). Therefore, to do the conversion,
we look at the `precision` and map it to `decimal32/64/128/256` accordingly.
For this test, which round-trips the data by writing to Parquet and reading
it back, the correct end result when we read the data should be `decimal32`.
I can modify this test case to cast the returned decimal to a wider decimal
if that's what you meant.
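For reference, here's a minimal sketch of the precision-based mapping
described above (not the actual reader code; it assumes the
`arrow::decimal32/64` factory functions from the decimal32/64 support and
uses the standard Arrow decimal precision limits of 9, 18, 38, and 76
digits):

```cpp
#include <cstdint>
#include <memory>

#include "arrow/type.h"

// Hypothetical helper: pick the narrowest Arrow decimal type that can hold
// a Parquet decimal with the given precision.
std::shared_ptr<arrow::DataType> DecimalTypeFromParquet(int32_t precision,
                                                        int32_t scale) {
  if (precision <= 9) return arrow::decimal32(precision, scale);
  if (precision <= 18) return arrow::decimal64(precision, scale);
  if (precision <= 38) return arrow::decimal128(precision, scale);
  return arrow::decimal256(precision, scale);  // up to 76 digits
}
```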
##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2383,29 +2391,45 @@ struct SerializeFunctor<
int64_t non_null_count = array.length() - array.null_count();
int64_t size = non_null_count * ArrowType::kByteWidth;
scratch_buffer = AllocateBuffer(ctx->memory_pool, size);
- scratch = reinterpret_cast<int64_t*>(scratch_buffer->mutable_data());
+ scratch_i32 = reinterpret_cast<int32_t*>(scratch_buffer->mutable_data());
+ scratch_i64 = reinterpret_cast<int64_t*>(scratch_buffer->mutable_data());
Review Comment:
I'm not sure how to handle this code in a clean way. The main reason I had
to do this is the `int64_t* scratch` pointer we're using. IIUC, the current
code constructs the big-endian decimal array in this scratch space, and the
scratch pointer advances through the buffer as each value is written.
The current logic looks at the `byte_width` to determine how many words to
consume from the input when constructing each decimal. An `int64_t* scratch`
pointer works for `decimal64`, `decimal128`, and `decimal256`, since their
byte widths are multiples of 8. However, it doesn't work for `decimal32`,
whose values are only 4 bytes wide, so I had to create a second pointer with
`int32_t*`.
I may misunderstand how this code works, so feel free to correct me here.
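As a minimal sketch of the stride problem described above (hypothetical
names, not the actual `SerializeFunctor`): incrementing a pointer advances
it by the size of its pointee, so a 64-bit scratch pointer can only step
through values whose byte width is a multiple of 8, and decimal32 needs a
4-byte stride:

```cpp
#include <cstdint>
#include <cstring>

// Copy fixed-width decimal values into a scratch buffer, choosing the
// pointer width that matches the value stride. Assumes the scratch buffer
// is suitably aligned (as a memory-pool allocation would be).
void FillScratch(const uint8_t* values, int64_t num_values,
                 int32_t byte_width, uint8_t* scratch_buffer) {
  if (byte_width == 4) {
    // decimal32: one 32-bit word per value, so each ++ moves 4 bytes.
    auto* scratch = reinterpret_cast<int32_t*>(scratch_buffer);
    for (int64_t i = 0; i < num_values; ++i) {
      std::memcpy(scratch++, values + i * 4, 4);
    }
  } else {
    // decimal64/128/256: byte_width / 8 64-bit words per value, so each ++
    // moves 8 bytes and always lands on a value or word boundary.
    auto* scratch = reinterpret_cast<int64_t*>(scratch_buffer);
    const int64_t num_words = num_values * (byte_width / 8);
    for (int64_t i = 0; i < num_words; ++i) {
      std::memcpy(scratch++, values + i * 8, 8);
    }
  }
}
```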
##########
cpp/src/arrow/compute/kernels/vector_hash.cc:
##########
@@ -555,6 +555,7 @@ KernelInit GetHashInit(Type::type type_id) {
case Type::DATE32:
case Type::TIME32:
case Type::INTERVAL_MONTHS:
+ case Type::DECIMAL32:
Review Comment:
I'm down to split this change into another PR, which can cover this support
with more tests on the Arrow compute side. But yes, there are a few Parquet
tests that hit the Arrow vector-kernel code path.
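For context, a minimal sketch (assuming the Arrow C++ compute API and the
`arrow::decimal32` factory) of the kind of operation that reaches
`GetHashInit`: dictionary-encoding a decimal32 array hashes its values,
dispatching on `Type::DECIMAL32` in the switch above:

```cpp
#include "arrow/api.h"
#include "arrow/compute/api.h"

arrow::Status EncodeDecimal32() {
  // An all-null placeholder array is enough to exercise kernel dispatch.
  ARROW_ASSIGN_OR_RAISE(auto arr,
                        arrow::MakeArrayOfNull(arrow::decimal32(5, 2), 3));
  // DictionaryEncode hashes the input values, exercising the hash-kernel
  // init path that the DECIMAL32 case above covers.
  ARROW_ASSIGN_OR_RAISE(auto encoded, arrow::compute::DictionaryEncode(arr));
  return arrow::Status::OK();
}
```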
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]