curioustien commented on code in PR #45351:
URL: https://github.com/apache/arrow/pull/45351#discussion_r1930922813
##########
python/pyarrow/tests/parquet/test_basic.py:
##########
@@ -364,9 +364,9 @@ def test_byte_stream_split():
def test_store_decimal_as_integer(tempdir):
arr_decimal_1_9 = pa.array(list(map(Decimal, range(100))),
- type=pa.decimal128(5, 2))
+ type=pa.decimal32(5, 2))
Review Comment:
It's a little tricky to keep this test the same if we don't cast the type to
a wider decimal. On the writer side, we can keep the same Arrow-to-Parquet
behavior while adding support for decimal32/64.
However, on the reader side (Parquet to Arrow), the Parquet decimal format
only stores `precision` and `scale`, with no knowledge of the different Arrow
types (which is the correct behavior here). Therefore, to do the conversion,
we look at the `precision` and map it to `decimal32/64/128/256` accordingly.
For this test, which round-trips the data by writing to Parquet and reading
it back, the correct end result when we read the data should be `decimal32`.
I can modify this test case to cast the returned decimal to a wider decimal
if that's what you meant.
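For reference, here's a minimal sketch of the precision-based mapping
described above (not the actual reader code; it assumes the
`arrow::decimal32/64` factory functions from the decimal32/64 support and
uses the standard Arrow decimal precision limits of 9, 18, 38, and 76
digits):

```cpp
#include <cstdint>
#include <memory>

#include "arrow/type.h"

// Hypothetical helper: pick the narrowest Arrow decimal type that can hold
// a Parquet decimal with the given precision.
std::shared_ptr<arrow::DataType> DecimalTypeFromParquet(int32_t precision,
                                                        int32_t scale) {
  if (precision <= 9) return arrow::decimal32(precision, scale);
  if (precision <= 18) return arrow::decimal64(precision, scale);
  if (precision <= 38) return arrow::decimal128(precision, scale);
  return arrow::decimal256(precision, scale);  // up to 76 digits
}
```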
##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2383,29 +2391,45 @@ struct SerializeFunctor<
int64_t non_null_count = array.length() - array.null_count();
int64_t size = non_null_count * ArrowType::kByteWidth;
scratch_buffer = AllocateBuffer(ctx->memory_pool, size);
- scratch = reinterpret_cast<int64_t*>(scratch_buffer->mutable_data());
+ scratch_i32 = reinterpret_cast<int32_t*>(scratch_buffer->mutable_data());
+ scratch_i64 = reinterpret_cast<int64_t*>(scratch_buffer->mutable_data());
Review Comment:
I'm not sure how to handle this code in a clean way. The main reason I had
to do this is the `int64_t* scratch` pointer we're using. IIUC, the current
code constructs the big-endian decimal array in this scratch space, and the
scratch pointer advances through the buffer as each value is written.
The current logic looks at the `byte_width` to determine how many words to
consume from the input when constructing each decimal. An `int64_t* scratch`
pointer works for `decimal64`, `decimal128`, and `decimal256`, since their
byte widths are multiples of 8. However, it doesn't work for `decimal32`,
whose values are only 4 bytes wide, so I had to create a second pointer with
`int32_t*`.
I may misunderstand how this code works, so feel free to correct me here.
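As a minimal sketch of the stride problem described above (hypothetical
names, not the actual `SerializeFunctor`): incrementing a pointer advances
it by the size of its pointee, so a 64-bit scratch pointer can only step
through values whose byte width is a multiple of 8, and decimal32 needs a
4-byte stride:

```cpp
#include <cstdint>
#include <cstring>

// Copy fixed-width decimal values into a scratch buffer, choosing the
// pointer width that matches the value stride. Assumes the scratch buffer
// is suitably aligned (as a memory-pool allocation would be).
void FillScratch(const uint8_t* values, int64_t num_values,
                 int32_t byte_width, uint8_t* scratch_buffer) {
  if (byte_width == 4) {
    // decimal32: one 32-bit word per value, so each ++ moves 4 bytes.
    auto* scratch = reinterpret_cast<int32_t*>(scratch_buffer);
    for (int64_t i = 0; i < num_values; ++i) {
      std::memcpy(scratch++, values + i * 4, 4);
    }
  } else {
    // decimal64/128/256: byte_width / 8 64-bit words per value, so each ++
    // moves 8 bytes and always lands on a value or word boundary.
    auto* scratch = reinterpret_cast<int64_t*>(scratch_buffer);
    const int64_t num_words = num_values * (byte_width / 8);
    for (int64_t i = 0; i < num_words; ++i) {
      std::memcpy(scratch++, values + i * 8, 8);
    }
  }
}
```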
##########
cpp/src/arrow/compute/kernels/vector_hash.cc:
##########
@@ -555,6 +555,7 @@ KernelInit GetHashInit(Type::type type_id) {
case Type::DATE32:
case Type::TIME32:
case Type::INTERVAL_MONTHS:
+ case Type::DECIMAL32:
Review Comment:
I'm down to split this change into another PR, which can cover this support
with more tests on the Arrow compute side. But yes, there are a few Parquet
tests that hit the Arrow vector-kernel code path.
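For context, a minimal sketch (assuming the Arrow C++ compute API and the
`arrow::decimal32` factory) of the kind of operation that reaches
`GetHashInit`: dictionary-encoding a decimal32 array hashes its values,
dispatching on `Type::DECIMAL32` in the switch above:

```cpp
#include "arrow/api.h"
#include "arrow/compute/api.h"

arrow::Status EncodeDecimal32() {
  // An all-null placeholder array is enough to exercise kernel dispatch.
  ARROW_ASSIGN_OR_RAISE(auto arr,
                        arrow::MakeArrayOfNull(arrow::decimal32(5, 2), 3));
  // DictionaryEncode hashes the input values, exercising the hash-kernel
  // init path that the DECIMAL32 case above covers.
  ARROW_ASSIGN_OR_RAISE(auto encoded, arrow::compute::DictionaryEncode(arr));
  return arrow::Status::OK();
}
```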
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]