boneanxs commented on PR #4818: URL: https://github.com/apache/incubator-gluten/pull/4818#issuecomment-1985410134
> @zhztheplayer can you check how the memory is allocated during the conversion? Where is the arrow memory allocated? How many memcpys happen during the conversion? Is there an onheap=>offheap copy?

Hey @FelixYBW, for each columnar batch, before the conversion this PR allocates offheap memory to perform the `spark columnar batch` -> `arrow array` conversion. (This PR doesn't treat onheap and offheap Spark columnar batches differently; it uses `ArrowFieldWriter`, which is implemented by Spark, to do the transformation. See `ArrowColumnarBatchConverter#write`, which is newly added by this PR.) So yes, there will be one memcpy from `spark columnar batch` to `arrow array`, no matter whether the `spark columnar batch` is on heap or off heap. On the native side, I simply use `importFromArrow` to convert the `arrow array` to a `velox columnar batch`; there will be some additional memcpys for String, timestamp, shortDecimal, etc. (I haven't fully checked).

> What I was worried about is ArrowWritableColumnarVector have not actually been under active maintenance for a period of time so we should have more tests here especially for complex data types.

Sorry @zhztheplayer, I might be missing something: are you referring to `ArrowFieldWriter`? This PR uses `ArrowFieldWriter` to do the conversion; it's widely used by PySpark, so reusing it here should be safe.

> Also would be great if you could share thoughts about the risk of memory leaks this approach may bring @boneanxs .

The extra memory acquired here is for `arrowArray`, `cSchema`, and the `velox columnar batch`, and all of it is handled by `TaskResources` and `recyclePayload`: `arrowArray` and the `velox columnar batch` are released on each iteration, while `cSchema` is released when the iterator ends. `TaskResources` is the extra assurance that all allocated memory gets released if the iterator is interrupted abnormally.

> I have an impression that parquet-mr can make use of offheap memory for columnar data.
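The release discipline described above can be sketched in plain Java. This is an illustrative model only: `TaskResourceRegistry` and `NativeHandle` are hypothetical stand-ins for Gluten's `TaskResources` and the native `ArrowArray`/`ArrowSchema` handles, not the project's actual API. The point it shows is the two lifetimes (per-iteration vs. per-iterator) plus a task-level registry as the safety net for abnormal termination.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ResourceLifecycleSketch {
    /** Stand-in for a natively allocated resource (e.g. ArrowArray, ArrowSchema). */
    static class NativeHandle implements AutoCloseable {
        final String name;
        boolean released = false;
        NativeHandle(String name) { this.name = name; }
        @Override public void close() { released = true; }
    }

    /** Stand-in for TaskResources: releases everything still registered, LIFO. */
    static class TaskResourceRegistry implements AutoCloseable {
        private final Deque<NativeHandle> handles = new ArrayDeque<>();
        NativeHandle register(NativeHandle h) { handles.push(h); return h; }
        @Override public void close() {
            while (!handles.isEmpty()) handles.pop().close();
        }
    }

    public static void main(String[] args) {
        try (TaskResourceRegistry task = new TaskResourceRegistry()) {
            // cSchema lives for the whole iterator; the registry owns it.
            NativeHandle cSchema = task.register(new NativeHandle("cSchema"));
            for (int batch = 0; batch < 3; batch++) {
                // arrowArray (and the resulting velox batch) live for one iteration.
                NativeHandle arrowArray = new NativeHandle("arrowArray-" + batch);
                try {
                    // ... spark columnar batch -> arrow array -> velox batch ...
                } finally {
                    arrowArray.close(); // released during each iteration
                }
            }
        } // registry releases cSchema here, or early if the loop throws
        System.out.println("all resources released");
    }
}
```

If the loop body throws, the try-with-resources on the registry still runs, which is the "extra assurance" role `TaskResources` plays in the PR.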
> If so the best case is that we can avoid any memcpy during the conversion

I'm not sure whether parquet-mr can directly use offheap memory, but Spark can export parquet data as an offheap columnar batch by enabling `spark.sql.columnVector.offheap.enabled`. Directly reusing an offheap columnar batch as an arrow array is not supported in this PR (by the way, can we do so? Not sure about this).

I can also try to run benchmarks against `VanillaColumnarBatchToRow` -> `RowToVeloxColumnar` if necessary. I think at least in terms of memcpy, `RowToVeloxColumnar` also does an extra copy even if the rows come from an offheap ColumnarBatch.
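For reference, the offheap columnar export mentioned above might be enabled like this. Only `spark.sql.columnVector.offheap.enabled` is named in the comment; pairing it with Spark's documented `spark.memory.offHeap.*` settings, the 2g size, and the `your-app.jar` placeholder are assumptions for illustration.

```shell
# Assumption: columnVector offheap export paired with Spark's offheap memory
# settings; size and application jar are placeholders.
spark-submit \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  --conf spark.sql.columnVector.offheap.enabled=true \
  your-app.jar
```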
