boneanxs commented on PR #4818: URL: https://github.com/apache/incubator-gluten/pull/4818#issuecomment-1985410134
> @zhztheplayer can you check how the memory is allocated during the conversion? Where is the arrow memory allocated? How many memcpys happen during the conversion? Is there an onheap=>offheap copy?

Hey @FelixYBW, for each columnar batch, before the conversion this PR allocates offheap memory to perform the `spark columnar batch` -> `arrow array` conversion. (This PR doesn't treat onheap and offheap Spark columnar batches differently; it uses `ArrowFieldWriter`, which is implemented by Spark, to do the transformation. See `ArrowColumnarBatchConverter#write`, which is newly added by this PR.) So yes, there will be one memcpy from `spark columnar batch` to `arrow array`, no matter whether the `spark columnar batch` is on heap or off heap. On the native side, I simply use `importFromArrow` to convert the `arrow array` to a `velox columnar batch`; there will be some additional memcpys for String, timestamp, shortDecimal, etc. (I haven't fully checked).

> What I was worried about is ArrowWritableColumnarVector have not actually been under active maintenance for a period of time so we should have more tests here especially for complex data types.

Sorry @zhztheplayer, I might be missing something: are you referring to `ArrowFieldWriter`? This PR uses `ArrowFieldWriter` to do the conversion; it's widely used by PySpark, so reusing it here should be safe.

> Also would be great if you could share thoughts about the risk of memory leaks this approach may bring @boneanxs .

The extra memory acquired here is for `arrowArray`, `cSchema`, and the `velox columnar batch`, and all of it is handled by `TaskResources` and `recyclePayload`: `arrowArray` and the `velox columnar batch` are released on each iteration, while `cSchema` is released when the iterator ends. `TaskResources` is the extra assurance that all allocated memory gets released if the iterator is interrupted abnormally.

> I have an impression that parquet-mr can make use of offheap memory for columnar data.
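The release discipline described above can be sketched in plain Java. This is an illustrative model only: `TaskResourceRegistry` and `NativeHandle` are hypothetical stand-ins for Gluten's `TaskResources` and the native `ArrowArray`/`ArrowSchema` handles, not the project's actual API. The point it shows is the two lifetimes (per-iteration vs. per-iterator) plus a task-level registry as the safety net for abnormal termination.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ResourceLifecycleSketch {
    /** Stand-in for a natively allocated resource (e.g. ArrowArray, ArrowSchema). */
    static class NativeHandle implements AutoCloseable {
        final String name;
        boolean released = false;
        NativeHandle(String name) { this.name = name; }
        @Override public void close() { released = true; }
    }

    /** Stand-in for TaskResources: releases everything still registered, LIFO. */
    static class TaskResourceRegistry implements AutoCloseable {
        private final Deque<NativeHandle> handles = new ArrayDeque<>();
        NativeHandle register(NativeHandle h) { handles.push(h); return h; }
        @Override public void close() {
            while (!handles.isEmpty()) handles.pop().close();
        }
    }

    public static void main(String[] args) {
        try (TaskResourceRegistry task = new TaskResourceRegistry()) {
            // cSchema lives for the whole iterator; the registry owns it.
            NativeHandle cSchema = task.register(new NativeHandle("cSchema"));
            for (int batch = 0; batch < 3; batch++) {
                // arrowArray (and the resulting velox batch) live for one iteration.
                NativeHandle arrowArray = new NativeHandle("arrowArray-" + batch);
                try {
                    // ... spark columnar batch -> arrow array -> velox batch ...
                } finally {
                    arrowArray.close(); // released during each iteration
                }
            }
        } // registry releases cSchema here, or early if the loop throws
        System.out.println("all resources released");
    }
}
```

If the loop body throws, the try-with-resources on the registry still runs, which is the "extra assurance" role `TaskResources` plays in the PR.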
> If so the best case is that we can avoid any memcpy during the conversion

I'm not sure whether parquet-mr can directly use offheap memory, but Spark can export parquet data as an offheap columnar batch by enabling `spark.sql.columnVector.offheap.enabled`. Directly reusing an offheap columnar batch as an arrow array is not supported in this PR (by the way, can we do so? Not sure about this).

I can also try to run benchmarks against `VanillaColumnarBatchToRow` -> `RowToVeloxColumnar` if necessary. I think at least in terms of memcpy, `RowToVeloxColumnar` also does an extra copy even if the rows come from an offheap ColumnarBatch.
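For reference, the offheap columnar export mentioned above might be enabled like this. Only `spark.sql.columnVector.offheap.enabled` is named in the comment; pairing it with Spark's documented `spark.memory.offHeap.*` settings, the 2g size, and the `your-app.jar` placeholder are assumptions for illustration.

```shell
# Assumption: columnVector offheap export paired with Spark's offheap memory
# settings; size and application jar are placeholders.
spark-submit \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  --conf spark.sql.columnVector.offheap.enabled=true \
  your-app.jar
```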
