kylebarron opened a new issue, #39017: URL: https://github.com/apache/arrow/issues/39017
### Describe the enhancement requested

tl;dr: the minimal change required to `postMessage` Arrow JS objects without monkey-patching them is to assign `typeId` as an attribute on the `DataType` class, instead of having it only be a getter.

I've been exploring multithreading with Arrow in the browser to allow operations to be done off the main thread. Arrow is well-suited to multithreading because its underlying `ArrayBuffer`s are [transferable objects](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Transferable_objects), which can be moved instead of copied between threads.

I've gotten a prototype of Arrow with web workers working, but it required some monkey patching that should probably be upstreamed. In particular, class identity doesn't survive the move across threads; only an instance's own attributes are received. That means that if you send the equivalent of `structuredClone(new arrow.Utf8())`, the worker receives an empty object with no identifying characteristics.

The approach I've been incubating in [geoarrow-js](https://github.com/geoarrow/geoarrow-js) is:

- Implement ["hard cloning"](https://geoarrow.github.io/geoarrow-js/functions/worker.hardClone.html) to detach the `Data` or `Vector` of interest from a larger `ArrayBuffer` if it's shared. This ensures that transferring one `Data` doesn't neuter an entire `Table`.
- Call [`preparePostMessage`](https://geoarrow.github.io/geoarrow-js/functions/worker.preparePostMessage.html). This takes in a `Vector` or `Data` instance, clones it if necessary, monkey patches the `type` attribute, and returns a list of the `ArrayBuffer`s backing the desired object.
- Call [`rehydrateData`](https://geoarrow.github.io/geoarrow-js/functions/worker.rehydrateData.html) or [`rehydrateVector`](https://geoarrow.github.io/geoarrow-js/functions/worker.rehydrateVector.html) from the worker. This is necessary to convert what's received back into a class with a prototype and methods.

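
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the geoarrow-js or Arrow JS implementation: `Utf8Type`, `BoolType`, `tagType`, `rehydrateType`, and `hardCloneBytes` are hypothetical stand-ins, the tag values are made up, and `structuredClone` is used in place of a real worker because it applies the same cloning algorithm as `postMessage`. It assumes a runtime that provides `structuredClone` with the `transfer` option (a modern browser or a recent Node.js).

```typescript
// Hypothetical stand-ins for Arrow's DataType classes (not the apache-arrow API).
// The tag values are illustrative and are not Arrow's real type ids.
const UTF8 = 1;
const BOOL = 2;

class Utf8Type {
  // A prototype getter: it is not an own property, so cloning drops it.
  get typeId(): number {
    return UTF8;
  }
}

class BoolType {
  get typeId(): number {
    return BOOL;
  }
}

type AnyType = Utf8Type | BoolType;

interface TaggedType {
  __type: number;
}

// Sender side: copy the getter's value onto an own property before posting.
function tagType(type: AnyType): TaggedType {
  return { __type: type.typeId };
}

// Receiver side: switch on the tag to rebuild an instance with a prototype.
function rehydrateType(raw: TaggedType): AnyType {
  switch (raw.__type) {
    case UTF8:
      return new Utf8Type();
    case BOOL:
      return new BoolType();
    default:
      throw new Error(`unknown type tag: ${raw.__type}`);
  }
}

// Hypothetical "hard clone": copy a view into a buffer it does not share.
function hardCloneBytes(view: Uint8Array): Uint8Array {
  return view.slice();
}

// An untagged instance versus a tagged one, each sent through the clone step.
const naive = structuredClone(new Utf8Type()) as object;
const rebuilt = rehydrateType(structuredClone(tagType(new Utf8Type())));

// Transferring the copy detaches only its own buffer, not the shared one.
const shared = new ArrayBuffer(16);
const copy = hardCloneBytes(new Uint8Array(shared, 4, 4));
const sent = structuredClone(copy, { transfer: [copy.buffer as ArrayBuffer] });

console.log(naive instanceof Utf8Type, "typeId" in naive);
console.log(rebuilt instanceof Utf8Type, rebuilt.typeId === UTF8);
console.log(shared.byteLength, copy.buffer.byteLength, sent.byteLength);
```

In this sketch the untagged clone arrives as a plain object with no `typeId`, while the tagged one can be rebuilt into a class instance. If `typeId` were an own attribute of `DataType`, the separate tagging step would presumably be unnecessary.
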
The key to rehydration, though, is that it requires being able to take the serialized form and know how to convert it back into a class. As mentioned above, this isn't currently possible with `DataType` because there's no identifying attribute. Right now, I recursively set [`__type == this.typeId`](https://github.com/geoarrow/geoarrow-js/blob/cbfac73d1b8ff022fec73f2b062f26c77f3f9cc9/src/worker/transferable.ts#L79-L90) in `preparePostMessage`, and then [switch on that](https://github.com/geoarrow/geoarrow-js/blob/cbfac73d1b8ff022fec73f2b062f26c77f3f9cc9/src/worker/rehydrate.ts#L41-L113) from the worker to reconstruct the `DataType`.

So, questions:

- Can we set `typeId` as an attribute instead of a getter?
- Are there any other utilities that should be upstreamed?

cc @domoritz @trxcllnt

### Component(s)

JavaScript

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
