kylebarron opened a new issue, #39017:
URL: https://github.com/apache/arrow/issues/39017

   ### Describe the enhancement requested
   
   tl;dr, the minimal changes required to postMessage Arrow JS objects without 
monkey-patching them is to assign `typeId` as an attribute onto the `DataType` 
class, instead of having it only be a getter.
   
   I've been exploring multi threading with Arrow in the browser to allow 
operations to be done off the main thread. Arrow is well-suited to 
multithreading because its underlying `ArrayBuffer`s are [transferable 
objects](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Transferable_objects),
 which can be moved instead of copied between threads.
   
   I've gotten a prototype working of using Arrow with web workers, but it 
required some monkey patching that should probably be upstreamed. In 
particular, classes can't be moved across threads; only their attributes are 
received. That means that if you send a `structuredClone(new arrow.Utf8())`, 
the worker receives an empty object with no identifying characteristics. 
   
   My approach I've been incubating in 
[geoarrow-js](https://github.com/geoarrow/geoarrow-js) is:
   
   - Implement ["hard 
cloning"](https://geoarrow.github.io/geoarrow-js/functions/worker.hardClone.html)
 to detach the `Data` or `Vector` of interest from a larger `ArrayBuffer` if 
it's shared. This ensures that transferring one `Data` doesn't neuter an entire 
`Table`
   - Call 
[`preparePostMessage`](https://geoarrow.github.io/geoarrow-js/functions/worker.preparePostMessage.html).
 This takes in a `Vector` or `Data` instance, clones it if necessary, monkey 
patches the `type` attribute, and returns a list of ArrayBuffers that are 
backing the desired object.
   - Call 
[`rehydrateData`](https://geoarrow.github.io/geoarrow-js/functions/worker.rehydrateData.html)
 or 
[`rehydrateVector`](https://geoarrow.github.io/geoarrow-js/functions/worker.rehydrateVector.html)
 from the worker. This is necessary to convert what's received back into a 
class with a prototype and methods.
   
   The key about the rehydration, though, is that it requires being able to 
take the serialized form and know how to convert it back into a class. As 
mentioned above, this isn't currently possible with `DataType` because there's 
no identifying attribute. Right now, I recursively set [`__type == 
this.typeId`](https://github.com/geoarrow/geoarrow-js/blob/cbfac73d1b8ff022fec73f2b062f26c77f3f9cc9/src/worker/transferable.ts#L79-L90)
 in `preparePostMessage`, and then [switch on 
that](https://github.com/geoarrow/geoarrow-js/blob/cbfac73d1b8ff022fec73f2b062f26c77f3f9cc9/src/worker/rehydrate.ts#L41-L113)
 from the worker to reconstruct the `DataType`.
   
   So questions:
   
   - Can we set `typeId` as an attribute instead of a getter?
   - Are there any other utilities that should be upstreamed?
   
   cc @domoritz @trxcllnt 
   
   ### Component(s)
   
   JavaScript


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to