shuhaowu opened a new pull request, #34939:
URL: https://github.com/apache/arrow/pull/34939

   ### Rationale for this change
   
   Certain codebases that previously uses row-oriented way to access data may 
wish to migrate to Arrow to save serialization and deserialization cost, and to 
be able to gain access to fast column-oriented operations. As it stands, Arrow 
is sort of a drop-in replacement to row-oriented data such as a JavaScript 
Array of objects. This is great to incrementally migrate legacy codebases to 
Arrow, as it is frequently infeasible to rewrite the application to use the 
column-oriented data access patterns. For most data, 
JavaScript-object-compatible and row-oriented access is already provided via 
the `StructRowProxy`. However, if the structs themselves include a `Vector`, 
existing code will break as it assumes the `Vector` object to behave like a 
JavaScript array, which it does not due to the lack of index access. An example 
of such a data structure is as follows:
   
   ```
   [
     {x: 1, y: [1, 2]},
     {x: 2, y: [2, 3]},
   ]
   ```
   
   In this case, with the Arrow JS library as it is, the API consumer is unable 
to get individual element of the `y` array via `table[i].y[j]`. Instead, the 
API consumer must use the API `table.get(i).y.get(j)`. In the situation where 
we are migrating a legacy code base to Arrow, this requires a large refactor of 
the entire codebase, which is infeasible in a short time. This negates the 
advantage of using Arrow as a drop-in replacement and prevents incremental 
migration of code to Arrow.
   
   ### What changes are included in this PR?
   
   To address this problem, this patch adds a Proxy at the root of the 
prototype chain for the `Table`, `RecordBatch`, and `Vector` objects and allow 
index access for these objects for backward compatibility purposes. Basically, 
objects like `Vector` now supports `vector[i]` in addition to `vector.get(i)`.
   
   However, code should not be using `vector[i]` as it is ~1.5 orders of 
magnitude slower than `vector.get(i)` as ES6 Proxy objects are quite slow. This 
should only be used to provide compatibility for legacy codebases. For code 
that desires high performance, `.get` remains a much better solution. This is 
also why the Proxy object is added to the root of the prototype chain, as 
opposed to the usual pattern where a Proxy object is returned from a 
constructor.
   
   Documentation has been added to compare the performance of the various 
access.
   
   ### Are there any user-facing changes?
   
   `Table`, `RecordBatch`, and `Vector` elements can now be accessed via index 
operators, albeit much slower than `.get`. All changes should be backward 
compatible.
   
   ### Are these changes tested?
   
   Yes. The performance of the base objects does not seem to be affected. To 
establish the performance change, we first see how much variability in the 
`ops/s` there is between two runs of `yarn perf`. The left plot below shows the 
percent difference between the two runs. This shows a baseline level of 
"variability" (although may not be statistically significant, but gives an 
indication) between the runs. We see most benchmarks are within 20% of each 
other, although a few tests at the bottom show higher variability. The right 
plot shows the difference in `ops/s` before and after the patch. No benchmark 
showed significantly higher or lower performance and looks superficially 
similar to the variability comparison. As such, we can say that the performance 
of the objects are not impacted by this patch.
   
   
![image](https://user-images.githubusercontent.com/338100/230441327-6162a318-362a-4c18-bb81-68bc9a32e987.png)
   
   The following shows the performance of `vector.get(i)` vs `vector[i]`. The 
raw results shows the index access to have about 50 ops/s while `.get` are 
around 750 ops/s. This is about 1.5 orders of magnitude difference in 
performance. Basically, this shouldn't be used unless it is used for backward 
compatibility with code that expects native JavaScript arrays.
   
   <img 
src=https://user-images.githubusercontent.com/338100/230442000-0d7bd599-47bc-4268-a8fe-a9f858c2d319.png
 width=600 />
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to