Github user mn-mikke commented on the issue:
https://github.com/apache/spark/pull/20858
@maropu What other libraries do you mean? I'm not aware of any library
providing this functionality on top of Spark SQL.
When using Spark SQL as an ETL tool for structured and nested data, people
are forced to use UDFs for transforming arrays, since the current API for
array columns is lacking. This approach has several drawbacks:
- poor code readability
- Catalyst cannot see into UDFs when performing optimizations
- inability to track the data lineage of the transformation (a key aspect
in the financial industry, see [Spline](https://absaoss.github.io/spline/) and the
[Spline paper](https://github.com/AbsaOSS/spline/releases/download/release%2F0.2.7/Spline_paper_IEEE_2018.pdf))
So my colleagues and I decided to extend the current Spark SQL API with
well-known collection functions such as concat, flatten, and zipWithIndex. We
don't want to keep this functionality only in our fork of Spark, but would like
to share it with others.