Github user mn-mikke commented on the issue:
https://github.com/apache/spark/pull/20858
@maropu What other libraries do you mean? I'm not aware of any library
providing this functionality on top of Spark SQL.
When using Spark SQL as an ETL tool for structured and nested data, people
are forced to use UDFs for transforming arrays, since the current API for
array columns is lacking. This approach has several drawbacks:
- poor code readability
- Catalyst cannot see into UDFs when performing optimizations
- inability to track the data lineage of the transformation (a key aspect
in the financial industry, see [Spline](https://absaoss.github.io/spline/) and the
[Spline paper](https://github.com/AbsaOSS/spline/releases/download/release%2F0.2.7/Spline_paper_IEEE_2018.pdf))
So my colleagues and I decided to extend the current Spark SQL API with
well-known collection functions such as concat, flatten, and zipWithIndex. We
don't want to keep this functionality only in our fork of Spark, but would like
to share it with others.