[ 
https://issues.apache.org/jira/browse/MADLIB-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584292#comment-16584292
 ] 

Frank McQuillan edited comment on MADLIB-1270 at 8/17/18 7:11 PM:
------------------------------------------------------------------

Final implementation is a slight variation on above:

* Use the max array length of rows in `vector_col` to determine the number of 
columns
* It means some of the generated rows in the output table may have NULLs if the 
array for a particular row is smaller than the max.
* If the optional `feature_names` parameter is used, create the number of 
columns based on size of this array.  Pad/truncate the data to match.

Re-run failed examples from above:

{code}
DROP TABLE IF EXISTS test;
CREATE TABLE test (id int, t text[]);
INSERT INTO test VALUES
(1, ARRAY['a','b']),
(2, ARRAY['c','d']),
(3, ARRAY['e','f']),
(4, ARRAY['g','h', 'i']),
(5, ARRAY['j']);
{code}

1) Set number of columns to 3:

{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       array['c1','c2','c3'],
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | c1 | c2 | c3 
----+----+----+----
  1 | a  | b  | 
  2 | c  | d  | 
  3 | e  | f  | 
  4 | g  | h  | i
  5 | j  |    | 
(5 rows)
{code}

2) Set number of columns to 2:

{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       array['c1','c2'],
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | c1 | c2 
----+----+----
  1 | a  | b
  2 | c  | d
  3 | e  | f
  4 | g  | h
  5 | j  | 
(5 rows)
{code}

3)  Set number of columns to 1:

{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       array['c1'],
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | c1 
----+----
  1 | a
  2 | c
  3 | e
  4 | g
  5 | j
(5 rows)
{code}

4)  Default uses max array size:


{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       NULL,
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | f1 | f2 | f3 
----+----+----+----
  1 | a  | b  | 
  2 | c  | d  | 
  3 | e  | f  | 
  4 | g  | h  | i
  5 | j  |    | 
(5 rows)
{code}

LGTM 


 


was (Author: fmcquillan):
Final implementation is a slight variation on above:

* Use the max array length of rows in `vector_col` to determine the number of 
columns
* It means some of the generated rows in the output table may have NULLs if the 
array for a particular row is smaller than the max.
* If the optional `feature_names` parameter is used, create the number of 
columns based on size of this array.  Pad/truncate the data to match.

Re-run failed examples from above:

{code}
DROP TABLE IF EXISTS test;
CREATE TABLE test (id int, t text[]);
INSERT INTO test VALUES
(1, ARRAY['a','b']),
(2, ARRAY['c','d']),
(3, ARRAY['e','f']),
(4, ARRAY['g','h', 'i']),
(5, ARRAY['j']);
{code}

1) Set number of columns to 3:

{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       array['c1','c2','c3'],
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | c1 | c2 | c3 
----+----+----+----
  1 | a  | b  | 
  2 | c  | d  | 
  3 | e  | f  | 
  4 | g  | h  | i
  5 | j  |    | 
(5 rows)
{code}

2) Set number of columns to 2:

{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       array['c1','c2'],
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | c1 | c2 
----+----+----
  1 | a  | b
  2 | c  | d
  3 | e  | f
  4 | g  | h
  5 | j  | 
(5 rows)
{code}

3)  Set number of columns to 1:

{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       array['c1'],
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | c1 
----+----
  1 | a
  2 | c
  3 | e
  4 | g
  5 | j
(5 rows)
{code}

4)  Default


{code}
DROP TABLE IF EXISTS test_out_5;
select madlib.vec2cols('test',
                       'test_out_5',
                       't',
                       NULL,
                       'id');
SELECT * FROM test_out_5 ORDER BY id;
{code}
produces
{code}
 id | f1 | f2 | f3 
----+----+----+----
  1 | a  | b  | 
  2 | c  | d  | 
  3 | e  | f  | 
  4 | g  | h  | i
  5 | j  |    | 
(5 rows)
{code}

LGTM 


 

> Unexepcted behavior in vec2cols function
> ----------------------------------------
>
>                 Key: MADLIB-1270
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1270
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Rashmi Raghu
>            Priority: Minor
>             Fix For: v1.15.1
>
>
> There is some unexpected behavior when vector column to be split contains 
> different numbers of elements in the vectors. E.g.
> Input table:
> select * from test order by id;
>  id | t 
>  ----+---------
>  1 | \{a,b}
>  2 | \{c,d}
>  3 | \{e,f}
>  4 | \{g,h,i}
>  5 | \{j}
>  (5 rows)
>  
> select madlib.vec2cols('test','test_out_5','t',array['c1','c2','c3'],'id');
>  ERROR: plpy.Error: vec2cols: Mismatch between size of vector_col and number 
> of cols in feature_names.
>  CONTEXT: Traceback (most recent call last):
>  PL/Python function "vec2cols", line 23, in <module>
>  return vec2cols_obj.vec2cols(**globals())
>  PL/Python function "vec2cols", line 149, in vec2cols
>  PL/Python function "vec2cols", line 112, in get_names_for_split_output_cols
>  PL/Python function "vec2cols", line 77, in _assert
>  PL/Python function "vec2cols"
>  
> select madlib.vec2cols('test','test_out_5','t',array['c1','c2'],'id');
>  vec2cols 
>  ----------
> (1 row)
> select * from test_out_5 order by id;
>  id | c1 | c2 
>  ----++--------
>  1 | a | b
>  2 | c | d
>  3 | e | f
>  4 | g | h
>  5 | j | 
>  (5 rows)
>  
>  
> select madlib.vec2cols('test','test_out_6','t',array['c1'],'id');
> ERROR: plpy.Error: vec2cols: Mismatch between size of vector_col and number 
> of cols in feature_names.
>  CONTEXT: Traceback (most recent call last):
>  PL/Python function "vec2cols", line 23, in <module>
>  return vec2cols_obj.vec2cols(**globals())
>  PL/Python function "vec2cols", line 149, in vec2cols
>  PL/Python function "vec2cols", line 112, in get_names_for_split_output_cols
>  PL/Python function "vec2cols", line 77, in _assert
>  PL/Python function "vec2cols"
>  
> --- Update-----
> There are a couple of decisions to be made regarding supporting arrays of 
> different lengths:
> -If we choose the array with maximal length in the vector_col, what do we do 
> if the user's passed-in feature_names does not have the same number of 
> elements?
> -What are the performance issues with looking through our vector_col for the 
> array with maximal length?
> -How will we handle default feature names: will we create a feature name for 
> every element of the longest array entry?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to