[jira] [Comment Edited] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values

yogesh garg (JIRA) Mon, 19 Mar 2018 12:05:34 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405174#comment-16405174
 ]


yogesh garg edited comment on SPARK-23690 at 3/19/18 7:04 PM:
--------------------------------------------------------------

In an offline discussion with [~mrbago], we settled on the following behavior 
for `handleInvalid`. We need the lengths of the vector columns that are 
involved in the assembly. Ideally this information is present in the 
`attributeGroup` of the column, but that might return `size == -1`, in which 
case we previously used `d.select.first` to infer the size of these columns. 
This can raise an exception in the corner case where the first row itself has 
null values. We are abandoning the idea of getting this information by 
finding a non-null row in each such column, because that approach has 
complicated logic, terrible run time (O(#columns) distributed queries), and 
weaker guarantees for data we might see in the future (even if we can infer 
the size right now, there's no guarantee we can do it later, leading to an 
unexpected error).
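
To make the lookup order above concrete, here is a minimal sketch in plain Python. It does not use Spark: `metadata` is a hypothetical dict standing in for the column's `attributeGroup`, `rows` is a list of dicts standing in for the dataset, and all function names are made up for illustration.

```python
from typing import Dict, List, Optional, Sequence

UNKNOWN_SIZE = -1  # mirrors the `size == -1` case described above

# Assumption: a row is modelled as {column name: vector or None}.
Row = Dict[str, Optional[List[float]]]


def size_from_metadata(metadata: Dict[str, int], column: str) -> int:
    """Return the recorded vector size, or UNKNOWN_SIZE if none is recorded."""
    return metadata.get(column, UNKNOWN_SIZE)


def size_from_first_row(rows: Sequence[Row], column: str) -> int:
    """Fallback: infer the size from the first row only."""
    if not rows:
        raise LookupError(f"no rows to infer the size of column {column!r}")
    cell = rows[0][column]
    if cell is None:
        # The corner case from the comment: the first row itself is null.
        raise ValueError(f"first row of column {column!r} is null")
    return len(cell)


def resolve_size(metadata: Dict[str, int], rows: Sequence[Row], column: str) -> int:
    """Prefer metadata; fall back to the first row only when the size is unknown."""
    size = size_from_metadata(metadata, column)
    if size != UNKNOWN_SIZE:
        return size
    return size_from_first_row(rows, column)
```

With metadata present, the data is never touched. Without it, the result depends entirely on what the first row happens to contain, which is the fragility described above.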

1. *Error*: Find the remaining lengths from `d.select.first`
  * if we get a NullPointerException while iterating over the cells for sizes, 
throw an (early) error
  * if we get a NoSuchElementException while looking for the first row, -give the 
rows 0 sizes and warn- throw an error about incomplete metadata

2. *Skip*: Find the remaining lengths from `d.drop.first`
  * if we get a NoSuchElementException, -warn- throw an error about incomplete 
metadata
  * Note that we can't get a NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer 
sizes from the data, because even if we get the information from the current 
dataset, a future cut of the data is not guaranteed to be inferable. Thus, 
throw an error encouraging the use of `VectorSizeHint`
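
The three modes above could be modelled roughly as follows. This is a plain-Python sketch, not the Spark implementation: rows are dicts, `missing` lists the columns whose size is absent from the metadata, and `IncompleteMetadataError` and `infer_missing_sizes` are hypothetical names. Dropping only rows that are null in the `missing` columns is also an assumption about what `d.drop.first` is meant to do.

```python
from typing import Dict, List, Optional, Sequence

# Assumption: a row is modelled as {column name: vector or None}.
Row = Dict[str, Optional[List[float]]]


class IncompleteMetadataError(Exception):
    """Hypothetical error: sizes are missing and cannot be inferred safely."""


def infer_missing_sizes(
    rows: Sequence[Row], missing: Sequence[str], handle_invalid: str
) -> Dict[str, int]:
    """Infer sizes for columns lacking metadata, following the three modes."""
    if not missing:
        return {}
    if handle_invalid == "keep":
        # Never infer from data in this mode; ask for explicit sizes instead.
        raise IncompleteMetadataError(
            f"sizes unknown for {list(missing)}; consider VectorSizeHint"
        )
    if handle_invalid == "skip":
        # Drop rows with nulls in the columns of interest, then take the first.
        candidates = [r for r in rows if all(r[c] is not None for c in missing)]
    elif handle_invalid == "error":
        candidates = list(rows)
    else:
        raise ValueError(f"unknown handleInvalid value: {handle_invalid!r}")
    if not candidates:
        # No usable first row: report incomplete metadata rather than warn.
        raise IncompleteMetadataError(f"no row to infer sizes of {list(missing)}")
    first = candidates[0]
    sizes: Dict[str, int] = {}
    for column in missing:
        cell = first[column]
        if cell is None:
            # Only reachable in "error" mode; fail early.
            raise ValueError(f"null in first row of column {column!r}")
        sizes[column] = len(cell)
    return sizes
```

For example, with `rows = [{"v": None}, {"v": [1.0, 2.0, 3.0]}]`, the "skip" mode infers a size of 3 for `v`, the "error" mode fails on the null first row, and the "keep" mode refuses to infer at all.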

Please share thoughts and feedback on this!

____
edit: In an offline talk with [~josephkb] we decided to throw errors instead of 
warning about any size inference failures.


> VectorAssembler should have handleInvalid to handle columns with null values
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-23690
>                 URL: https://issues.apache.org/jira/browse/SPARK-23690
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: yogesh garg
>            Priority: Major
>
> VectorAssembler accepts only numeric columns (and vector columns, presumably 
> of numerics) as input and returns the assembled vector. It currently throws 
> an error if it sees a null value in any column. This behavior also affects 
> `RFormula`, which uses VectorAssembler to assemble numeric columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
