[
https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405174#comment-16405174
]
yogesh garg edited comment on SPARK-23690 at 3/19/18 7:04 PM:
--------------------------------------------------------------
In an offline discussion with [~mrbago], we settled on the following behavior
for `handleInvalid`. We need the lengths of the vector columns that are
involved in the assembly. Ideally this information is present in the
`attributeGroup` of the column, but that might return `size == -1`, in which
case we previously used `d.select.first` to infer the sizes of these columns.
That can raise an exception in the corner case where the first row itself has
null values. We are abandoning the idea of getting this information by
finding a non-null row in each such column, because that approach has
complicated logic, terrible run time (O(#columns) distributed queries), and
weak guarantees for data we might see in the future (even if we can infer
the size right now, there's no guarantee we can do it later, leading to an
unexpected error).
1. *Error*: Find the remaining lengths from `d.select.first`
* if we get a NullPointerException while iterating over the cells for sizes,
throw an (early) error
* if we get a NoSuchElementException while looking for the first row, -give the
rows 0 sizes and warn- throw an error about incomplete metadata
2. *Skip*: Find the remaining lengths from `d.drop.first`
* if we get a NoSuchElementException, -warn- throw an error about incomplete
metadata
* Note that we can't get a NullPointerException in this case (yay!)
3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer
sizes from the data, because even if we get the information from the current
dataset, a future cut of the data is not guaranteed to be inferable. Thus,
throw an error encouraging `VectorSizeHint`
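To make the three cases concrete, here is a rough sketch of the decision logic in plain Python, with no Spark involved. Everything in it is hypothetical and for illustration only: `resolve_sizes`, `IncompleteMetadataError`, and the list-of-dicts stand-in for the dataset are not part of the actual patch, and "skip" is assumed to drop rows that have a null in any column whose size is still unknown.

```python
from typing import Dict, List, Optional, Sequence

UNKNOWN = -1  # stands in for the `size == -1` case (no size in the metadata)


class IncompleteMetadataError(Exception):
    """Hypothetical error: a size is missing and cannot be inferred."""


def resolve_sizes(
    metadata_sizes: Dict[str, int],
    rows: Sequence[Dict[str, Optional[List[float]]]],
    handle_invalid: str,
) -> Dict[str, int]:
    """Return a size for every vector column, or raise if that is impossible."""
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError(f"unknown handleInvalid value: {handle_invalid!r}")

    missing = [c for c, s in metadata_sizes.items() if s == UNKNOWN]
    if not missing:
        return dict(metadata_sizes)  # metadata is complete, no data access

    if handle_invalid == "keep":
        # Never infer from data here: a later cut of the data might not allow it.
        raise IncompleteMetadataError(
            f"no size metadata for {missing}; consider VectorSizeHint"
        )

    if handle_invalid == "skip":
        # Assumption: drop rows with a null in any still-unresolved column.
        candidates = [r for r in rows if all(r[c] is not None for c in missing)]
    else:  # "error": look at the first row as-is
        candidates = list(rows)

    if not candidates:
        # Empty result: raise instead of warning, per the edit note below.
        raise IncompleteMetadataError(f"no row available to infer {missing}")

    first = candidates[0]
    sizes = dict(metadata_sizes)
    for c in missing:
        if first[c] is None:  # only reachable in "error" mode
            raise ValueError(f"null value in column {c!r} while inferring size")
        sizes[c] = len(first[c])
    return sizes
```

With this shape, at most one pass over the data is needed per call, and the "keep" branch never touches the data at all.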
Please share thoughts and feedback on this!
____
edit: In an offline talk with [~josephkb] we decided to throw errors instead of
warning about any size inference failures.
> VectorAssembler should have handleInvalid to handle columns with null values
> ----------------------------------------------------------------------------
>
> Key: SPARK-23690
> URL: https://issues.apache.org/jira/browse/SPARK-23690
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Reporter: yogesh garg
> Priority: Major
>
> VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as
> an input and returns the assembled vector. It currently throws an error if it
> sees a null value in any column. This behavior also affects `RFormula` that
> uses VectorAssembler to assemble numeric columns.