[ 
https://issues.apache.org/jira/browse/SPARK-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257004#comment-15257004
 ] 

Joseph K. Bradley commented on SPARK-14891:
-------------------------------------------

For most use cases, Int should be used to save on memory.  Supporting String in 
the future would be nice but would require internal indexing.  I'd say we 
should validate the input for now and require Int types.  Users who need Long 
can use the ALS.train API.

+1 for better docs & data validation.  For data validation, it could be nice to 
accept Long and other types, as long as the values are checked before they are 
cast to Int.
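
To make the "check before casting" idea concrete, here is a rough sketch in plain Python (no Spark dependency). The helper name `checked_int_id` and its error messages are made up for illustration and are not part of any Spark API; it only shows the kind of range and integrality check that could run before an ID is narrowed to a 32-bit Int.

```python
# Hypothetical helper, not part of Spark: validate an ID before narrowing it
# to a 32-bit signed Int, instead of casting silently.
INT_MIN = -2**31
INT_MAX = 2**31 - 1


def checked_int_id(value, column="id"):
    """Return `value` as an int if it is integral and fits in a 32-bit Int."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError(f"{column}: expected a numeric ID, got bool")
    if isinstance(value, float):
        # is_integer() is False for NaN, infinities, and fractional values.
        if not value.is_integer():
            raise ValueError(f"{column}: {value!r} is not an integer value")
        value = int(value)
    elif not isinstance(value, int):
        raise TypeError(f"{column}: unsupported type {type(value).__name__}")
    if not INT_MIN <= value <= INT_MAX:
        raise ValueError(f"{column}: {value} is outside the Int range")
    return value


if __name__ == "__main__":
    print(checked_int_id(42, "user"))    # integral and in range
    print(checked_int_id(7.0, "item"))   # integral float is accepted
    try:
        checked_int_id(2**31, "user")    # one past the largest Int
    except ValueError as err:
        print("rejected:", err)
```

The point of the sketch is only that Long (or Double) inputs can be accepted when every value is integral and in range, and rejected with a clear error otherwise.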

> ALS in ML never validates input schema
> --------------------------------------
>
>                 Key: SPARK-14891
>                 URL: https://issues.apache.org/jira/browse/SPARK-14891
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Nick Pentreath
>
> Currently, {{ALS.fit}} never validates the input schema. There is a 
> {{transformSchema}} impl that calls {{validateAndTransformSchema}}, but it is 
> never called in either {{ALS.fit}} or {{ALSModel.transform}}.
> This was highlighted in SPARK-13857 (and failing PySpark tests 
> [here|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56849/consoleFull]) 
> when adding a call to {{transformSchema}} in {{ALSModel.transform}} that 
> actually validates the input schema. The PySpark docstring tests result in 
> Long inputs by default, which fail validation as Int is required.
> Currently, the inputs for user and item ids are cast to Int, with no input 
> type validation (or warning message). So users could pass in Long, Float, 
> Double, etc. It's also not made clear anywhere in the docs that only Int 
> types for user and item are supported.
> Enforcing validation seems the best option, but it might break user code that 
> previously "just worked", especially in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

