[jira] [Commented] (SPARK-9237) Added Top N Column Values for DataFrames

Ted Malaska (JIRA) Wed, 22 Jul 2015 15:50:35 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637871#comment-14637871
 ]


Ted Malaska commented on SPARK-9237:
------------------------------------

It is very common in banks to run data quality checks on data that comes in 
from third parties.

> Added Top N Column Values for DataFrames
> ----------------------------------------
>
>                 Key: SPARK-9237
>                 URL: https://issues.apache.org/jira/browse/SPARK-9237
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Ted Malaska
>            Priority: Minor
>
> This JIRA is to add a very common data quality check to DataFrames.
> A quick outline of this functionality can be seen in the following blog post:
> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
> There are two parts to this JIRA.
> 1. How to implement the Top N count.  I will start with the 
> implementation in the blog.
> 2. Where to add the function: either straight off DataFrame, in DataFrame 
> describe, or in DataFrameStatFunctions.  I will start by putting it into 
> DataFrameStatFunctions.
> Please let me know if you have any input.
> Thanks
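
For readers unfamiliar with the check, here is a rough sketch of what a "Top N column values" count computes. It is not the implementation from the blog post or from any Spark patch: it uses plain Python over a list of dicts instead of a DataFrame, and the function name top_n_column_values and the sample rows are hypothetical, chosen only for illustration.

```python
from collections import Counter


def top_n_column_values(rows, columns, n):
    """Return {column: [(value, count), ...]} holding the n most
    frequent values of each requested column.

    Hypothetical helper for illustration only; rows are plain dicts
    standing in for DataFrame rows.
    """
    counters = {col: Counter() for col in columns}
    for row in rows:
        for col in columns:
            # A missing key is counted as None, similar to a null cell.
            counters[col][row.get(col)] += 1
    return {col: counters[col].most_common(n) for col in columns}


# Made-up sample data, e.g. a feed received from a third party.
rows = [
    {"country": "US", "status": "ok"},
    {"country": "US", "status": "ok"},
    {"country": "CA", "status": "ok"},
    {"country": "US", "status": "error"},
    {"country": "CA", "status": "ok"},
    {"country": "MX", "status": "error"},
]

result = top_n_column_values(rows, ["country", "status"], 2)
for column, top_values in result.items():
    print(column, top_values)
```

In a data quality setting, the counts for each column can be compared against expectations, for example to spot an unexpected dominant value or a large share of nulls. On a real DataFrame the same idea would presumably be expressed as a group-by-and-count per column followed by a sort and a limit, but that is an assumption here, not a description of the proposed patch.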



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

