[
https://issues.apache.org/jira/browse/SPARK-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637871#comment-14637871
]
Ted Malaska commented on SPARK-9237:
------------------------------------
It is very common in banks to run data quality checks on data that comes in
from third parties.
> Added Top N Column Values for DataFrames
> ----------------------------------------
>
> Key: SPARK-9237
> URL: https://issues.apache.org/jira/browse/SPARK-9237
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Ted Malaska
> Priority: Minor
>
> This JIRA is to add a very common data quality check to DataFrames.
> A quick outline of this functionality can be seen in the following blog post:
> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
> There are two parts to this JIRA:
> 1. How to implement the Top N count. I will start with the
> implementation in the blog.
> 2. Where to add the function: either directly on DataFrame, in DataFrame
> describe, or in DataFrameStatFunctions. I will start by putting it into
> DataFrameStatFunctions.
> Please let me know if you have any input.
> Thanks
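For readers unfamiliar with the check described above, here is a minimal sketch of what a "Top N column values" count computes. It is plain Python over in-memory rows, not the Spark implementation from the blog post and not the API proposed in this issue; the function name `top_n_column_values` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter


def top_n_column_values(rows, columns, n):
    """Return {column: [(value, count), ...]} holding the n most frequent
    values of each requested column. Hypothetical helper, for illustration."""
    # One frequency counter per column being profiled.
    counters = {col: Counter() for col in columns}
    for row in rows:
        for col in columns:
            # Missing keys are counted as None so that gaps show up in the profile.
            counters[col][row.get(col)] += 1
    # most_common(n) sorts by descending count and keeps the first n entries.
    return {col: counters[col].most_common(n) for col in columns}


# Made-up sample data standing in for a third-party feed.
rows = [
    {"country": "US", "status": "ok"},
    {"country": "US", "status": "ok"},
    {"country": "UK", "status": None},
    {"country": "US", "status": "bad"},
]

result = top_n_column_values(rows, ["country", "status"], 2)
for column, top_values in result.items():
    print(column, top_values)
```

In a data quality setting, a profile like this makes unexpected dominant values or a large share of missing values easy to spot before the data is used downstream.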
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)