Hi,
how about
DB.flatMap { info =>
    fields.map(fieldName => (fieldName, getKey(fieldName)))
  }
  .distinct
  .map { case (fieldName, distinctValue) => (fieldName, 1) }
  .group.sum
?
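Spelled out a bit more fully, as a sketch only: I'm assuming getKey(info, fieldName) returns the Set[String] of values for that field (its exact signature isn't shown in your snippet), and the TypedTsv path below is just a placeholder.

    import com.twitter.scalding._

    // Sketch only: Info comes from your code; getKey(info, fieldName) is my guess
    // at its signature, returning the Set[String] of values for that field.
    def distinctCountsPerField(DB: TypedPipe[Info],
                               fields: List[String]): TypedPipe[(String, Long)] =
      DB.flatMap { info =>
          // one (fieldName, value) pair per field and per value of that field
          fields.flatMap { fieldName =>
            getKey(info, fieldName).map(value => (fieldName, value))
          }
        }
        .distinct                                        // distinct (field, value) pairs
        .map { case (fieldName, _) => (fieldName, 1L) }
        .group
        .sum                                             // distinct count per field
        .toTypedPipe

    // One sink for all fields instead of one per field (needs the usual implicit
    // FlowDef and Mode in scope, as in your getDistinctCount):
    // distinctCountsPerField(DB, fields)
    //   .write(TypedTsv[(String, Long)]("hdfs://.../distinct_counts"))

That keeps everything in a single pipe with a single write, rather than one job and one sink per field.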
-- Cyrille
On 03/12/2017 at 07:44, [email protected] wrote:
Consider the following snippet of Scalding code:

val fields = List[String]("blue", "yellow", "red")

def getDistinctCount(DB: TypedPipe[Info])(implicit flowDef: cascading.flow.FlowDef, mode: com.twitter.scalding.Mode) = {
  val jsonValue: TypedPipe[String] = DB.map { info =>
    getKey("blue")  // Returns a set of values (String) for the input
  }
  val distinctCount: ValuePipe[Int] = jsonValue.distinct.map { x => 1 }.sum  // Returns the distinct count
  distinctCount.write(TypedTsv("HDFS Location"))  // Writes the value to an HDFS location
}
I have to calculate the distinct count for every field in the list. One way is to iterate through the list, calculate the distinct count for each field, write each result to a different HDFS location, and then merge them into a local file.
In reality I have to calculate this for at least twenty fields, which makes the process really slow.
Is there an optimal way to calculate the distinct count for every field and write the results to a single HDFS location in one go? Thank you!