Consider the following snippet of Scalding code:

      val fields = List[String]("blue", "yellow", "red")

      def getDistinctCount(DB: TypedPipe[Info])(
          implicit flowDef: cascading.flow.FlowDef,
          mode: com.twitter.scalding.Mode) = {

        // Returns a set of values (String) for the input
        val jsonValue: TypedPipe[String] = DB.map { info => getKey("blue") }

        // Returns the distinct count
        val distinctCount: ValuePipe[Int] = jsonValue.distinct.map { x => 1 }.sum

        // Writes the value to an HDFS location
        distinctCount.write(TypedTsv(("HDFS Location")))

      }

I have to calculate the distinct count for every field in the list. One way
is to iterate through the list, calculate the distinct count for each string,
write each result to a different HDFS location, and merge them into a local
file. In reality I have to do this for at least twenty fields, which makes
the process really slow.
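
To make the per-field approach concrete, here is a rough sketch using plain
Scala collections in place of TypedPipe. The Info case class and the
two-argument getKey are simplified, hypothetical stand-ins and not the real
code; the point is only that every field triggers its own pass over the data.

```scala
// Hypothetical stand-in for the real Info record: a field-name -> value map.
final case class Info(values: Map[String, String])

// Hypothetical version of getKey: looks up one field, if present.
def getKey(info: Info, field: String): Option[String] =
  info.values.get(field)

// One full pass over the data per field, mirroring the per-field loop.
def distinctCountPerField(db: Seq[Info], fields: List[String]): Map[String, Int] =
  fields.map { field =>
    val distinctValues = db.flatMap(info => getKey(info, field)).distinct
    field -> distinctValues.size
  }.toMap

val fields = List("blue", "yellow", "red")

// Small made-up sample, only to exercise the function.
val db = Seq(
  Info(Map("blue" -> "a", "yellow" -> "x")),
  Info(Map("blue" -> "b", "yellow" -> "x")),
  Info(Map("blue" -> "a", "red" -> "r"))
)

val counts = distinctCountPerField(db, fields)
```

With twenty fields this means twenty separate passes over the data and twenty
separate outputs to merge.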

Is there a more efficient way to calculate the distinct count for every field
and write the results to a single HDFS location in one go? Thank you!
