Hi,
I recently converted a Pig script to an equivalent Scalding job. When I run the Pig script on an input consisting of many small files, the input paths get combined, as these logs show:
06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic
  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1000
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1000
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 77
INFO - 2017-01-06 22:37:58,517 org.apache.hadoop.mapreduce.JobSubmitter - number of splits:77
However, the Scalding job doesn't seem to combine the inputs and instead runs 1000 mappers, one per input file, which results in poor performance. Is there something wrong with the way I am executing the Scalding job?
The part of the script responsible for the step above is:

import com.twitter.scalding.TypedPipe
import com.twitter.scalding.avro.PackedAvroSource

// Read the Avro identifiers and project out the id we score on
// (member id or item id, depending on featureNamePrefix).
private val ids: TypedPipe[Int] = TypedPipe
  .from(PackedAvroSource[Identifiers](args("identifiers")))
  .map {
    featureNamePrefix match {
      case "member" => _.getMemberId.toInt
      case "item"   => _.getItemId.toInt
    }
  }
Any help is greatly appreciated.
Thanks,
Nikhil