Wow, per-source config is really useful. I've needed this feature for a while; I didn't know it already existed.
Kostya

On Fri, Jan 6, 2017 at 5:06 PM, 'Alex Levenson' via Scalding Development <[email protected]> wrote:

> I think you can set this per-source as well (instead of for all sources)
> by overriding `tapConfig` here:
> https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/HfsConfPropertySetter.scala#L55
>
> On Fri, Jan 6, 2017 at 4:58 PM, 'Oscar Boykin' via Scalding Development <[email protected]> wrote:
>
>> You want to set this config:
>>
>> http://docs.cascading.org/cascading/2.2/javadoc/constant-values.html#cascading.tap.hadoop.HfsProps.COMBINE_INPUT_FILES
>>
>> "cascading.hadoop.hfs.combine.files" -> true
>>
>> which you can do in the job:
>>
>>   override def config =
>>     super.config + ("cascading.hadoop.hfs.combine.files" -> true)
>>
>> or with a
>>
>>   -Dcascading.hadoop.hfs.combine.files=true
>>
>> option to hadoop.
>>
>> That should work. Let us know if it does not.
>>
>> On Fri, Jan 6, 2017 at 12:52 PM Nikhil J Joshi <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I recently converted a Pig script to an equivalent Scalding job. When I
>>> run the Pig script on an input consisting of many small files, I can see
>>> the inputs being combined, per these logs:
>>>
>>>   org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1000
>>>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1000
>>>   org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 77
>>>   INFO - 2017-01-06 22:37:58,517 org.apache.hadoop.mapreduce.JobSubmitter - number of splits:77
>>>
>>> However, the Scalding job doesn't seem to combine the inputs: it runs
>>> 1000 mappers, one per input file, which causes bad performance. Is there
>>> something wrong with the way I am executing the Scalding job?
>>>
>>> The part of the script responsible for the step above is:
>>>
>>>   private val ids: TypedPipe[Int] = TypedPipe
>>>     .from(PackedAvroSource[Identifiers](args("identifiers")))
>>>     .map {
>>>       featureNamePrefix match {
>>>         case "member" => _.getMemberId.toInt
>>>         case "item"   => _.getItemId.toInt
>>>       }
>>>     }
>>>
>>> Any help is greatly appreciated.
>>> Thanks,
>>> Nikhil
>
> --
> Alex Levenson
> @THISWILLWORK
--
Konstantin
mailto:[email protected]
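
For reference, a minimal sketch of the per-source override Alex describes.
The source name and scheme choice here are hypothetical, and the sketch
assumes a Scalding version that ships HfsConfPropertySetter with an
overridable `tapConfig`, per the link above; verify against your release
before relying on it.

  import com.twitter.scalding._

  // Hypothetical source: stands in for whatever Hfs-backed FileSource
  // you are actually reading (swap TextLineScheme for your own scheme).
  class MySmallFilesSource(path: String)
      extends FixedPathSource(path)
      with TextLineScheme
      with HfsConfPropertySetter {

    // Applied to this source's tap only, not the whole job: turn on
    // Cascading's combined input format so many small files are merged
    // into fewer splits.
    override def tapConfig: Config =
      Config.empty + ("cascading.hadoop.hfs.combine.files" -> "true")
  }

Read it as usual, e.g. TypedPipe.from(new MySmallFilesSource(args("input"))).
Only this source gets the combine behavior, whereas the job-level
`override def config` approach above applies it to every input in the job.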
