Wow, per-source config is really useful. I've needed this feature for a
while and didn't know it already existed.

Kostya

On Fri, Jan 6, 2017 at 5:06 PM, 'Alex Levenson' via Scalding Development <
[email protected]> wrote:

> I think you can set this per-source as well (instead of for all sources)
> by overriding `tapConfig` here:
> https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/HfsConfPropertySetter.scala#L55
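>
> Untested sketch of what that could look like for the Avro source from the
> original mail (this assumes `tapConfig` returns a scalding Config, per the
> linked file, and that PackedAvroSource accepts the mixin; names are
> illustrative):
>
>   import com.twitter.scalding._
>   import com.twitter.scalding.avro.PackedAvroSource
>
>   // Mix the trait into just this source, so only its small input
>   // files get combined into fewer splits.
>   val identifiers =
>     new PackedAvroSource[Identifiers](args("identifiers"))
>         with HfsConfPropertySetter {
>       override def tapConfig: Config =
>         Config.empty + ("cascading.hadoop.hfs.combine.files" -> "true")
>     }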
>
> On Fri, Jan 6, 2017 at 4:58 PM, 'Oscar Boykin' via Scalding Development <
> [email protected]> wrote:
>
>> You want to set this config:
>>
>> http://docs.cascading.org/cascading/2.2/javadoc/constant-values.html#cascading.tap.hadoop.HfsProps.COMBINE_INPUT_FILES
>>
>> "cascading.hadoop.hfs.combine.files" -> true
>>
>> which you can do in the job:
>>
>> override def config =
>>   super.config + ("cascading.hadoop.hfs.combine.files" -> "true")
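>>
>> In full, that's just an override on the Job (a sketch; the job and
>> source names here are placeholders):
>>
>>   import com.twitter.scalding._
>>
>>   class MyJob(args: Args) extends Job(args) {
>>     // Applies to every Hfs source in this job. Values must be
>>     // AnyRef, hence the string "true" rather than a bare Boolean.
>>     override def config: Map[AnyRef, AnyRef] =
>>       super.config + ("cascading.hadoop.hfs.combine.files" -> "true")
>>
>>     TypedPipe.from(TextLine(args("input")))
>>       .write(TypedTsv[String](args("output")))
>>   }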
>>
>> or with a -Dcascading.hadoop.hfs.combine.files=true option to hadoop.
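>>
>> e.g., if you launch through scalding's Tool, the -D flag goes before the
>> job class name so Hadoop's generic option parser picks it up (jar and
>> class names here are placeholders):
>>
>>   hadoop jar my-job.jar com.twitter.scalding.Tool \
>>     -Dcascading.hadoop.hfs.combine.files=true \
>>     com.example.MyJob --hdfs --identifiers ...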
>>
>> That should work. Let us know if it does not.
>>
>> On Fri, Jan 6, 2017 at 12:52 PM Nikhil J Joshi <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> I recently converted a Pig script to an equivalent Scalding job. When
>>> running the Pig script on an input consisting of many small files, I can
>>> see the inputs being combined, per these logs:
>>>
>>>
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1000   06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1000   06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 77   06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic
>>> INFO - 2017-01-06 22:37:58,517 org.apache.hadoop.mapreduce.JobSubmitter - number of splits:77
>>>
>>> However, the Scalding job doesn't seem to combine the inputs: it runs
>>> 1000 mappers, one per input file, which causes bad performance. Is there
>>> something wrong with the way I am executing the Scalding job?
>>>
>>> The part of the script responsible for the step above is
>>>
>>> private val ids: TypedPipe[Int] = TypedPipe
>>>   .from(PackedAvroSource[Identifiers](args("identifiers")))
>>>   .map {
>>>     featureNamePrefix match {
>>>       case "member" => _.getMemberId.toInt
>>>       case "item"   => _.getItemId.toInt
>>>     }
>>>   }
>>>
>>> Any help is greatly appreciated.
>>> Thanks,
>>> Nikhil
>>>
>
>
>
> --
> Alex Levenson
> @THISWILLWORK
>



-- 
Konstantin                              mailto:[email protected]
