Re: Small files not combined in mapper

'Alex Levenson' via Scalding Development Tue, 10 Jan 2017 14:21:43 -0800

If you look at how HfsConfPropertySetter is implemented, you just need to
use a Tap that overrides sourceConfInit and adds some things to the
(mutable) config object there. So you can do that pretty easily yourself
w/o using HfsConfPropertySetter if you need to.
The important bit is getting your settings into the configuration via the
sourceConfInit method of the Tap.


Or just set it globally if that works for you -- all this is to keep the
configs separated for each source.
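
For anyone on an older release, a minimal sketch of that idea might look like the following. It assumes the Cascading 2.x signatures, where sourceConfInit receives a JobConf. The class name CombiningHfs is made up for illustration and is not part of Scalding or Cascading.

```scala
import cascading.flow.FlowProcess
import cascading.scheme.Scheme
import cascading.tap.SinkMode
import cascading.tap.hadoop.Hfs
import org.apache.hadoop.mapred.{ JobConf, OutputCollector, RecordReader }

// Hypothetical helper: an Hfs tap that switches on input combining for
// this one source only. Assumes Cascading 2.x (JobConf-based signatures).
class CombiningHfs(
  scheme: Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _],
  path: String,
  sinkMode: SinkMode) extends Hfs(scheme, path, sinkMode) {

  override def sourceConfInit(process: FlowProcess[JobConf], conf: JobConf): Unit = {
    // Set the property before delegating, so that Hfs sees it while it
    // configures the input format for this source.
    conf.setBoolean("cascading.hadoop.hfs.combine.files", true)
    super.sourceConfInit(process, conf)
  }
}
```

How you hand such a tap to your source depends on the Scalding version, so that wiring is deliberately left out here.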

On Tue, Jan 10, 2017 at 2:11 PM, Nikhil J Joshi <[email protected]> wrote:

> Hi Alex,
>
> Thanks for the explanation. I realized that we are still on 0.13 with
> Scala 2.10, and some of these things were not introduced until 0.16. I will
> need to figure out a workaround for this issue.
>
> Thanks,
> Nikhil
>
> On Tue, Jan 10, 2017 at 1:05 PM Alex Levenson <[email protected]>
> wrote:
>
>> If PackedAvroSource extends FileSource (which extends HfsTapProvider) --
>> or if it just extends HfsTapProvider on its own, then you can just do
>> something like:
>>
>> new PackedAvroSource[Identifiers](args("identifiers")) with
>> HfsConfPropertySetter {
>>   override def tapConfig = Config(Map("foo" -> "bar"))
>> }
>>
>> Does that make sense?
>>
>> On Tue, Jan 10, 2017 at 9:49 AM, Nikhil J Joshi <[email protected]>
>> wrote:
>>
>> Hi Alex,
>>
>> I am trying the `HfsConfPropertySetter` approach, but I couldn't find an
>> example showing how to implement it correctly. Could you share some more
>> details on this? Example code would be great.
>>
>> Thanks again,
>> Nikhil
>>
>> On Fri, Jan 6, 2017 at 6:23 PM Nikhil J Joshi <[email protected]>
>> wrote:
>>
>> Thanks Oscar and Alex. I will follow up and update you on these
>> incredible ideas.
>> Have a great weekend,
>> Nikhil
>>
>> On Fri, Jan 6, 2017 at 6:12 PM Alex Levenson <[email protected]>
>> wrote:
>>
>> Yeah per-source config is done via Tap.sourceConfInit and
>> Tap.sinkConfInit -- so these custom settings will only apply after one of
>> those methods is called.
>>
>> So it can't be used to control things that happen before then, eg, the
>> heap size of your mappers or things like that.
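
A toy model of that layering, for illustration only. The types below are stand-ins written for this sketch, not Cascading or Scalding classes, and it assumes each source is handed its own copy of the job-level configuration to modify.

```scala
// Toy model only: stand-in types, not Cascading or Scalding classes.
object PerSourceConfigModel {
  type Conf = Map[String, String]

  val CombineKey = "cascading.hadoop.hfs.combine.files"

  // Stand-in for a source whose conf-init hook layers its own settings
  // on top of a copy of the job-level configuration.
  final case class SourceModel(name: String, overrides: Conf) {
    def sourceConfInit(jobConf: Conf): Conf = jobConf ++ overrides
  }

  // Job-level settings exist before any source hook runs.
  val jobConf: Conf = Map(CombineKey -> "false")

  val identifiers = SourceModel("identifiers", Map(CombineKey -> "true"))
  val other = SourceModel("other", Map.empty)

  // Each source sees its own view; the job-level map is never modified.
  val identifiersConf: Conf = identifiers.sourceConfInit(jobConf)
  val otherConf: Conf = other.sourceConfInit(jobConf)
}
```

In this model only the "identifiers" source ends up with combining enabled. Anything that reads the job-level map before the hook runs never sees the override, which is why settings such as mapper heap size cannot be controlled this way.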
>>
>> On Fri, Jan 6, 2017 at 6:00 PM, Kostya Salomatin <[email protected]>
>> wrote:
>>
>> Wow, per-source config is really useful. I've needed this feature for a
>> while and did not know it already existed.
>>
>> Kostya
>>
>> On Fri, Jan 6, 2017 at 5:06 PM, 'Alex Levenson' via Scalding Development
>> <[email protected]> wrote:
>>
>> I think you can set this per-source as well (instead of for all sources)
>> by overriding `tapConfig` here:
>> https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/HfsConfPropertySetter.scala#L55
>>
>> On Fri, Jan 6, 2017 at 4:58 PM, 'Oscar Boykin' via Scalding Development <
>> [email protected]> wrote:
>>
>> You want to set this config:
>>
>> http://docs.cascading.org/cascading/2.2/javadoc/constant-values.html#cascading.tap.hadoop.HfsProps.COMBINE_INPUT_FILES
>>
>> "cascading.hadoop.hfs.combine.files" -> true
>>
>> which you can do in the job:
>>
>> override def config: Map[AnyRef, AnyRef] =
>>   super.config + ("cascading.hadoop.hfs.combine.files" -> "true")
>>
>> or with a -Dcascading.hadoop.hfs.combine.files=true option to hadoop.
>>
>> That should work. Let us know if it does not.
>>
>> On Fri, Jan 6, 2017 at 12:52 PM Nikhil J Joshi <[email protected]>
>> wrote:
>>
>> Hi,
>>
>>
>> I recently converted a Pig script to an equivalent Scalding job. When
>> running the Pig script on input consisting of many small files, I see that
>> the inputs are combined, as shown in the logs here:
>>
>>
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1000
>> 06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1000
>> 06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 77
>> 06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic INFO - 2017-01-06 22:37:58,517 org.apache.hadoop.mapreduce.JobSubmitter - number of splits:77
>>
>> However, the Scalding job doesn't seem to combine the inputs and runs 1000
>> mappers, one per input file, which is causing bad performance. Is there
>> something wrong with the way I am executing the Scalding job?
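
To make the effect of combining concrete, here is a rough sketch of how packing small files into size-bounded splits reduces the mapper count. It is a simplified greedy model, not Hadoop's actual CombineFileInputFormat logic (which also takes node and rack locality into account). The file sizes and split limit are invented for illustration and do not try to reproduce the 77 splits in the log above.

```scala
// Simplified model: greedy, order-preserving packing of files into splits.
object SplitPackingModel {
  /** Packs file sizes (in bytes) into consecutive splits of at most maxSplitBytes each. */
  def pack(fileSizes: Seq[Long], maxSplitBytes: Long): Vector[Vector[Long]] =
    fileSizes.foldLeft(Vector.empty[Vector[Long]]) { (splits, size) =>
      splits.lastOption match {
        // The file still fits into the current split: append it there.
        case Some(last) if last.sum + size <= maxSplitBytes =>
          splits.init :+ (last :+ size)
        // Otherwise start a new split with this file.
        case _ =>
          splits :+ Vector(size)
      }
    }

  // Hypothetical input: 1000 files of 10 MB each, 128 MB maximum split size.
  val mb: Long = 1024L * 1024L
  val files: Seq[Long] = Seq.fill(1000)(10 * mb)
  val maxSplit: Long = 128 * mb

  val uncombinedSplits: Int = files.size                // one split (mapper) per file
  val combinedSplits: Int = pack(files, maxSplit).size  // far fewer after packing
}
```

With these made-up numbers, twelve files fit into each split, so the model needs 84 splits instead of 1000.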
>>
>> The part of the script responsible for the step above is
>>
>> private val ids: TypedPipe[Int] = TypedPipe
>>     .from(PackedAvroSource[Identifiers](args("identifiers")))
>>     .map{ featureNamePrefix match {
>>       case "member" => _.getMemberId.toInt
>>       case "item" => _.getItemId.toInt
>>     }}
>>
>> Any help is greatly appreciated.
>> Thanks,
>> Nikhil
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Scalding Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> Alex Levenson
>> @THISWILLWORK
>>
>> --
>> Konstantin                              mailto:[email protected]
>>
>> --
>>
>> Nikhil J Joshi
>> Senior Applied Researcher - Machine Learning, Data Science
>> LinkedIn Corp.
>>



-- 
Alex Levenson
@THISWILLWORK

