Re: Small files not combined in mapper

Nikhil J Joshi Tue, 10 Jan 2017 09:50:16 -0800

Hi Alex,

I am trying the `HfsConfPropertySetter` approach, but I couldn't find an
example showing how to implement it correctly. Could you share some more
details on this? Example code would be great.


Thanks again,
Nikhil

On Fri, Jan 6, 2017 at 6:23 PM Nikhil J Joshi <[email protected]> wrote:

> Thanks Oscar and Alex. I will follow up and update you on these incredible
> ideas.
> Have a great weekend,
> Nikhil
>
> On Fri, Jan 6, 2017 at 6:12 PM Alex Levenson <[email protected]>
> wrote:
>
> Yeah, per-source config is done via Tap.sourceConfInit and Tap.sinkConfInit,
> so these custom settings will only apply after one of those methods is
> called.
>
> So it can't be used to control things that happen before then, e.g., the
> heap size of your mappers or things like that.
>
> On Fri, Jan 6, 2017 at 6:00 PM, Kostya Salomatin <[email protected]>
> wrote:
>
> Wow, per source config is really useful. I've needed this feature for a
> while, did not know it already existed.
>
> Kostya
>
> On Fri, Jan 6, 2017 at 5:06 PM, 'Alex Levenson' via Scalding Development <
> [email protected]> wrote:
>
> I think you can set this per-source as well (instead of for all sources)
> by overriding `tapConfig` here:
> https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/HfsConfPropertySetter.scala#L55
>
> On Fri, Jan 6, 2017 at 4:58 PM, 'Oscar Boykin' via Scalding Development <
> [email protected]> wrote:
>
> You want to set this config:
>
>
> http://docs.cascading.org/cascading/2.2/javadoc/constant-values.html#cascading.tap.hadoop.HfsProps.COMBINE_INPUT_FILES
>
> "cascading.hadoop.hfs.combine.files" -> true
>
> which you can do in the job:
>
> override def config = super.config + ("cascading.hadoop.hfs.combine.files" -> true)
>
> or with a -Dcascading.hadoop.hfs.combine.files=true option to hadoop.
>
> That should work. Let us know if it does not.
>
> On Fri, Jan 6, 2017 at 12:52 PM Nikhil J Joshi <[email protected]>
> wrote:
>
> Hi,
>
>
> I recently converted a Pig script to an equivalent Scalding job. When running
> the Pig script on an input consisting of many small files, I see that the
> inputs are combined, as per the logs here:
>
>
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1000
> 06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1000
> 06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 77
> 06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic INFO - 2017-01-06 22:37:58,517 org.apache.hadoop.mapreduce.JobSubmitter - number of splits:77
>
> However, the Scalding job doesn't seem to combine the inputs and runs 1000
> mappers, one per input file, which is causing bad performance. Is there
> something wrong with the way I am executing the Scalding job?
>
> The part of the script responsible for the step above is
>
> private val ids: TypedPipe[Int] = TypedPipe
>     .from(PackedAvroSource[Identifiers](args("identifiers")))
>     .map{ featureNamePrefix match {
>       case "member" => _.getMemberId.toInt
>       case "item" => _.getItemId.toInt
>     }}
>
> Any help is greatly appreciated.
> Thanks,
> Nikhil
>
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
> --
> Alex Levenson
> @THISWILLWORK
>
> --
> Konstantin                              mailto:[email protected]
>
>
>
>
> --
> Alex Levenson
> @THISWILLWORK
>
> --
>
> Nikhil J Joshi
> Senior Applied Researcher - Machine Learning, Data Science
> LinkedIn Corp.
>
-- 

Nikhil J Joshi
Senior Applied Researcher - Machine Learning, Data Science
LinkedIn Corp.
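
For readers of the archive, here is a minimal sketch of the two approaches discussed in this thread. It is an illustration under stated assumptions, not a confirmed answer from the list. It assumes the `HfsConfPropertySetter` trait exposes an overridable `tapConfig: Config` member, as the link in Alex's reply suggests, and that `Identifiers` is the Avro-generated class from the original post (it is not defined here). The job names and the "output" argument are hypothetical. Newer Scalding releases may prefer a different member of the trait, so check the trait in the version you use.

```scala
import com.twitter.scalding._
import com.twitter.scalding.avro.PackedAvroSource

// NOTE: `Identifiers` is assumed to be the Avro-generated class from the
// original post; it must be on the classpath for this sketch to compile.

// Approach 1 (Oscar's suggestion): enable combining for the whole job.
class CombineAllInputsJob(args: Args) extends Job(args) {
  // The value is passed as the string "true" so that it fits the
  // Map[AnyRef, AnyRef] type of the job config.
  override def config: Map[AnyRef, AnyRef] =
    super.config + ("cascading.hadoop.hfs.combine.files" -> "true")

  TypedPipe
    .from(PackedAvroSource[Identifiers](args("identifiers")))
    .map(_.getMemberId.toInt)
    .write(TypedTsv[Int](args("output"))) // "output" is a hypothetical argument
}

// Approach 2 (Alex's suggestion): enable combining for one source only.
class CombineOneSourceJob(args: Args) extends Job(args) {
  // Mix the trait into the source and override `tapConfig` (assumed member).
  // The setting is applied when the tap's sourceConfInit is called.
  private val identifiers =
    new PackedAvroSource[Identifiers](Seq(args("identifiers")))
        with HfsConfPropertySetter {
      override def tapConfig: Config =
        Config.empty + ("cascading.hadoop.hfs.combine.files" -> "true")
    }

  TypedPipe
    .from(identifiers)
    .map(_.getMemberId.toInt)
    .write(TypedTsv[Int](args("output")))
}
```

As noted earlier in the thread, the per-source setting only takes effect once the tap's config is initialized, so it cannot control things decided before that point, such as mapper heap size. Whether it reduces the number of splits for a given input is best checked against the "number of splits" line in the job logs.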
