my personal preference would be something like a Map[String, String] that only reflects the changes you want to make to the Configuration for the given input/output format (so system-wide defaults continue to come from sc.hadoopConfiguration), similar to what cascading/scalding did, but an arbitrary Configuration will work too.
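to sketch what i mean (withFormatSettings is just a hypothetical helper name, not existing spark api):

import org.apache.hadoop.conf.Configuration

// copy the context-wide defaults and layer the per-call overrides on
// top, so sc.hadoopConfiguration itself stays untouched
def withFormatSettings(base: Configuration, overrides: Map[String, String]): Configuration = {
  val merged = new Configuration(base)
  overrides.foreach { case (k, v) => merged.set(k, v) }
  merged
}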
i will make a jira and pullreq when i have some time.

On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> I see - if you look, in the saving functions we have the option for
> the user to pass an arbitrary Configuration.
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894
>
> It seems fine to have the same option for the loading functions, if
> it's easy to just pass this config into the input format.
>
> On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > the (compression) codec parameter that is now part of many saveAs...
> > methods came from a similar need. see SPARK-763
> >
> > hadoop has many options like this. you are either going to have to
> > allow many more of these optional arguments to all the methods that
> > read from hadoop inputformats and write to hadoop outputformats, or
> > you force people to re-create these methods using HadoopRDD, i think
> > (if that's even possible).
> >
> > On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >> i would like to use objectFile with some tweaks to the hadoop conf.
> >> currently there is no way to do that, except recreating objectFile
> >> myself. and some of the code objectFile uses i have no access to,
> >> since it's private to spark.
> >>
> >> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> >>> Yeah - to Nick's point, I think the way to do this is to pass in a
> >>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
> >>> field is there). Is there anything you can't do with that feature?
> >>>
> >>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
> >>> <nick.pentre...@gmail.com> wrote:
> >>> > Imran, on your point about reading multiple files together in a
> >>> > partition, is it not simpler to use the approach of copying the
> >>> > Hadoop conf and setting per-RDD settings for min split to control
> >>> > the input size per partition, together with something like
> >>> > CombineFileInputFormat?
> >>> >
> >>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <iras...@cloudera.com> wrote:
> >>> >> I think this would be a great addition; I totally agree that you
> >>> >> need to be able to set these at a finer context than just the
> >>> >> SparkContext.
> >>> >>
> >>> >> Just to play devil's advocate, though -- the alternative is for
> >>> >> you to just subclass HadoopRDD yourself, or make a totally new
> >>> >> RDD, and then you could expose whatever you need. Why is this
> >>> >> solution better? IMO the criteria are:
> >>> >> (a) common operations
> >>> >> (b) error-prone / difficult to implement
> >>> >> (c) non-obvious, but important for performance
> >>> >>
> >>> >> I think this case fits (a) & (c), so I think it's still
> >>> >> worthwhile. But it's also worth asking whether or not it's too
> >>> >> difficult for a user to extend HadoopRDD right now. There have
> >>> >> been several cases in the past week where we've suggested that a
> >>> >> user should read from hdfs themselves (e.g., to read multiple
> >>> >> files together in one partition) -- with*out* reusing the code in
> >>> >> HadoopRDD, though they would lose things like the metric tracking
> >>> >> & preferred locations you get from HadoopRDD. Does HadoopRDD need
> >>> >> some refactoring to make that easier to do? Or do we just need a
> >>> >> good example?
> >>> >>
> >>> >> Imran
> >>> >>
> >>> >> (sorry for hijacking your thread, Koert)
> >>> >>
> >>> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>> >> > see email below. reynold suggested i send it to dev instead of
> >>> >> > user
> >>> >> >
> >>> >> > ---------- Forwarded message ----------
> >>> >> > From: Koert Kuipers <ko...@tresata.com>
> >>> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
> >>> >> > Subject: hadoop input/output format advanced control
> >>> >> > To: "u...@spark.apache.org" <u...@spark.apache.org>
> >>> >> >
> >>> >> > currently it's pretty hard to control the Hadoop Input/Output
> >>> >> > formats used in Spark. The convention seems to be to add extra
> >>> >> > parameters to all methods and then somewhere deep inside the
> >>> >> > code (for example in PairRDDFunctions.saveAsHadoopFile) all
> >>> >> > these parameters get translated into settings on the Hadoop
> >>> >> > Configuration object.
> >>> >> >
> >>> >> > for example for compression i see "codec: Option[Class[_ <:
> >>> >> > CompressionCodec]] = None" added to a bunch of methods.
> >>> >> >
> >>> >> > how scalable is this solution really?
> >>> >> >
> >>> >> > for example i need to read from a hadoop dataset and i don't
> >>> >> > want the input (part) files to get split up. the way to do this
> >>> >> > is to set "mapred.min.split.size". now i don't want to set this
> >>> >> > at the level of the SparkContext (which can be done), since i
> >>> >> > don't want it to apply to input formats in general. i want it
> >>> >> > to apply to just this one specific input dataset i need to
> >>> >> > read. which leaves me with no options currently. i could go add
> >>> >> > yet another input parameter to all the methods
> >>> >> > (SparkContext.textFile, SparkContext.hadoopFile,
> >>> >> > SparkContext.objectFile, etc.). but that seems ineffective.
> >>> >> >
> >>> >> > why can we not expose a Map[String, String] or some other
> >>> >> > generic way to manipulate settings for hadoop input/output
> >>> >> > formats? it would require adding one more parameter to all
> >>> >> > methods that deal with hadoop input/output formats, but after
> >>> >> > that it's done. one parameter to rule them all....
> >>> >> >
> >>> >> > then i could do:
> >>> >> > val x = sc.textFile("/some/path", formatSettings =
> >>> >> >   Map("mapred.min.split.size" -> "12345"))
> >>> >> >
> >>> >> > or
> >>> >> > rdd.saveAsTextFile("/some/path", formatSettings =
> >>> >> >   Map("mapred.output.compress" -> "true",
> >>> >> >       "mapred.output.compression.codec" -> "somecodec"))
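ps. for reference, the per-RDD workaround patrick and nick describe above (pass a custom conf when you create a Hadoop RDD) looks roughly like this -- just a sketch, untested:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextInputFormat}

// clone the context-wide conf so the override applies to this input only
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.min.split.size", "12345")
val lines = sc
  .hadoopRDD(jobConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) => text.toString }

this should work, but you give up the convenience methods like textFile and objectFile, which is exactly the problem.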