Re: hadoop input/output format advanced control

2015-03-25 Thread Koert Kuipers
my personal preference would be something like a Map[String, String] that
only reflects the changes you want to make to the Configuration for the given
input/output format (so system wide defaults continue to come from
sc.hadoopConfiguration), similarly to what cascading/scalding did, but an
arbitrary Configuration will work too.

i will make a jira and pullreq when i have some time.
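
For illustration only, a minimal sketch of how such a map of deltas might be applied on top of sc.hadoopConfiguration; the withOverrides helper and the overrides name are hypothetical, not an existing Spark API:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext

// hypothetical helper: system-wide defaults still come from sc.hadoopConfiguration,
// and only the per-RDD overrides are supplied by the caller
def withOverrides(sc: SparkContext, overrides: Map[String, String]): Configuration = {
  val conf = new Configuration(sc.hadoopConfiguration)  // copy, never mutate the shared conf
  overrides.foreach { case (k, v) => conf.set(k, v) }
  conf
}

// e.g. a per-RDD delta for the split-size case discussed further down the thread
val perRddConf = withOverrides(sc, Map("mapred.min.split.size" -> "134217728"))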



On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell pwend...@gmail.com wrote:

 I see - if you look, in the saving functions we have the option for
 the user to pass an arbitrary Configuration.


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894

 It seems fine to have the same option for the loading functions, if
 it's easy to just pass this config into the input format.
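
As a rough usage sketch of that existing conf parameter (assuming the overload at the link above, which takes a JobConf, and assuming rdd is an RDD[(NullWritable, Text)]):

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

// per-RDD copy of the session-wide conf, with output-specific overrides
val outConf = new JobConf(sc.hadoopConfiguration)
outConf.set("mapred.output.compress", "true")

rdd.saveAsHadoopFile("/some/output",
  classOf[NullWritable], classOf[Text],
  classOf[TextOutputFormat[NullWritable, Text]],
  conf = outConf)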



 On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers ko...@tresata.com wrote:
  the (compression) codec parameter that is now part of many saveAs...
 methods
  came from a similar need. see SPARK-763
  hadoop has many options like this. you're either going to have to allow many
  more of these optional arguments to all the methods that read from hadoop
  inputformats and write to hadoop outputformats, or you force people to
  re-create these methods using HadoopRDD, i think (if thats even
 possible).
 
  On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
  i would like to use objectFile with some tweaks to the hadoop conf.
  currently there is no way to do that, except recreating objectFile
 myself.
  and some of the code objectFile uses i have no access to, since its
 private
  to spark.
 
 
  On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Yeah - to Nick's point, I think the way to do this is to pass in a
  custom conf when you create a Hadoop RDD (that's AFAIK why the conf
  field is there). Is there anything you can't do with that feature?
 
  On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
  nick.pentre...@gmail.com wrote:
   Imran, on your point to read multiple files together in a partition,
 is
   it
   not simpler to use the approach of copy Hadoop conf and set per-RDD
   settings for min split to control the input size per partition,
   together
   with something like CombineFileInputFormat?
  
   On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
   wrote:
  
   I think this would be a great addition, I totally agree that you
 need
   to be
   able to set these at a finer context than just the SparkContext.
  
   Just to play devil's advocate, though -- the alternative is for you
   just
   subclass HadoopRDD yourself, or make a totally new RDD, and then you
   could
   expose whatever you need.  Why is this solution better?  IMO the
   criteria
   are:
   (a) common operations
   (b) error-prone / difficult to implement
   (c) non-obvious, but important for performance
  
   I think this case fits (a) & (c), so I think its still worthwhile.
   But its
   also worth asking whether or not its too difficult for a user to
   extend
   HadoopRDD right now.  There have been several cases in the past week
   where
   we've suggested that a user should read from hdfs themselves (eg.,
 to
   read
   multiple files together in one partition) -- with*out* reusing the
   code in
   HadoopRDD, though they would lose things like the metric tracking &
   preferred locations you get from HadoopRDD.  Does HadoopRDD need some
   refactoring to make that easier to do?  Or do we just need a good
   example?
  
   Imran
  
   (sorry for hijacking your thread, Koert)
  
  
  
   On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
   wrote:
  
see email below. reynold suggested i send it to dev instead of
 user
   
-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org u...@spark.apache.org
   
   
currently its pretty hard to control the Hadoop Input/Output
 formats
used
in Spark. The convention seems to be to add extra parameters to
 all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get
translated
   into
settings on the Hadoop Configuration object.
   
for example for compression i see codec: Option[Class[_ <:
CompressionCodec]] = None added to a bunch of methods.
   
how scalable is this solution really?
   
for example i need to read from a hadoop dataset and i dont want
 the
   input
(part) files to get split up. the way to do this is to set
mapred.min.split.size. now i dont want to set this at the level
 of
the
SparkContext (which can be done), since i dont want it to apply to
input
formats in general. i want it to apply to just this one specific
input
dataset i need to read. which leaves me with no options
 

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Yeah I agree that might have been nicer, but I think for consistency
with the input APIs maybe we should do the same thing. We can also
give an example of how to clone sc.hadoopConfiguration and then set
some new values:

val conf = sc.hadoopConfiguration.clone()
  .set(k1, v1)
  .set(k2, v2)

val rdd = sc.objectFile(..., conf)

I have no idea if that's the correct syntax, but something like that
seems almost as easy as passing a hashmap with deltas.
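
A sketch that should compile today: Hadoop's Configuration has no public clone() and set() returns Unit, and the load methods like objectFile don't yet take a conf, so this goes through newAPIHadoopFile, which already accepts one.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// copy the session-wide conf and apply the per-RDD deltas to the copy only
val conf = new Configuration(sc.hadoopConfiguration)
// new-API key name; "mapred.min.split.size" is the older alias used elsewhere in this thread
conf.set("mapreduce.input.fileinputformat.split.minsize", "1073741824")

val rdd = sc.newAPIHadoopFile("/some/path",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map(_._2.toString)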

- Patrick

On Wed, Mar 25, 2015 at 6:34 AM, Koert Kuipers ko...@tresata.com wrote:
 my personal preference would be something like a Map[String, String] that
 only reflects the changes you want to make to the Configuration for the given
 input/output format (so system wide defaults continue to come from
 sc.hadoopConfiguration), similarly to what cascading/scalding did, but an
 arbitrary Configuration will work too.

 i will make a jira and pullreq when i have some time.



 On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell pwend...@gmail.com wrote:

 I see - if you look, in the saving functions we have the option for
 the user to pass an arbitrary Configuration.


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894

 It seems fine to have the same option for the loading functions, if
 it's easy to just pass this config into the input format.



 On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers ko...@tresata.com wrote:
  the (compression) codec parameter that is now part of many saveAs...
  methods
  came from a similar need. see SPARK-763
  hadoop has many options like this. you're either going to have to allow
  many
  more of these optional arguments to all the methods that read from
  hadoop
  inputformats and write to hadoop outputformats, or you force people to
  re-create these methods using HadoopRDD, i think (if thats even
  possible).
 
  On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers ko...@tresata.com
  wrote:
 
  i would like to use objectFile with some tweaks to the hadoop conf.
  currently there is no way to do that, except recreating objectFile
  myself.
  and some of the code objectFile uses i have no access to, since its
  private
  to spark.
 
 
  On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Yeah - to Nick's point, I think the way to do this is to pass in a
  custom conf when you create a Hadoop RDD (that's AFAIK why the conf
  field is there). Is there anything you can't do with that feature?
 
  On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
  nick.pentre...@gmail.com wrote:
   Imran, on your point to read multiple files together in a partition,
   is
   it
   not simpler to use the approach of copy Hadoop conf and set per-RDD
   settings for min split to control the input size per partition,
   together
   with something like CombineFileInputFormat?
  
   On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
   wrote:
  
   I think this would be a great addition, I totally agree that you
   need
   to be
   able to set these at a finer context than just the SparkContext.
  
   Just to play devil's advocate, though -- the alternative is for you
   just
   subclass HadoopRDD yourself, or make a totally new RDD, and then
   you
   could
   expose whatever you need.  Why is this solution better?  IMO the
   criteria
   are:
   (a) common operations
   (b) error-prone / difficult to implement
   (c) non-obvious, but important for performance
  
   I think this case fits (a) & (c), so I think its still worthwhile.
   But its
   also worth asking whether or not its too difficult for a user to
   extend
   HadoopRDD right now.  There have been several cases in the past
   week
   where
   we've suggested that a user should read from hdfs themselves (eg.,
   to
   read
   multiple files together in one partition) -- with*out* reusing the
   code in
   HadoopRDD, though they would lose things like the metric tracking &
   preferred locations you get from HadoopRDD.  Does HadoopRDD need some
   refactoring to make that easier to do?  Or do we just need a good
   example?
  
   Imran
  
   (sorry for hijacking your thread, Koert)
  
  
  
   On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
   wrote:
  
see email below. reynold suggested i send it to dev instead of
user
   
-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org u...@spark.apache.org
   
   
currently its pretty hard to control the Hadoop Input/Output
formats
used
in Spark. The convention seems to be to add extra parameters to
all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get
translated
   into
settings on the Hadoop Configuration object.
   
for example for compression i see codec: 

Re: hadoop input/output format advanced control

2015-03-25 Thread Sandy Ryza
Regarding Patrick's question, you can just do new Configuration(oldConf)
to get a cloned Configuration object and add any new properties to it.

-Sandy

On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:

 Hi Nick,

 I don't remember the exact details of these scenarios, but I think the user
 wanted a lot more control over how the files got grouped into partitions,
 to group the files together by some arbitrary function.  I didn't think
 that was possible w/ CombineFileInputFormat, but maybe there is a way?

 thanks

 On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

  Imran, on your point to read multiple files together in a partition, is
 it
  not simpler to use the approach of copy Hadoop conf and set per-RDD
  settings for min split to control the input size per partition, together
  with something like CombineFileInputFormat?
 
  On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
  wrote:
 
   I think this would be a great addition, I totally agree that you need
 to
  be
   able to set these at a finer context than just the SparkContext.
  
   Just to play devil's advocate, though -- the alternative is for you
 just
   subclass HadoopRDD yourself, or make a totally new RDD, and then you
  could
   expose whatever you need.  Why is this solution better?  IMO the
 criteria
   are:
   (a) common operations
   (b) error-prone / difficult to implement
   (c) non-obvious, but important for performance
  
   I think this case fits (a) & (c), so I think its still worthwhile.  But
  its
   also worth asking whether or not its too difficult for a user to extend
   HadoopRDD right now.  There have been several cases in the past week
  where
   we've suggested that a user should read from hdfs themselves (eg., to
  read
   multiple files together in one partition) -- with*out* reusing the code
  in
   HadoopRDD, though they would lose things like the metric tracking &
   preferred locations you get from HadoopRDD.  Does HadoopRDD need some
   refactoring to make that easier to do?  Or do we just need a good
  example?
  
   Imran
  
   (sorry for hijacking your thread, Koert)
  
  
  
   On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
see email below. reynold suggested i send it to dev instead of user
   
-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org u...@spark.apache.org
   
   
currently its pretty hard to control the Hadoop Input/Output formats
  used
in Spark. The convention seems to be to add extra parameters to all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get
 translated
   into
settings on the Hadoop Configuration object.
   
for example for compression i see codec: Option[Class[_ <:
CompressionCodec]] = None added to a bunch of methods.
   
how scalable is this solution really?
   
for example i need to read from a hadoop dataset and i dont want the
   input
(part) files to get split up. the way to do this is to set
mapred.min.split.size. now i dont want to set this at the level of
  the
SparkContext (which can be done), since i dont want it to apply to
  input
formats in general. i want it to apply to just this one specific
 input
dataset i need to read. which leaves me with no options currently. i
   could
go add yet another input parameter to all the methods
(SparkContext.textFile, SparkContext.hadoopFile,
  SparkContext.objectFile,
etc.). but that seems ineffective.
   
why can we not expose a Map[String, String] or some other generic way
  to
manipulate settings for hadoop input/output formats? it would require
adding one more parameter to all methods to deal with hadoop
  input/output
formats, but after that its done. one parameter to rule them all
   
then i could do:
val x = sc.textFile("/some/path", formatSettings =
  Map("mapred.min.split.size" -> "12345"))

or
rdd.saveAsTextFile("/some/path", formatSettings =
  Map("mapred.output.compress" -> "true",
      "mapred.output.compression.codec" -> "somecodec"))
   
  
 



Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Great - that's even easier. Maybe we could have a simple example in the doc.

On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Regarding Patrick's question, you can just do new Configuration(oldConf)
 to get a cloned Configuration object and add any new properties to it.

 -Sandy

 On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:

 Hi Nick,

 I don't remember the exact details of these scenarios, but I think the user
 wanted a lot more control over how the files got grouped into partitions,
 to group the files together by some arbitrary function.  I didn't think
 that was possible w/ CombineFileInputFormat, but maybe there is a way?

 thanks

 On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

  Imran, on your point to read multiple files together in a partition, is
 it
  not simpler to use the approach of copy Hadoop conf and set per-RDD
  settings for min split to control the input size per partition, together
  with something like CombineFileInputFormat?
 
  On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
  wrote:
 
   I think this would be a great addition, I totally agree that you need
 to
  be
   able to set these at a finer context than just the SparkContext.
  
   Just to play devil's advocate, though -- the alternative is for you
 just
   subclass HadoopRDD yourself, or make a totally new RDD, and then you
  could
   expose whatever you need.  Why is this solution better?  IMO the
 criteria
   are:
   (a) common operations
   (b) error-prone / difficult to implement
   (c) non-obvious, but important for performance
  
   I think this case fits (a) & (c), so I think its still worthwhile.  But
  its
   also worth asking whether or not its too difficult for a user to extend
   HadoopRDD right now.  There have been several cases in the past week
  where
   we've suggested that a user should read from hdfs themselves (eg., to
  read
   multiple files together in one partition) -- with*out* reusing the code
  in
   HadoopRDD, though they would lose things like the metric tracking &
   preferred locations you get from HadoopRDD.  Does HadoopRDD need some
   refactoring to make that easier to do?  Or do we just need a good
  example?
  
   Imran
  
   (sorry for hijacking your thread, Koert)
  
  
  
   On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
see email below. reynold suggested i send it to dev instead of user
   
-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org u...@spark.apache.org
   
   
currently its pretty hard to control the Hadoop Input/Output formats
  used
in Spark. The convention seems to be to add extra parameters to all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get
 translated
   into
settings on the Hadoop Configuration object.
   
for example for compression i see codec: Option[Class[_ <:
CompressionCodec]] = None added to a bunch of methods.
   
how scalable is this solution really?
   
for example i need to read from a hadoop dataset and i dont want the
   input
(part) files to get split up. the way to do this is to set
mapred.min.split.size. now i dont want to set this at the level of
  the
SparkContext (which can be done), since i dont want it to apply to
  input
formats in general. i want it to apply to just this one specific
 input
dataset i need to read. which leaves me with no options currently. i
   could
go add yet another input parameter to all the methods
(SparkContext.textFile, SparkContext.hadoopFile,
  SparkContext.objectFile,
etc.). but that seems ineffective.
   
why can we not expose a Map[String, String] or some other generic way
  to
manipulate settings for hadoop input/output formats? it would require
adding one more parameter to all methods to deal with hadoop
  input/output
formats, but after that its done. one parameter to rule them all
   
then i could do:
val x = sc.textFile("/some/path", formatSettings =
  Map("mapred.min.split.size" -> "12345"))

or
rdd.saveAsTextFile("/some/path", formatSettings =
  Map("mapred.output.compress" -> "true",
      "mapred.output.compression.codec" -> "somecodec"))
   
  
 





Re: hadoop input/output format advanced control

2015-03-25 Thread Aaron Davidson
Should we mention that you should synchronize
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race
condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :)
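
That lock appears to be Spark-internal (private[spark]), so application code probably cannot reach it; a user-level sketch of the same defensive pattern, guarding its own Configuration copies behind its own lock, might look like:

import org.apache.hadoop.conf.Configuration

// application-level guard for the pre-Hadoop-2.7.0 thread-safety issue when several
// threads construct or copy Configuration objects concurrently; this lock is ours, not Spark's
object ConfCloneLock

def cloneConf(base: Configuration): Configuration = ConfCloneLock.synchronized {
  new Configuration(base)
}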

On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell pwend...@gmail.com wrote:

 Great - that's even easier. Maybe we could have a simple example in the
 doc.

 On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
  Regarding Patrick's question, you can just do new
 Configuration(oldConf)
  to get a cloned Configuration object and add any new properties to it.
 
  -Sandy
 
  On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com
 wrote:
 
  Hi Nick,
 
  I don't remember the exact details of these scenarios, but I think the
 user
  wanted a lot more control over how the files got grouped into
 partitions,
  to group the files together by some arbitrary function.  I didn't think
  that was possible w/ CombineFileInputFormat, but maybe there is a way?
 
  thanks
 
  On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath 
 nick.pentre...@gmail.com
  wrote:
 
   Imran, on your point to read multiple files together in a partition,
 is
  it
   not simpler to use the approach of copy Hadoop conf and set per-RDD
   settings for min split to control the input size per partition,
 together
   with something like CombineFileInputFormat?
  
   On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
   wrote:
  
I think this would be a great addition, I totally agree that you
 need
  to
   be
able to set these at a finer context than just the SparkContext.
   
Just to play devil's advocate, though -- the alternative is for you
  just
subclass HadoopRDD yourself, or make a totally new RDD, and then you
   could
expose whatever you need.  Why is this solution better?  IMO the
  criteria
are:
(a) common operations
(b) error-prone / difficult to implement
(c) non-obvious, but important for performance
   
I think this case fits (a) & (c), so I think its still worthwhile.
 But
   its
also worth asking whether or not its too difficult for a user to
 extend
HadoopRDD right now.  There have been several cases in the past week
   where
we've suggested that a user should read from hdfs themselves (eg.,
 to
   read
multiple files together in one partition) -- with*out* reusing the
 code
   in
HadoopRDD, though they would lose things like the metric tracking &
preferred locations you get from HadoopRDD.  Does HadoopRDD need some
refactoring to make that easier to do?  Or do we just need a good
   example?
   
Imran
   
(sorry for hijacking your thread, Koert)
   
   
   
On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
   wrote:
   
 see email below. reynold suggested i send it to dev instead of
 user

 -- Forwarded message --
 From: Koert Kuipers ko...@tresata.com
 Date: Mon, Mar 23, 2015 at 4:36 PM
 Subject: hadoop input/output format advanced control
 To: u...@spark.apache.org u...@spark.apache.org


 currently its pretty hard to control the Hadoop Input/Output
 formats
   used
 in Spark. The convention seems to be to add extra parameters to
 all
 methods and then somewhere deep inside the code (for example in
 PairRDDFunctions.saveAsHadoopFile) all these parameters get
  translated
into
 settings on the Hadoop Configuration object.

 for example for compression i see codec: Option[Class[_ <:
 CompressionCodec]] = None added to a bunch of methods.

 how scalable is this solution really?

 for example i need to read from a hadoop dataset and i dont want
 the
input
 (part) files to get split up. the way to do this is to set
 mapred.min.split.size. now i dont want to set this at the level
 of
   the
 SparkContext (which can be done), since i dont want it to apply to
   input
 formats in general. i want it to apply to just this one specific
  input
 dataset i need to read. which leaves me with no options
 currently. i
could
 go add yet another input parameter to all the methods
 (SparkContext.textFile, SparkContext.hadoopFile,
   SparkContext.objectFile,
 etc.). but that seems ineffective.

 why can we not expose a Map[String, String] or some other generic
 way
   to
 manipulate settings for hadoop input/output formats? it would
 require
 adding one more parameter to all methods to deal with hadoop
   input/output
 formats, but after that its done. one parameter to rule them
 all

 then i could do:
 val x = sc.textFile("/some/path", formatSettings =
   Map("mapred.min.split.size" -> "12345"))

 or
 rdd.saveAsTextFile("/some/path", formatSettings =
   Map("mapred.output.compress" -> "true",
       "mapred.output.compression.codec" -> "somecodec"))

   
  
 


Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers
i would like to use objectFile with some tweaks to the hadoop conf.
currently there is no way to do that, except recreating objectFile myself.
and some of the code objectFile uses i have no access to, since its private
to spark.


On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah - to Nick's point, I think the way to do this is to pass in a
 custom conf when you create a Hadoop RDD (that's AFAIK why the conf
 field is there). Is there anything you can't do with that feature?

 On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  Imran, on your point to read multiple files together in a partition, is
 it
  not simpler to use the approach of copy Hadoop conf and set per-RDD
  settings for min split to control the input size per partition, together
  with something like CombineFileInputFormat?
 
  On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
 wrote:
 
  I think this would be a great addition, I totally agree that you need
 to be
  able to set these at a finer context than just the SparkContext.
 
  Just to play devil's advocate, though -- the alternative is for you just
  subclass HadoopRDD yourself, or make a totally new RDD, and then you
 could
  expose whatever you need.  Why is this solution better?  IMO the
 criteria
  are:
  (a) common operations
  (b) error-prone / difficult to implement
  (c) non-obvious, but important for performance
 
  I think this case fits (a) & (c), so I think its still worthwhile.  But
 its
  also worth asking whether or not its too difficult for a user to extend
  HadoopRDD right now.  There have been several cases in the past week
 where
  we've suggested that a user should read from hdfs themselves (eg., to
 read
  multiple files together in one partition) -- with*out* reusing the code
 in
  HadoopRDD, though they would lose things like the metric tracking &
  preferred locations you get from HadoopRDD.  Does HadoopRDD need some
  refactoring to make that easier to do?  Or do we just need a good
 example?
 
  Imran
 
  (sorry for hijacking your thread, Koert)
 
 
 
  On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
   see email below. reynold suggested i send it to dev instead of user
  
   -- Forwarded message --
   From: Koert Kuipers ko...@tresata.com
   Date: Mon, Mar 23, 2015 at 4:36 PM
   Subject: hadoop input/output format advanced control
   To: u...@spark.apache.org u...@spark.apache.org
  
  
   currently its pretty hard to control the Hadoop Input/Output formats
 used
   in Spark. The convention seems to be to add extra parameters to all
   methods and then somewhere deep inside the code (for example in
   PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
  into
   settings on the Hadoop Configuration object.
  
   for example for compression i see codec: Option[Class[_ <:
   CompressionCodec]] = None added to a bunch of methods.
  
   how scalable is this solution really?
  
   for example i need to read from a hadoop dataset and i dont want the
  input
   (part) files to get split up. the way to do this is to set
   mapred.min.split.size. now i dont want to set this at the level of
 the
   SparkContext (which can be done), since i dont want it to apply to
 input
   formats in general. i want it to apply to just this one specific input
   dataset i need to read. which leaves me with no options currently. i
  could
   go add yet another input parameter to all the methods
   (SparkContext.textFile, SparkContext.hadoopFile,
 SparkContext.objectFile,
   etc.). but that seems ineffective.
  
   why can we not expose a Map[String, String] or some other generic way
 to
   manipulate settings for hadoop input/output formats? it would require
   adding one more parameter to all methods to deal with hadoop
 input/output
   formats, but after that its done. one parameter to rule them all
  
   then i could do:
   val x = sc.textFile("/some/path", formatSettings =
     Map("mapred.min.split.size" -> "12345"))

   or
   rdd.saveAsTextFile("/some/path", formatSettings =
     Map("mapred.output.compress" -> "true",
         "mapred.output.compression.codec" -> "somecodec"))
  
 





Re: hadoop input/output format advanced control

2015-03-24 Thread Imran Rashid
I think this would be a great addition, I totally agree that you need to be
able to set these at a finer context than just the SparkContext.

Just to play devil's advocate, though -- the alternative is for you just
subclass HadoopRDD yourself, or make a totally new RDD, and then you could
expose whatever you need.  Why is this solution better?  IMO the criteria
are:
(a) common operations
(b) error-prone / difficult to implement
(c) non-obvious, but important for performance

I think this case fits (a) & (c), so I think its still worthwhile.  But its
also worth asking whether or not its too difficult for a user to extend
HadoopRDD right now.  There have been several cases in the past week where
we've suggested that a user should read from hdfs themselves (eg., to read
multiple files together in one partition) -- with*out* reusing the code in
HadoopRDD, though they would lose things like the metric tracking &
preferred locations you get from HadoopRDD.  Does HadoopRDD need some
refactoring to make that easier to do?  Or do we just need a good example?

Imran

(sorry for hijacking your thread, Koert)
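
A rough sketch of the "read from hdfs yourself" alternative mentioned above: group the files by any function you like on the driver, then open them with the FileSystem API inside each task; you get none of HadoopRDD's metrics or preferred locations. The file paths and the grouping rule are made up for illustration.

import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// hypothetical part files, grouped by an arbitrary rule (here: 2 files per partition)
val files = Seq("/data/part-00000", "/data/part-00001", "/data/part-00002", "/data/part-00003")
val groups = files.grouped(2).toSeq

// one partition per group; each task reads its files directly, no HadoopRDD involved
val lines = sc.parallelize(groups, groups.size).flatMap { group =>
  val conf = new Configuration()   // executor-side; assumes HDFS config is on the classpath
  group.iterator.flatMap { p =>
    val path = new Path(p)
    Source.fromInputStream(path.getFileSystem(conf).open(path)).getLines()
  }
}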



On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote:

 see email below. reynold suggested i send it to dev instead of user

 -- Forwarded message --
 From: Koert Kuipers ko...@tresata.com
 Date: Mon, Mar 23, 2015 at 4:36 PM
 Subject: hadoop input/output format advanced control
 To: u...@spark.apache.org u...@spark.apache.org


 currently its pretty hard to control the Hadoop Input/Output formats used
 in Spark. The convention seems to be to add extra parameters to all
 methods and then somewhere deep inside the code (for example in
 PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into
 settings on the Hadoop Configuration object.

 for example for compression i see codec: Option[Class[_ <:
 CompressionCodec]] = None added to a bunch of methods.

 how scalable is this solution really?

 for example i need to read from a hadoop dataset and i dont want the input
 (part) files to get split up. the way to do this is to set
 mapred.min.split.size. now i dont want to set this at the level of the
 SparkContext (which can be done), since i dont want it to apply to input
 formats in general. i want it to apply to just this one specific input
 dataset i need to read. which leaves me with no options currently. i could
 go add yet another input parameter to all the methods
 (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
 etc.). but that seems ineffective.

 why can we not expose a Map[String, String] or some other generic way to
 manipulate settings for hadoop input/output formats? it would require
 adding one more parameter to all methods to deal with hadoop input/output
 formats, but after that its done. one parameter to rule them all

 then i could do:
 val x = sc.textFile("/some/path", formatSettings =
   Map("mapred.min.split.size" -> "12345"))

 or
 rdd.saveAsTextFile("/some/path", formatSettings =
   Map("mapred.output.compress" -> "true",
       "mapred.output.compression.codec" -> "somecodec"))



Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point to read multiple files together in a partition, is it
not simpler to use the approach of copy Hadoop conf and set per-RDD
settings for min split to control the input size per partition, together
with something like CombineFileInputFormat?
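
A sketch of that combination, using Hadoop's stock CombineTextInputFormat plus a per-RDD copy of the conf that caps the combined split size; the size key shown is the standard Hadoop one, to the best of my knowledge:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// per-RDD conf copy: pack many small files into combined splits of at most 128 MB
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.input.fileinputformat.split.maxsize", (128L * 1024 * 1024).toString)

val packed = sc.newAPIHadoopFile("/some/dir-with-many-small-files",
  classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map(_._2.toString)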

On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote:

 I think this would be a great addition, I totally agree that you need to be
 able to set these at a finer context than just the SparkContext.

 Just to play devil's advocate, though -- the alternative is for you just
 subclass HadoopRDD yourself, or make a totally new RDD, and then you could
 expose whatever you need.  Why is this solution better?  IMO the criteria
 are:
 (a) common operations
 (b) error-prone / difficult to implement
 (c) non-obvious, but important for performance

 I think this case fits (a) & (c), so I think its still worthwhile.  But its
 also worth asking whether or not its too difficult for a user to extend
 HadoopRDD right now.  There have been several cases in the past week where
 we've suggested that a user should read from hdfs themselves (eg., to read
 multiple files together in one partition) -- with*out* reusing the code in
 HadoopRDD, though they would lose things like the metric tracking &
 preferred locations you get from HadoopRDD.  Does HadoopRDD need some
 refactoring to make that easier to do?  Or do we just need a good example?

 Imran

 (sorry for hijacking your thread, Koert)



 On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote:

  see email below. reynold suggested i send it to dev instead of user
 
  -- Forwarded message --
  From: Koert Kuipers ko...@tresata.com
  Date: Mon, Mar 23, 2015 at 4:36 PM
  Subject: hadoop input/output format advanced control
  To: u...@spark.apache.org u...@spark.apache.org
 
 
  currently its pretty hard to control the Hadoop Input/Output formats used
  in Spark. The convention seems to be to add extra parameters to all
  methods and then somewhere deep inside the code (for example in
  PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
 into
  settings on the Hadoop Configuration object.
 
  for example for compression i see codec: Option[Class[_ <:
  CompressionCodec]] = None added to a bunch of methods.
 
  how scalable is this solution really?
 
  for example i need to read from a hadoop dataset and i dont want the
 input
  (part) files to get split up. the way to do this is to set
  mapred.min.split.size. now i dont want to set this at the level of the
  SparkContext (which can be done), since i dont want it to apply to input
  formats in general. i want it to apply to just this one specific input
  dataset i need to read. which leaves me with no options currently. i
 could
  go add yet another input parameter to all the methods
  (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
  etc.). but that seems ineffective.
 
  why can we not expose a Map[String, String] or some other generic way to
  manipulate settings for hadoop input/output formats? it would require
  adding one more parameter to all methods to deal with hadoop input/output
  formats, but after that its done. one parameter to rule them all
 
  then i could do:
  val x = sc.textFile("/some/path", formatSettings =
    Map("mapred.min.split.size" -> "12345"))

  or
  rdd.saveAsTextFile("/some/path", formatSettings =
    Map("mapred.output.compress" -> "true",
        "mapred.output.compression.codec" -> "somecodec"))
 



Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
Yeah - to Nick's point, I think the way to do this is to pass in a
custom conf when you create a Hadoop RDD (that's AFAIK why the conf
field is there). Is there anything you can't do with that feature?
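
Concretely, a sketch of that approach with the old-API entry point and Koert's split-size example; sc.hadoopRDD takes the JobConf directly, so the input path has to be set on it as well:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// per-RDD JobConf seeded from the shared conf, with just the one override
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.min.split.size", "1073741824")   // keep part files from being split below ~1 GB
FileInputFormat.setInputPaths(jobConf, "/some/path")

val rdd = sc.hadoopRDD(jobConf,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map(_._2.toString)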

On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:
 Imran, on your point to read multiple files together in a partition, is it
 not simpler to use the approach of copy Hadoop conf and set per-RDD
 settings for min split to control the input size per partition, together
 with something like CombineFileInputFormat?

 On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote:

 I think this would be a great addition, I totally agree that you need to be
 able to set these at a finer context than just the SparkContext.

 Just to play devil's advocate, though -- the alternative is for you just
 subclass HadoopRDD yourself, or make a totally new RDD, and then you could
 expose whatever you need.  Why is this solution better?  IMO the criteria
 are:
 (a) common operations
 (b) error-prone / difficult to implement
 (c) non-obvious, but important for performance

 I think this case fits (a) & (c), so I think its still worthwhile.  But its
 also worth asking whether or not its too difficult for a user to extend
 HadoopRDD right now.  There have been several cases in the past week where
 we've suggested that a user should read from hdfs themselves (eg., to read
 multiple files together in one partition) -- with*out* reusing the code in
 HadoopRDD, though they would lose things like the metric tracking &
 preferred locations you get from HadoopRDD.  Does HadoopRDD need some
 refactoring to make that easier to do?  Or do we just need a good example?

 Imran

 (sorry for hijacking your thread, Koert)



 On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote:

  see email below. reynold suggested i send it to dev instead of user
 
  -- Forwarded message --
  From: Koert Kuipers ko...@tresata.com
  Date: Mon, Mar 23, 2015 at 4:36 PM
  Subject: hadoop input/output format advanced control
  To: u...@spark.apache.org u...@spark.apache.org
 
 
  currently its pretty hard to control the Hadoop Input/Output formats used
  in Spark. The convention seems to be to add extra parameters to all
  methods and then somewhere deep inside the code (for example in
  PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
 into
  settings on the Hadoop Configuration object.
 
  for example for compression i see codec: Option[Class[_ <:
  CompressionCodec]] = None added to a bunch of methods.
 
  how scalable is this solution really?
 
  for example i need to read from a hadoop dataset and i dont want the
 input
  (part) files to get split up. the way to do this is to set
  mapred.min.split.size. now i dont want to set this at the level of the
  SparkContext (which can be done), since i dont want it to apply to input
  formats in general. i want it to apply to just this one specific input
  dataset i need to read. which leaves me with no options currently. i
 could
  go add yet another input parameter to all the methods
  (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
  etc.). but that seems ineffective.
 
  why can we not expose a Map[String, String] or some other generic way to
  manipulate settings for hadoop input/output formats? it would require
  adding one more parameter to all methods to deal with hadoop input/output
  formats, but after that its done. one parameter to rule them all
 
  then i could do:
  val x = sc.textFile("/some/path", formatSettings =
    Map("mapred.min.split.size" -> "12345"))

  or
  rdd.saveAsTextFile("/some/path", formatSettings =
    Map("mapred.output.compress" -> "true",
        "mapred.output.compression.codec" -> "somecodec"))
 

