Fwd: hadoop input/output format advanced control

2015-03-23 Thread Koert Kuipers
See email below. Reynold suggested I send it to dev instead of user.

-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org


Currently it's pretty hard to control the Hadoop input/output formats used
in Spark. The convention seems to be to add extra parameters to all
methods, and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into
settings on the Hadoop Configuration object.

For example, for compression I see
codec: Option[Class[_ <: CompressionCodec]] = None
added to a bunch of methods.
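
Roughly speaking, deep inside PairRDDFunctions that option gets folded into
the Hadoop Configuration along these lines (a simplified sketch, not the
exact Spark code):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.compress.CompressionCodec

  // simplified sketch of what saveAsHadoopFile does with the codec parameter;
  // the real code sets a few more keys, but the idea is the same
  def applyCodec(hadoopConf: Configuration,
                 codec: Option[Class[_ <: CompressionCodec]]): Unit = {
    codec.foreach { c =>
      hadoopConf.set("mapred.output.compress", "true")
      hadoopConf.set("mapred.output.compression.codec", c.getCanonicalName)
      hadoopConf.set("mapred.output.compression.type", "BLOCK")
    }
  }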

How scalable is this solution, really?

For example, I need to read from a Hadoop dataset and I don't want the input
(part) files to get split up. The way to do this is to set
mapred.min.split.size. Now, I don't want to set this at the level of the
SparkContext (which can be done), since I don't want it to apply to input
formats in general; I want it to apply to just this one specific input
dataset I need to read. That leaves me with no options currently. I could
go add yet another input parameter to all the methods
(SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
etc.), but that seems ineffective.
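
The closest workaround I can see is to drop down to the lower-level calls
that accept a Configuration directly -- a sketch, with the property value
purely illustrative:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // clone the SparkContext's Hadoop conf and override the split size
  // for this one read only
  val readConf = new Configuration(sc.hadoopConfiguration)
  readConf.set("mapred.min.split.size", "12345")

  val lines = sc.newAPIHadoopFile(
    "/some/path",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    readConf
  ).map(_._2.toString)

but then every caller has to give up textFile and friends, which is exactly
the problem.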

Why can we not expose a Map[String, String] or some other generic way to
manipulate settings for Hadoop input/output formats? It would require
adding one more parameter to all methods that deal with Hadoop input/output
formats, but after that it's done. One parameter to rule them all.

Then I could do:

val x = sc.textFile("/some/path",
  formatSettings = Map("mapred.min.split.size" -> "12345"))

or:

rdd.saveAsTextFile("/some/path",
  formatSettings = Map(
    "mapred.output.compress" -> "true",
    "mapred.output.compression.codec" -> "somecodec"))
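
Until something like that exists, the nearest I can get to either of these
today is to go through the Configuration-taking variants; for the write
side, for example (only a sketch, assuming rdd is an RDD[String]; the codec
and the old-style property names are illustrative):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

  val writeConf = new Configuration(sc.hadoopConfiguration)
  writeConf.set("mapred.output.compress", "true")
  writeConf.set("mapred.output.compression.codec",
    "org.apache.hadoop.io.compress.GzipCodec")

  rdd.map(line => (NullWritable.get(), new Text(line)))
    .saveAsNewAPIHadoopFile(
      "/some/path",
      classOf[NullWritable],
      classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]],
      writeConf)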


Re: Starting sparkthrift server

2015-03-23 Thread Neil Dev
We are running this right now as the root user; the folder /tmp/spark-events
was created manually and the job has access to this folder.

On Mon, Mar 23, 2015 at 3:38 PM, Denny Lee denny.g@gmail.com wrote:

 It appears that you are running the thrift-server using the spark-events
 account, but the /tmp/spark-events folder doesn't exist or the user running
 thrift-server does not have access to it.  Have you been able to run Hive
 as the spark-events user so that the /tmp/spark-events folder gets
 created?  If you need to reassign the scratch dir / log dir to another
 folder (instead of /tmp/spark-events), you could use --hiveconf to assign
 those to another folder.


 On Mon, Mar 23, 2015 at 8:39 AM Neil Dev neilk...@gmail.com wrote:

 Hi,

 I am having issues starting spark-thriftserver. I'm running Spark 1.3.0
 with Hadoop 2.4.0. I would like to be able to change its port too, so I
 can have hive-thriftserver as well as spark-thriftserver running at the
 same time.

 Starting sparkthrift server:-
 sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077
 --executor-memory 2G

 Error:-
 I created the folder manually but am still getting the following error:
 Exception in thread "main" java.lang.IllegalArgumentException: Log
 directory /tmp/spark-events does not exist.


 I am getting the following error
 15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error:
 org.apache.thrift.transport.TTransportException: Could not create
 ServerSocket on address0.0.0.0/0.0.0.0:1.
 at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:93)
 at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
 at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236)
 at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69)
 at java.lang.Thread.run(Thread.java:745)

 Thanks
 Neil




Re: enum-like types in Spark

2015-03-23 Thread Sean Owen
Yeah, the fully realized #4, which gets back the ability to use it in
switch statements (? in Scala but not Java?), does end up being kind of
huge.

I confess I'm swayed a bit back to Java enums, seeing what it
involves. The hashCode() issue can be 'solved' with the hash of the
String representation.
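
Concretely, that workaround would look something like this (a sketch, using
a stock JDK enum as a stand-in, since Enum.hashCode() is final and
identity-based):

  import java.util.concurrent.TimeUnit

  // wherever a hash that must be stable across JVMs is needed, hash the
  // constant's name rather than the enum value itself
  def stableHash(e: Enum[_]): Int = e.name().hashCode

  val h = stableHash(TimeUnit.SECONDS)  // same value in every JVM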

On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com wrote:
 I've just switched some of my code over to the new format, and I just want
 to make sure everyone realizes what we are getting into.  I went from 10
 lines as java enums

 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20

 to 30 lines with the new format:

 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250

 It's not just that it's verbose: each name has to be repeated 4 times, with
 potential typos in some locations that won't be caught by the compiler.
 Also, you have to manually maintain the values as you update the set of
 enums; the compiler won't do it for you.
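
 To make the repetition concrete, the pattern looks roughly like this (a
 minimal sketch with made-up names, not the actual api.scala code):

   sealed trait TaskResultState { def name: String }

   object TaskResultState {
     case object Succeeded extends TaskResultState { val name = "SUCCEEDED" }
     case object Failed    extends TaskResultState { val name = "FAILED" }
     case object Killed    extends TaskResultState { val name = "KILLED" }

     // both of these must be kept in sync by hand as values are added;
     // nothing fails at compile time if one of them is forgotten
     val values: Seq[TaskResultState] = Seq(Succeeded, Failed, Killed)

     def fromString(s: String): TaskResultState =
       values.find(_.name == s).getOrElse(
         throw new IllegalArgumentException(s"Unknown state: $s"))
   }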

 The only downside I've heard for java enums is enum.hashCode().  OTOH, the
 downsides for this version are: maintainability / verbosity, no values(),
 more cumbersome to use from java, no EnumMap / EnumSet.

 I did put together a little util to at least get back the equivalent of
 enum.valueOf() with this format

 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala

 I'm not trying to prevent us from moving forward on this; it's fine if this
 is still what everyone wants, but I feel pretty strongly java enums make
 more sense.

 thanks,
 Imran




Re: Starting sparkthrift server

2015-03-23 Thread Denny Lee
When you say the job has access, do you mean that when you run spark-submit
or spark-shell (for example), it is able to write to the /tmp/spark-events
folder?


On Mon, Mar 23, 2015 at 1:02 PM Neil Dev neilk...@gmail.com wrote:

 We are running this right now as the root user; the folder /tmp/spark-events
 was created manually and the job has access to this folder.

 On Mon, Mar 23, 2015 at 3:38 PM, Denny Lee denny.g@gmail.com wrote:

 It appears that you are running the thrift-server using the spark-events
 account, but the /tmp/spark-events folder doesn't exist or the user running
 thrift-server does not have access to it.  Have you been able to run Hive
 as the spark-events user so that the /tmp/spark-events folder gets
 created?  If you need to reassign the scratch dir / log dir to another
 folder (instead of /tmp/spark-events), you could use --hiveconf to assign
 those to another folder.


 On Mon, Mar 23, 2015 at 8:39 AM Neil Dev neilk...@gmail.com wrote:

 Hi,

 I am having issues starting spark-thriftserver. I'm running Spark 1.3.0
 with Hadoop 2.4.0. I would like to be able to change its port too, so I
 can have hive-thriftserver as well as spark-thriftserver running at the
 same time.

 Starting sparkthrift server:-
 sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077
 --executor-memory 2G

 Error:-
 I created the folder manually but am still getting the following error:
 Exception in thread "main" java.lang.IllegalArgumentException: Log
 directory /tmp/spark-events does not exist.


 I am getting the following error
 15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error:
 org.apache.thrift.transport.TTransportException: Could not create
 ServerSocket on address0.0.0.0/0.0.0.0:1.
 at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:93)
 at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
 at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236)
 at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69)
 at java.lang.Thread.run(Thread.java:745)

 Thanks
 Neil





Re: Starting sparkthrift server

2015-03-23 Thread Anubhav Agarwal
When I start spark-shell (for example) it does not write to the
/tmp/spark-events folder. It remains empty. I have even tried it after
giving that folder rwx permission for user, group and others.

Neil's colleague,
Anu

On Mon, Mar 23, 2015 at 4:50 PM, Denny Lee denny.g@gmail.com wrote:

 When you say the job has access, do you mean that when you run spark-submit
 or spark-shell (for example), it is able to write to the /tmp/spark-events
 folder?


 On Mon, Mar 23, 2015 at 1:02 PM Neil Dev neilk...@gmail.com wrote:

  We are running this right now as the root user; the folder /tmp/spark-events
  was created manually and the job has access to this folder.
 
  On Mon, Mar 23, 2015 at 3:38 PM, Denny Lee denny.g@gmail.com
 wrote:
 
  It appears that you are running the thrift-server using the spark-events
  account, but the /tmp/spark-events folder doesn't exist or the user running
  thrift-server does not have access to it.  Have you been able to run Hive
  as the spark-events user so that the /tmp/spark-events folder gets
  created?  If you need to reassign the scratch dir / log dir to another
  folder (instead of /tmp/spark-events), you could use --hiveconf to assign
  those to another folder.
 
 
  On Mon, Mar 23, 2015 at 8:39 AM Neil Dev neilk...@gmail.com wrote:
 
  Hi,
 
  I am having issues starting spark-thriftserver. I'm running Spark 1.3.0
  with Hadoop 2.4.0. I would like to be able to change its port too, so I
  can have hive-thriftserver as well as spark-thriftserver running at the
  same time.
 
  Starting sparkthrift server:-
  sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077
  --executor-memory 2G
 
  Error:-
  I created the folder manually but am still getting the following error:
  Exception in thread "main" java.lang.IllegalArgumentException: Log
  directory /tmp/spark-events does not exist.
 
 
  I am getting the following error
  15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error:
  org.apache.thrift.transport.TTransportException: Could not create
  ServerSocket on address0.0.0.0/0.0.0.0:1.
  at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:93)
  at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
  at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236)
  at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69)
  at java.lang.Thread.run(Thread.java:745)
 
  Thanks
  Neil
 
 
 



Re: enum-like types in Spark

2015-03-23 Thread Reynold Xin
If scaladoc can show the Java enum types, then I do think the best way is
just to use Java enum types.


On Mon, Mar 23, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 If the official solution from the Scala community is to use Java
 enums, then it seems strange they aren't generated in scaladoc? Maybe
 we can just fix that w/ Typesafe's help and then we can use them.

 On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
  Yeah the fully realized #4, which gets back the ability to use it in
  switch statements (? in Scala but not Java?) does end up being kind of
  huge.
 
  I confess I'm swayed a bit back to Java enums, seeing what it
  involves. The hashCode() issue can be 'solved' with the hash of the
  String representation.
 
  On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com
 wrote:
  I've just switched some of my code over to the new format, and I just
 want
  to make sure everyone realizes what we are getting into.  I went from 10
  lines as java enums
 
 
 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
 
  to 30 lines with the new format:
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
 
  It's not just that it's verbose: each name has to be repeated 4 times, with
  potential typos in some locations that won't be caught by the compiler.
  Also, you have to manually maintain the values as you update the set of
  enums; the compiler won't do it for you.

  The only downside I've heard for java enums is enum.hashCode().  OTOH, the
  downsides for this version are: maintainability / verbosity, no values(),
  more cumbersome to use from java, no EnumMap / EnumSet.
 
  I did put together a little util to at least get back the equivalent of
  enum.valueOf() with this format
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
 
  I'm not trying to prevent us from moving forward on this; it's fine if this
  is still what everyone wants, but I feel pretty strongly java enums make
  more sense.
 
  thanks,
  Imran
 




Re: Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Bijay Pathak
It looks like this is not the right place for this question; I have sent the
question to the user group.

thank you,
bijay

On Mon, Mar 23, 2015 at 2:25 PM, Bijay Pathak bijay.pat...@cloudwick.com
wrote:

 Hello,

 I am running TeraSort (https://github.com/ehiggs/spark-terasort) on
 100GB of data. The final metrics I am getting on Shuffle Spill are:

 Shuffle Spill(Memory): 122.5 GB
 Shuffle Spill(Disk): 3.4 GB

 What's the difference and the relation between these two metrics? Does this
 mean 122.5 GB was spilled from memory during the shuffle?

 thank you,
 bijay



Re: enum-like types in Spark

2015-03-23 Thread Imran Rashid
Well, perhaps I overstated things a little; I wouldn't call it the
official solution, just a recommendation in the never-ending debate (and
the recommendation from folks with their hands on Scala itself).

Even if we do get this fixed in scaladoc eventually -- as it's not in the
current versions -- where does that leave this proposal?  Personally I'd
*still* prefer Java enums, even if it doesn't get into scaladoc.  BTW, even
with sealed traits, the scaladoc still isn't great: you don't see the
values from the class, you only see them listed from the companion object.
(Though that is somewhat standard for scaladoc, so maybe I'm reaching a
little.)



On Mon, Mar 23, 2015 at 4:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 If the official solution from the Scala community is to use Java
  enums, then it seems strange they aren't generated in scaladoc? Maybe
 we can just fix that w/ Typesafe's help and then we can use them.

 On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
  Yeah the fully realized #4, which gets back the ability to use it in
  switch statements (? in Scala but not Java?) does end up being kind of
  huge.
 
  I confess I'm swayed a bit back to Java enums, seeing what it
  involves. The hashCode() issue can be 'solved' with the hash of the
  String representation.
 
  On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com
 wrote:
  I've just switched some of my code over to the new format, and I just
 want
  to make sure everyone realizes what we are getting into.  I went from 10
  lines as java enums
 
 
 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
 
  to 30 lines with the new format:
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
 
  It's not just that it's verbose: each name has to be repeated 4 times, with
  potential typos in some locations that won't be caught by the compiler.
  Also, you have to manually maintain the values as you update the set of
  enums; the compiler won't do it for you.

  The only downside I've heard for java enums is enum.hashCode().  OTOH, the
  downsides for this version are: maintainability / verbosity, no values(),
  more cumbersome to use from java, no EnumMap / EnumSet.
 
  I did put together a little util to at least get back the equivalent of
  enum.valueOf() with this format
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
 
  I'm not trying to prevent us from moving forward on this; it's fine if this
  is still what everyone wants, but I feel pretty strongly java enums make
  more sense.
 
  thanks,
  Imran
 



Re: enum-like types in Spark

2015-03-23 Thread Aaron Davidson
The only issue I knew of with Java enums was that they do not appear in the
Scala documentation.

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:

 Yeah the fully realized #4, which gets back the ability to use it in
 switch statements (? in Scala but not Java?) does end up being kind of
 huge.

 I confess I'm swayed a bit back to Java enums, seeing what it
 involves. The hashCode() issue can be 'solved' with the hash of the
 String representation.

 On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com
 wrote:
  I've just switched some of my code over to the new format, and I just
 want
  to make sure everyone realizes what we are getting into.  I went from 10
  lines as java enums
 
 
 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
 
  to 30 lines with the new format:
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
 
  It's not just that it's verbose: each name has to be repeated 4 times, with
  potential typos in some locations that won't be caught by the compiler.
  Also, you have to manually maintain the values as you update the set of
  enums; the compiler won't do it for you.

  The only downside I've heard for java enums is enum.hashCode().  OTOH, the
  downsides for this version are: maintainability / verbosity, no values(),
  more cumbersome to use from java, no EnumMap / EnumSet.
 
  I did put together a little util to at least get back the equivalent of
  enum.valueOf() with this format
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
 
  I'm not trying to prevent us from moving forward on this; it's fine if this
  is still what everyone wants, but I feel pretty strongly java enums make
  more sense.
 
  thanks,
  Imran



Re: enum-like types in Spark

2015-03-23 Thread Patrick Wendell
If the official solution from the Scala community is to use Java
enums, then it seems strange they aren't generated in scaladoc? Maybe
we can just fix that w/ Typesafe's help and then we can use them.

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
 Yeah the fully realized #4, which gets back the ability to use it in
 switch statements (? in Scala but not Java?) does end up being kind of
 huge.

 I confess I'm swayed a bit back to Java enums, seeing what it
 involves. The hashCode() issue can be 'solved' with the hash of the
 String representation.

 On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com wrote:
 I've just switched some of my code over to the new format, and I just want
 to make sure everyone realizes what we are getting into.  I went from 10
 lines as java enums

 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20

 to 30 lines with the new format:

 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250

 It's not just that it's verbose: each name has to be repeated 4 times, with
 potential typos in some locations that won't be caught by the compiler.
 Also, you have to manually maintain the values as you update the set of
 enums; the compiler won't do it for you.

 The only downside I've heard for java enums is enum.hashCode().  OTOH, the
 downsides for this version are: maintainability / verbosity, no values(),
 more cumbersome to use from java, no EnumMap / EnumSet.

 I did put together a little util to at least get back the equivalent of
 enum.valueOf() with this format

 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala

 I'm not trying to prevent us from moving forward on this; it's fine if this
 is still what everyone wants, but I feel pretty strongly java enums make
 more sense.

 thanks,
 Imran




Re: Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Zhan Zhang
Thanks Reynold,

I agree with you on opening another JIRA to unify the block storage API. I
have uploaded the design doc to SPARK-6479 as well.

Thanks.

Zhan Zhang

On Mar 23, 2015, at 4:03 PM, Reynold Xin r...@databricks.com wrote:

I created a ticket to separate the API refactoring from the implementation.
It would be great to have these as two separate patches to make it easier to
review (similar to the way we are doing the RPC refactoring -- first
introducing an internal RPC API, porting Akka to it, and then adding an
alternative implementation).

https://issues.apache.org/jira/browse/SPARK-6479

Can you upload your design doc there so we can discuss the block store API?
Thanks.


On Mon, Mar 23, 2015 at 3:47 PM, Zhan Zhang zzh...@hortonworks.com wrote:
Hi Folks,

I am planning to implement HDFS off-heap support for Spark, and have uploaded
the design doc for off-heap support through HDFS RAM disk in JIRA
SPARK-6112. Please review it and provide your feedback if you are
interested.

https://issues.apache.org/jira/browse/SPARK-6112

Thanks.

Zhan Zhang




Re: Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Reynold Xin
I created a ticket to separate the API refactoring from the implementation.
It would be great to have these as two separate patches to make it easier to
review (similar to the way we are doing the RPC refactoring -- first
introducing an internal RPC API, porting Akka to it, and then adding an
alternative implementation).

https://issues.apache.org/jira/browse/SPARK-6479

Can you upload your design doc there so we can discuss the block store API?
Thanks.


On Mon, Mar 23, 2015 at 3:47 PM, Zhan Zhang zzh...@hortonworks.com wrote:

 Hi Folks,

  I am planning to implement HDFS off-heap support for Spark, and have
  uploaded the design doc for off-heap support through HDFS RAM disk in
  JIRA SPARK-6112. Please review it and provide your feedback if you are
  interested.

 https://issues.apache.org/jira/browse/SPARK-6112

 Thanks.

 Zhan Zhang