Fwd: hadoop input/output format advanced control
see email below. Reynold suggested I send it to dev instead of user.

-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org

Currently it's pretty hard to control the Hadoop input/output formats used in Spark. The convention seems to be to add extra parameters to all methods, and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object.

For example, for compression I see codec: Option[Class[_ <: CompressionCodec]] = None added to a bunch of methods. How scalable is this solution, really?

For example, I need to read from a Hadoop dataset and I don't want the input (part) files to get split up. The way to do this is to set mapred.min.split.size. Now, I don't want to set this at the level of the SparkContext (which can be done), since I don't want it to apply to input formats in general; I want it to apply to just this one specific input dataset I need to read. That leaves me with no options currently. I could go add yet another input parameter to all the methods (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile, etc.), but that doesn't scale.

Why can we not expose a Map[String, String] or some other generic way to manipulate settings for Hadoop input/output formats? It would require adding one more parameter to all methods that deal with Hadoop input/output formats, but after that it's done. One parameter to rule them all. Then I could do:

    val x = sc.textFile("/some/path", formatSettings = Map("mapred.min.split.size" -> "12345"))

or:

    rdd.saveAsTextFile("/some/path", formatSettings = Map("mapred.output.compress" -> "true", "mapred.output.compression.codec" -> "somecodec"))
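For the archives: the per-dataset override is achievable today, without the proposed formatSettings parameter, by copying the Hadoop Configuration before the read so the override never touches the SparkContext-wide conf. A minimal sketch, assuming a SparkContext named sc and the new-API TextInputFormat (mapred.min.split.size is the old-style key; newer Hadoop spells it mapreduce.input.fileinputformat.split.minsize):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // copy the context-wide conf so the override applies to this read only
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("mapred.min.split.size", "12345")

    val lines = sc.newAPIHadoopFile(
      "/some/path",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf
    ).map(_._2.toString)

With the proposed formatSettings parameter, this boilerplate would collapse into the one-liners above.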
Re: Starting Spark thrift server
We are running this right now as the root user, and the folder /tmp/spark-events was manually created; the job has access to this folder.

On Mon, Mar 23, 2015 at 3:38 PM, Denny Lee denny.g@gmail.com wrote:
> It appears that you are running the thrift server as the spark-events account, but the /tmp/spark-events folder doesn't exist or the user running the thrift server does not have access to it. Have you been able to run Hive as the spark-events user, so that the /tmp/spark-events folder has been created? If you need to reassign the scratch dir / log dir to another folder (instead of /tmp/spark-events), you could use --hiveconf to assign those to another folder.
>
> On Mon, Mar 23, 2015 at 8:39 AM Neil Dev neilk...@gmail.com wrote:
>> Hi,
>>
>> I am having issues starting the spark-thriftserver. I'm running Spark 1.3.0 with Hadoop 2.4.0. I would also like to be able to change its port, so I can have the Hive thrift server as well as the Spark thrift server running at the same time.
>>
>> Starting the Spark thrift server:
>> sudo ./start-thriftserver.sh --master spark://ip-172-31-10-124:7077 --executor-memory 2G
>>
>> Error: I created the folder manually but am still getting the following error:
>> Exception in thread "main" java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist.
>>
>> I am also getting the following error:
>> 15/03/23 15:07:02 ERROR thrift.ThriftCLIService: Error:
>> org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:1.
>>   at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:93)
>>   at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
>>   at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236)
>>   at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> Thanks
>> Neil
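A sketch of the two fixes suggested above, with illustrative values (the port number is an assumption; hive.server2.thrift.port just needs to differ from the Hive thrift server's port, which defaults to 10000):

    # pre-create the event-log directory the server is complaining about
    mkdir -p /tmp/spark-events

    # run the Spark thrift server on a non-default port so it can coexist
    # with the Hive thrift server (default port 10000)
    sudo ./start-thriftserver.sh \
      --master spark://ip-172-31-10-124:7077 \
      --executor-memory 2G \
      --hiveconf hive.server2.thrift.port=10001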
Re: enum-like types in Spark
Yeah, the fully realized #4, which gets back the ability to use it in switch statements (in Scala but not Java?), does end up being kind of huge. I confess I'm swayed a bit back to Java enums, seeing what it involves. The hashCode() issue can be 'solved' with the hash of the String representation.

On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com wrote:
> I've just switched some of my code over to the new format, and I just want to make sure everyone realizes what we are getting into. I went from 10 lines as Java enums
> https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
> to 30 lines with the new format:
> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
>
> It's not just that it's verbose. Each name has to be repeated four times, with potential typos in some locations that won't be caught by the compiler. Also, you have to manually maintain the values as you update the set of enums; the compiler won't do it for you.
>
> The only downside I've heard for Java enums is Enum.hashCode(). OTOH, the downsides of this version are: maintainability / verbosity, no values(), more cumbersome to use from Java, no EnumMap / EnumSet.
>
> I did put together a little util to at least get back the equivalent of Enum.valueOf() with this format:
> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
>
> I'm not trying to prevent us from moving forward on this; it's fine if this is still what everyone wants, but I feel pretty strongly that Java enums make more sense.
>
> thanks,
> Imran
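To make the trade-off concrete, here is a condensed sketch of the sealed-trait pattern being compared (names are illustrative; the real code is at the links above):

    // the Java enum equivalent would be roughly one line:
    //   public enum StageStatus { ACTIVE, COMPLETE, FAILED }
    sealed abstract class StageStatus extends Serializable

    object StageStatus {
      case object Active extends StageStatus
      case object Complete extends StageStatus
      case object Failed extends StageStatus

      // must be maintained by hand -- the compiler won't flag a missing entry
      val values: Seq[StageStatus] = Seq(Active, Complete, Failed)

      // hand-rolled stand-in for java.lang.Enum.valueOf
      def fromString(s: String): StageStatus =
        values.find(_.toString == s).getOrElse(
          throw new IllegalArgumentException("unknown status: " + s))
    }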
Re: Starting Spark thrift server
When you say the job has access, do you mean that when you run spark-submit or spark-shell (for example), it is able to write to the /tmp/spark-events folder?

On Mon, Mar 23, 2015 at 1:02 PM Neil Dev neilk...@gmail.com wrote:
> We are running this right now as the root user, and the folder /tmp/spark-events was manually created; the job has access to this folder.
> [...]
Re: Starting Spark thrift server
When I start spark-shell (for example), it does not write to the /tmp/spark-events folder; it remains empty. I have even tried after giving that folder rwx permissions for user, group, and others.

Neil's colleague,
Anu

On Mon, Mar 23, 2015 at 4:50 PM, Denny Lee denny.g@gmail.com wrote:
> When you say the job has access, do you mean that when you run spark-submit or spark-shell (for example), it is able to write to the /tmp/spark-events folder?
> [...]
Re: enum-like types in Spark
If scaladoc can show the Java enum types, I do think the best way is then just Java enum types.

On Mon, Mar 23, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:
> If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in scaladoc. Maybe we can just fix that w/ Typesafe's help and then we can use them.
>
> On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
>> [...]
Re: Shuffle Spill Memory and Shuffle Spill Disk
It looks like this is not the right place for this question; I have sent it to the user group.

thank you,
bijay

On Mon, Mar 23, 2015 at 2:25 PM, Bijay Pathak bijay.pat...@cloudwick.com wrote:
> Hello,
>
> I am running TeraSort (https://github.com/ehiggs/spark-terasort) on 100 GB of data. The final metrics I am getting on shuffle spill are:
>
> Shuffle Spill (Memory): 122.5 GB
> Shuffle Spill (Disk): 3.4 GB
>
> What's the difference and relation between these two metrics? Do these mean 122.5 GB was spilled from memory during the shuffle?
>
> thank you,
> bijay
Re: enum-like types in Spark
Well, perhaps I overstated things a little. I wouldn't call it the official solution, just a recommendation in the never-ending debate (and the recommendation from folks with their hands on Scala itself).

Even if we do get this fixed in scaladoc eventually -- as it's not in the current versions -- where does that leave this proposal? Personally I'd *still* prefer Java enums, even if it doesn't get into scaladoc.

BTW, even with sealed traits, the scaladoc still isn't great: you don't see the values from the class, you only see them listed from the companion object. (Though that is somewhat standard for scaladoc, so maybe I'm reaching a little.)

On Mon, Mar 23, 2015 at 4:11 PM, Patrick Wendell pwend...@gmail.com wrote:
> If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in scaladoc. Maybe we can just fix that w/ Typesafe's help and then we can use them.
> [...]
Re: enum-like types in Spark
The only issue I knew of with Java enums was that they do not appear in the Scala documentation.

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
> Yeah, the fully realized #4, which gets back the ability to use it in switch statements (in Scala but not Java?), does end up being kind of huge. I confess I'm swayed a bit back to Java enums, seeing what it involves.
> [...]
Re: enum-like types in Spark
If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in scaladoc. Maybe we can just fix that w/ Typesafe's help and then we can use them.

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
> Yeah, the fully realized #4, which gets back the ability to use it in switch statements (in Scala but not Java?), does end up being kind of huge. I confess I'm swayed a bit back to Java enums, seeing what it involves.
> [...]
Re: Review request for SPARK-6112: Provide OffHeap support through HDFS RAM_DISK
Thanks Reynold. I agree with you about opening another JIRA to unify the block storage API. I have uploaded the design doc to SPARK-6479 as well.

Thanks.

Zhan Zhang

On Mar 23, 2015, at 4:03 PM, Reynold Xin r...@databricks.com wrote:
> I created a ticket to separate the API refactoring from the implementation. It would be great to have these as two separate patches to make them easier to review (similar to the way we are doing the RPC refactoring -- first introducing an internal RPC API, porting Akka to it, and then adding an alternative implementation).
>
> https://issues.apache.org/jira/browse/SPARK-6479
>
> Can you upload your design doc there so we can discuss the block store API? Thanks.
>
> On Mon, Mar 23, 2015 at 3:47 PM, Zhan Zhang zzh...@hortonworks.com wrote:
>> Hi Folks,
>>
>> I am planning to implement HDFS off-heap support for Spark, and have uploaded the design doc for off-heap support through the HDFS RAM disk to JIRA SPARK-6112. Please review it and provide your feedback if you are interested.
>>
>> https://issues.apache.org/jira/browse/SPARK-6112
>>
>> Thanks.
>>
>> Zhan Zhang
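For readers following along without the design doc, a purely hypothetical sketch of the shape a pluggable block-store API might take -- these names and signatures are illustrative only, not the actual proposal in SPARK-6479:

    import java.nio.ByteBuffer

    // hypothetical interface: an external store the block manager could
    // delegate to (e.g. an HDFS RAM_DISK-backed implementation)
    trait ExternalBlockStore {
      def init(executorId: String): Unit
      def putBytes(blockId: String, bytes: ByteBuffer): Unit
      def getBytes(blockId: String): Option[ByteBuffer]
      def removeBlock(blockId: String): Boolean
      def shutdown(): Unit
    }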
Re: Review request for SPARK-6112: Provide OffHeap support through HDFS RAM_DISK
I created a ticket to separate the API refactoring from the implementation. It would be great to have these as two separate patches to make them easier to review (similar to the way we are doing the RPC refactoring -- first introducing an internal RPC API, porting Akka to it, and then adding an alternative implementation).

https://issues.apache.org/jira/browse/SPARK-6479

Can you upload your design doc there so we can discuss the block store API? Thanks.

On Mon, Mar 23, 2015 at 3:47 PM, Zhan Zhang zzh...@hortonworks.com wrote:
> Hi Folks, I am planning to implement HDFS off-heap support for Spark, and have uploaded the design doc to JIRA SPARK-6112.
> [...]