Re: Requirements of objects stored in RDDs
An RDD can hold objects of any type. If you generally think of it as a distributed Collection, then you won't ever be that far off. As far as serialization goes, the contents of an RDD must be serializable. There are two serialization libraries you can use with Spark: normal Java serialization or Kryo serialization. See https://spark.apache.org/docs/latest/tuning.html#data-serialization for more details. If you are using Java serialization, then just implementing the Serializable interface will work. If you're using Kryo, you may need to register serializers for your classes.

As for the point that it works fine in local mode and tests but fails on Mesos: that makes me think there's an issue with the Mesos cluster deployment. First, does it work properly in standalone mode? Second, how are you getting the Clojure libraries onto the Mesos executors? Are they included in your executor URI bundle, or are you otherwise passing a parameter that points to the Clojure jars?

Cheers,
Andrew

On Thu, May 8, 2014 at 9:55 AM, Soren Macbeth so...@yieldbot.com wrote:

Hi,

What are the requirements of objects that are stored in RDDs? I'm still struggling with an exception I've already posted about several times. My questions are:

1) What interfaces are objects stored in RDDs expected to implement, if any?
2) Are collections (be they Scala, Java or otherwise) handled differently than other objects?

The bug I'm hitting is when I try to use my Clojure DSL (which wraps the Java API) with Clojure collections, specifically clojure.lang.PersistentVectors, in my RDDs. Here is the exception message:

org.apache.spark.SparkException: Job aborted: Exception while deserializing and fetching task: com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final scala.collection.convert.Wrappers field scala.collection.convert.Wrappers$SeqWrapper.$outer to clojure.lang.PersistentVector

Now, this same application works fine in local mode and tests, but it fails when run under Mesos.
That would seem to me to point to something around RDD partitioning for tasks, but I'm not sure. I don't know much Scala, but according to Google, SeqWrapper is part of the implicit JavaConversions functionality of Scala collections. Under what circumstances would Spark be trying to wrap my RDD objects in Scala collections?

Finally, I'd like to point out that this is not a serialization issue with my Clojure collection objects. I have registered serializers for them and have verified that they serialize and deserialize perfectly well in Spark. One last note: this failure occurs after all the tasks have finished for a reduce stage and the results are returned to the driver.

TIA
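For concreteness, the Kryo registration hook Andrew alludes to looks roughly like this against the Spark 0.9/1.0-era API. The registrator class name and the choice of classes to register are placeholders for whatever the application actually uses; this is a sketch, not the poster's code:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator for the application's types. Every class that
// ends up inside an RDD should be registered here, including collection
// types such as clojure.lang.PersistentVector.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[clojure.lang.PersistentVector])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
```

The registrator class must be on the executor classpath, which ties back to Andrew's question about how the jars reach the Mesos executors.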
Re: Updating docs for running on Mesos
As far as I know, the upstream doesn't release binaries, only source code. The downloads page https://mesos.apache.org/downloads/ for 0.18.0 only has a source tarball. Is there a binary release somewhere from Mesos that I'm missing?

On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell pwend...@gmail.com wrote:

Andrew, Updating these docs would be great! I think this would be a welcome change. In terms of packaging, it would be good to mention the binaries produced by the upstream project as well, in addition to Mesosphere. - Patrick

On Thu, May 8, 2014 at 12:51 AM, Andrew Ash and...@andrewash.com wrote:

The docs for how to run Spark on Mesos have changed very little since 0.6.0, but setting it up is much easier now than then. Does it make sense to revamp with the below changes?

You no longer need to build Mesos yourself, as pre-built versions are available from Mesosphere: http://mesosphere.io/downloads/ And the instructions guide you towards compiling your own distribution of Spark, when you can use the prebuilt versions of Spark as well. I'd like to split that portion of the documentation into two sections, a build-from-scratch section and a use-prebuilt section.

The new outline would look something like this:

*Running Spark on Mesos*

Installing Mesos
- using prebuilt (recommended) - pointer to Mesosphere's packages
- from scratch - (similar to current)

Connecting Spark to Mesos
- loading distribution into an accessible location
- Spark settings

Mesos Run Modes
- (same as current)

Running Alongside Hadoop
- (trim this down)

Does that work for people? Thanks!

Andrew

PS Basically all the same:
http://spark.apache.org/docs/0.6.0/running-on-mesos.html
http://spark.apache.org/docs/0.6.2/running-on-mesos.html
http://spark.apache.org/docs/0.7.3/running-on-mesos.html
http://spark.apache.org/docs/0.8.1/running-on-mesos.html
http://spark.apache.org/docs/0.9.1/running-on-mesos.html
https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
Re: Kryo not default?
The main reason is that it doesn't always work (e.g. sometimes the application program already has special serialization / externalization written for Java that doesn't work with Kryo).

On Mon, May 12, 2014 at 5:47 PM, Anand Avati av...@gluster.org wrote:

Hi, Can someone share the reason why the Kryo serializer is not the default? Is there anything to be careful about (because of which it is not enabled by default)? Thanks!
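One common way around the "special Java serialization" case is to tell Kryo to fall back to Java serialization for just that class, via Kryo's built-in JavaSerializer. A sketch, where MyLegacyClass is a placeholder for a class with hand-written writeObject/readObject logic:

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical class with custom Java serialization that Kryo's default
// field-by-field serializer would break.
class MyLegacyClass extends Serializable

class FallbackRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Route this one class through its existing Java serialization logic;
    // everything else still goes through Kryo.
    kryo.register(classOf[MyLegacyClass], new JavaSerializer())
  }
}
```

This keeps Kryo's speed for the bulk of the data while preserving correctness for the legacy class.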
Re: Preliminary Parquet numbers and including .count() in Catalyst
Thanks for the experiments and analysis! I think Michael already submitted a patch that avoids scanning all columns for count(*) or count(1).

On Mon, May 12, 2014 at 9:46 PM, Andrew Ash and...@andrewash.com wrote:

Hi Spark devs,

First of all, huge congrats on the Parquet integration with SparkSQL! This is an incredible direction forward and something I can see being very broadly useful. I was doing some preliminary tests to see how it works with one of my workflows, and wanted to share some numbers that people might want to know about. I also wanted to point out that .count() doesn't seem integrated with the rest of the optimization framework, and some big gains could be possible.

So, the numbers: I took a table extracted from a SQL database and stored in HDFS:
- 115 columns (several always-empty, mostly strings, some enums, some numbers)
- 253,887,080 rows
- 182,150,295,881 bytes (raw uncompressed)
- 42,826,820,222 bytes (lzo compressed with .index file)

And I converted it to Parquet using SparkSQL's SchemaRDD.saveAsParquet() call:
- Converting from .lzo in HDFS to .parquet in HDFS took 635s using 42 cores across 4 machines
- 17,517,922,117 bytes (parquet per SparkSQL defaults)

So storing in Parquet format vs lzo compresses the data down to less than 50% of the .lzo size, and under 10% of the raw uncompressed size. Nice!

I then did some basic interactions on it:

*Row count*
- LZO - lzoFile("/path/to/lzo").count - 31.632305953s
- Parquet - sqlContext.parquetFile("/path/to/parquet").count - 289.129487003s

Reassembling rows from the separate column storage is clearly really expensive. Median task length is 33s vs 4s, and of that 33s in each task (319 tasks total) about 1.75 seconds are spent in GC (inefficient object allocation?)
*Count number of rows with a particular key:*
- LZO - lzoFile("/path/to/lzo").filter(_.split("\\|")(0) == "1234567890").count - 73.988897511s
- Parquet - sqlContext.parquetFile("/path/to/parquet").where('COL === 1234567890).count - 293.410470418s
- Parquet (hand-tuned to count on just one column) - sqlContext.parquetFile("/path/to/parquet").where('COL === 1234567890).select('IDCOL).count - 1.160449187s

It looks like currently the .count() on Parquet is handled incredibly inefficiently and all the columns are materialized. But if I select just the relevant column and then count, the column-oriented storage of Parquet really shines. There ought to be a potential optimization here such that a .count() on a SchemaRDD backed by Parquet doesn't require re-assembling the rows, as that's expensive. I don't think .count() is handled specially in SchemaRDDs, but it seems ripe for optimization.

*Count number of distinct values in a column*
- LZO - lzoFile("/path/to/lzo").map(sel(0)).distinct.count - 115.582916866s
- Parquet - sqlContext.parquetFile("/path/to/parquet").select('COL).distinct.count - 16.839004826s

It turns out column selectivity is very useful! I'm guessing that if I could get byte counts read out of HDFS, that would just about match up with the difference in read times.

Any thoughts on how to embed the knowledge of my hand-tuned additional .select('IDCOL) into Catalyst?

Thanks again for all the hard work and prep for the 1.0 release!

Andrew
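The hand-tuned trick above boils down to projecting a single column before counting, so Parquet only has to scan that column's pages instead of reassembling full rows. A sketch against the 1.0-era SparkSQL API; the path, column names, and key value are placeholders, not the actual workload:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "parquet-count-sketch")
val sqlContext = new SQLContext(sc)
import sqlContext._

val t = sqlContext.parquetFile("/path/to/parquet")  // placeholder path

// Slow: count() here is the plain RDD count, so every column is
// materialized just to build row objects that are immediately discarded.
val naive = t.where('COL === 1234567890).count()

// Fast: projecting one narrow column first means Parquet reads only that
// column -- the ~290s vs ~1.2s difference in the numbers above.
val pruned = t.where('COL === 1234567890).select('IDCOL).count()
```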
Is this supported? : Spark on Windows, Hadoop YARN on Linux.
I'm trying to run spark-shell on Windows against Hadoop YARN on Linux. Specifically, the environment is as follows:

- Client
  - OS: Windows 7
  - Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8)
- Server
  - Platform: Hortonworks Sandbox 2.1

I had to modify the Spark source code to apply https://issues.apache.org/jira/browse/YARN-1824 so that the cross-platform issues could be addressed (that is, change $() to $$(), and File.pathSeparator to ApplicationConstants.CLASS_PATH_SEPARATOR). Seeing this, I suspect that the Spark code for now is not prepared to support cross-platform submission, that is, Spark on Windows to Hadoop YARN on Linux.

Anyway, after the modification and some configuration tweaks, the yarn-client mode spark-shell submitted from Windows 7 at least seems to try to start. But the ApplicationMaster fails to register. The YARN server log is as follows ('owner' is the user name of the Windows 7 machine):

Log Type: stderr
Log Length: 1356
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/05/12 01:13:54 INFO YarnSparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/05/12 01:13:54 INFO SecurityManager: Changing view acls to: yarn,owner
14/05/12 01:13:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, owner)
14/05/12 01:13:55 INFO Slf4jLogger: Slf4jLogger started
14/05/12 01:13:56 INFO Remoting: Starting remoting
14/05/12 01:13:56 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkyar...@sandbox.hortonworks.com:47074]
14/05/12 01:13:56 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkyar...@sandbox.hortonworks.com:47074]
14/05/12 01:13:56 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8030
14/05/12 01:13:56 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1399856448891_0018_01
14/05/12 01:13:56 INFO ExecutorLauncher: Registering the ApplicationMaster
14/05/12 01:13:56 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]

How can I handle this error? Or should I give up and use Linux for my client machine? (I want to use Windows for the client, since it's more comfortable for me to develop applications there.) BTW, I'm a newbie with Spark and Hadoop. Thanks in advance.
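The cross-platform change referenced above boils down to deferring environment-variable and path-separator expansion from the client to the node. A sketch of the kind of change YARN-1824 enables (this illustrates the API difference, it is not the exact Spark diff):

```scala
import java.io.File
import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Before: $() expands on the client, so a Windows client emits %PWD% and
// ';' -- syntax a Linux NodeManager cannot interpret.
val before = Environment.PWD.$() + File.pathSeparator + "spark.jar"

// After: $$() and CLASS_PATH_SEPARATOR emit late-bound placeholders that
// the node itself expands with its own variable syntax and separator.
val after = Environment.PWD.$$() +
  ApplicationConstants.CLASS_PATH_SEPARATOR + "spark.jar"
```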
Re: Updating docs for running on Mesos
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link only to a company’s builds; but we can perhaps include them in addition to instructions for building Mesos from Apache.

Matei

On May 12, 2014, at 11:55 PM, Gerard Maas gerard.m...@gmail.com wrote:

Andrew, Mesosphere has binary releases here: http://mesosphere.io/downloads/ (Anecdote: I actually burned out a CPU building Mesos from source. No kidding: it had been coming, as the laptop was crashing from time to time, but the Mesos build was the one drop too much.) kr, Gerard.

On Tue, May 13, 2014 at 6:57 AM, Andrew Ash and...@andrewash.com wrote:

As far as I know, the upstream doesn't release binaries, only source code. The downloads page https://mesos.apache.org/downloads/ for 0.18.0 only has a source tarball. Is there a binary release somewhere from Mesos that I'm missing?

On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell pwend...@gmail.com wrote:

Andrew, Updating these docs would be great! I think this would be a welcome change. In terms of packaging, it would be good to mention the binaries produced by the upstream project as well, in addition to Mesosphere. - Patrick

On Thu, May 8, 2014 at 12:51 AM, Andrew Ash and...@andrewash.com wrote:

The docs for how to run Spark on Mesos have changed very little since 0.6.0, but setting it up is much easier now than then. Does it make sense to revamp with the below changes? You no longer need to build Mesos yourself, as pre-built versions are available from Mesosphere: http://mesosphere.io/downloads/ And the instructions guide you towards compiling your own distribution of Spark, when you can use the prebuilt versions of Spark as well. I'd like to split that portion of the documentation into two sections, a build-from-scratch section and a use-prebuilt section.
Re: Kryo not default?
On Mon, May 12, 2014 at 2:47 PM, Anand Avati av...@gluster.org wrote:

Hi, Can someone share the reason why Kryo serializer is not the default?

Why should it be? On top of that, the only way to serialize a closure into the backend (even now) is Java serialization, which means Java serialization is required of all closure attributes.

Is there anything to be careful about (because of which it is not enabled by default)?

Yes, and it kind of stems from the above. There are still a number of API calls that use closure attributes to serialize data to the backend (see fold(), for example), which means that even if you enable Kryo, some APIs still require Java serialization of an attribute. I fixed parallelize(), collect() and something else I don't remember in that regard, but I think even now there are still a number of APIs lingering whose data parameters won't work with Kryo.

Thanks!
Serializable different behavior Spark Shell vs. Scala Shell
Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute isInstanceOf[]. I am using Spark 0.9.0/Scala 2.10.3. Is this a bug?

Spark Shell (equals uses match{}):

class C(val s: String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = false

Spark Shell (equals uses isInstanceOf[]):

class C(val s: String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true

Scala Shell (equals uses match{}):

class C(val s: String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true
Re: Multinomial Logistic Regression
Hi Deb,

For K possible outcomes in multinomial logistic regression, we can have K-1 independent binary logistic regression models, in which one outcome is chosen as a pivot and the other K-1 outcomes are separately regressed against the pivot outcome. See my presentation for the technical detail: http://www.slideshare.net/dbtsai/2014-0501-mlor

Since MLlib only supports one linear model per classification model, there will be some infrastructure work to support MLOR in MLlib. But if you want to implement it yourself with the L-BFGS solver in MLlib, you can follow the equations in my slides, and it will not be too difficult. I can give you the gradient method for multinomial logistic regression; you just need to put the K-1 intercepts in the right place.

def computeGradient(y: Double, x: Array[Double], lambda: Double, w: Array[Array[Double]], b: Array[Double], gradient: Array[Double]): (Double, Int) = {
  val classes = b.length + 1
  val yy = y.toInt
  def alpha(i: Int): Int = { if (i == 0) 1 else 0 }
  def delta(i: Int, j: Int): Int = { if (i == j) 1 else 0 }
  var denominator: Double = 1.0
  val numerators: Array[Double] = Array.ofDim[Double](b.length)
  var predicted = 1
  {
    var i = 0
    var j = 0
    var acc: Double = 0
    while (i < b.length) {
      acc = b(i)
      j = 0
      while (j < x.length) {
        acc += x(j) * w(i)(j)
        j += 1
      }
      numerators(i) = math.exp(acc)
      if (i > 0 && numerators(i) > numerators(predicted - 1)) {
        predicted = i + 1
      }
      denominator += numerators(i)
      i += 1
    }
    if (numerators(predicted - 1) < 1) {
      predicted = 0
    }
  }
  {
    // gradient has dim of (classes - 1) * (x.length + 1)
    var i = 0
    var m1: Int = 0
    var l1: Int = 0
    while (i < (classes - 1) * (x.length + 1)) {
      m1 = i % (x.length + 1) // m1 == 0 is the intercept
      l1 = (i - m1) / (x.length + 1) // l1 + 1 is the class
      if (m1 == 0) {
        gradient(i) += (1 - alpha(yy)) * delta(yy, l1 + 1) - numerators(l1) / denominator
      } else {
        gradient(i) += ((1 - alpha(yy)) * delta(yy, l1 + 1) - numerators(l1) / denominator) * x(m1 - 1)
      }
      i += 1
    }
  }
  val loglike: Double = math.round(y).toInt match {
    case 0 => math.log(1.0 / denominator)
    case _ => math.log(numerators(math.round(y - 1).toInt) / denominator)
  }
  (loglike, predicted)
}

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Tue, May 13, 2014 at 4:08 AM, Debasish Das debasish.da...@gmail.com wrote:

Hi, Is there a PR for multinomial logistic regression which does one-vs-all, and how does it compare to the other possibilities? @dbtsai in your Strata presentation you used one-vs-all? Did you add some constraints on the fact that you penalize if mis-predicted labels are not very far from the true label? Thanks. Deb
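For reference, the pivot formulation that the gradient code implements (class 0 as the pivot, matching denominator = 1 + sum of the numerators) is:

```latex
P(y = k \mid x) = \frac{\exp(b_k + w_k \cdot x)}{1 + \sum_{i=1}^{K-1} \exp(b_i + w_i \cdot x)}, \quad k = 1, \dots, K-1
```

```latex
P(y = 0 \mid x) = \frac{1}{1 + \sum_{i=1}^{K-1} \exp(b_i + w_i \cdot x)}
```

The predicted class in the code is the argmax over these probabilities: class 0 wins exactly when every numerator exp(b_i + w_i . x) is below 1.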
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
There were a few early/test RCs this cycle that were never put to a vote. On Tue, May 13, 2014 at 8:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote: just curious, where is rc4 VOTE? I searched my gmail but didn't find that? On Tue, May 13, 2014 at 9:49 AM, Sean Owen so...@cloudera.com wrote: On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote: The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc5/ Good news is that the sigs, MD5 and SHA are all correct. Tiny note: the Maven artifacts use SHA1, while the binary artifacts use SHA512, which took me a bit of head-scratching to figure out. If another RC comes out, I might suggest making it SHA1 everywhere? But there is nothing wrong with these signatures and checksums. Now to look at the contents...
Re: Serializable different behavior Spark Shell vs. Scala Shell
Thank you for your investigation into this! Just for completeness, I've confirmed it's a problem only in the REPL, not in compiled Spark programs. But within the REPL, a direct consequence of the classes not being the same after serialization/deserialization is that lookup() doesn't work:

scala> class C(val s: String) extends Serializable {
     |   override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
     |   override def toString = s
     | }
defined class C

scala> val r = sc.parallelize(Array((new C("a"), 11), (new C("a"), 12)))
r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:14

scala> r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
       r.lookup(new C("a"))
                ^

On Tuesday, May 13, 2014 3:05 PM, Anand Avati av...@gluster.org wrote:

On Tue, May 13, 2014 at 8:26 AM, Michael Malak michaelma...@yahoo.com wrote: Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute isInstanceOf[]. I am using Spark 0.9.0/Scala 2.10.3. Is this a bug?
[quoted code samples elided; they are identical to the original message above]

Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and without the -Yrepl-class-based command line flag to the REPL. Spark's REPL has the effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the generated byte code, it appears -Yrepl-class-based results in the creation of a $outer field in the generated classes (including class C).
The first case match in equals() results in code along the lines of (simplified):

if (o.isInstanceOf[C] && this.$outer == that.$outer) { /* do string compare */ }

$outer is the synthetic field pointing to the outer object in which the object was created (in this case, the REPL environment). Now obviously, when x is taken through the byte stream and deserialized, it will have a new $outer created (it may have been deserialized in a different JVM or machine for all we know). So the $outers mismatching is expected. However, I'm still trying to understand why the outers need to be the same for the case match.
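The "compiled programs are fine" half of this can be checked outside any REPL: when C is a top-level class there is no synthetic $outer field, so the pattern-match equals survives a Java serialization round trip. A minimal self-contained sketch (my code, not from the thread):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

class C(val s: String) extends Serializable {
  // Pattern-match equals; safe here because a top-level class has no $outer.
  override def equals(o: Any): Boolean = o match {
    case that: C => that.s == s
    case _       => false
  }
  override def hashCode: Int = s.hashCode
}

object RoundTrip {
  // Serialize and deserialize through an in-memory buffer.
  def roundTrip[T](x: T): T = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(x)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val x = new C("a")
    val y = RoundTrip.roundTrip(x)
    println(x.equals(y)) // prints "true" when compiled outside the REPL
  }
}
```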
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
-1. The following bugs should be fixed:
https://issues.apache.org/jira/browse/SPARK-1817
https://issues.apache.org/jira/browse/SPARK-1712

------------------ Original ------------------
From: Patrick Wendell pwend...@gmail.com
Date: Wed, May 14, 2014 04:07 AM
To: dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Hey all - there were some earlier RCs that were not presented to the dev list because issues were found with them. Also, there seem to be some issues with the reliability of the dev list e-mail. Just a heads up. I'll lead with a +1 for this.

On Tue, May 13, 2014 at 8:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote: just curious, where is rc4 VOTE? I searched my gmail but didn't find that?

On Tue, May 13, 2014 at 9:49 AM, Sean Owen so...@cloudera.com wrote: On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote: The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc5/ Good news is that the sigs, MD5 and SHA are all correct. Tiny note: the Maven artifacts use SHA1, while the binary artifacts use SHA512, which took me a bit of head-scratching to figure out. If another RC comes out, I might suggest making it SHA1 everywhere? But there is nothing wrong with these signatures and checksums. Now to look at the contents...
Re: Serializable different behavior Spark Shell vs. Scala Shell
On Tue, May 13, 2014 at 8:26 AM, Michael Malak michaelma...@yahoo.com wrote: Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute isInstanceOf[]. I am using Spark 0.9.0/Scala 2.10.3. Is this a bug? [quoted code samples elided; they are identical to the original message above]

Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and without the -Yrepl-class-based command line flag to the REPL.
Spark's REPL has the effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the generated byte code, it appears -Yrepl-class-based results in the creation of a $outer field in the generated classes (including class C). The first case match in equals() results in code along the lines of (simplified):

if (o.isInstanceOf[C] && this.$outer == that.$outer) { /* do string compare */ }

$outer is the synthetic field pointing to the outer object in which the object was created (in this case, the REPL environment). Now obviously, when x is taken through the byte stream and deserialized, it will have a new $outer created (it may have been deserialized in a different JVM or machine for all we know). So the $outers mismatching is expected. However, I'm still trying to understand why the outers need to be the same for the case match.
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I just built rc5 on Windows 7 and tried to reproduce the problem described in https://issues.apache.org/jira/browse/SPARK-1712 It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took 4.814991993 s
res1: Double = 5.05E11

I used all defaults; no config files were changed. Not sure if that makes a difference...

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Preliminary Parquet numbers and including .count() in Catalyst
Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew

On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust mich...@databricks.com wrote:

It looks like currently the .count() on Parquet is handled incredibly inefficiently and all the columns are materialized. But if I select just the relevant column and then count, the column-oriented storage of Parquet really shines. There ought to be a potential optimization here such that a .count() on a SchemaRDD backed by Parquet doesn't require re-assembling the rows, as that's expensive. I don't think .count() is handled specially in SchemaRDDs, but it seems ripe for optimization.

Yeah, you are right. Thanks for pointing this out! If you call .count(), that is just the native Spark count, which is not aware of the potential optimizations. We could just override count() in a SchemaRDD to be something like groupBy()(Count(Literal(1))).collect().head.getInt(0)

Here is a JIRA: SPARK-1822 - SchemaRDD.count() should use the optimizer. https://issues.apache.org/jira/browse/SPARK-1822

Michael
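Michael's one-liner, written out as it might sit inside SchemaRDD (a sketch based on the expression in his message and the 1.0-era Catalyst DSL, not the committed SPARK-1822 patch):

```scala
// Inside SchemaRDD: route count() through the optimizer as an aggregate
// query, so a Parquet-backed relation can answer it from a single column
// instead of reassembling every row.
override def count(): Long =
  groupBy()(Count(Literal(1))).collect().head.getInt(0)
```

With this in place, the optimizer sees a Count aggregate and can apply the same column pruning that made the hand-tuned select-then-count fast in Andrew's numbers.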