Re: Requirements of objects stored in RDDs

2014-05-13 Thread Andrew Ash
An RDD can hold objects of any type.  If you generally think of it as a
distributed Collection, then you won't ever be that far off.

As far as serialization, the contents of an RDD must be serializable.
 There are two serialization libraries you can use with Spark: normal Java
serialization or Kryo serialization.  See
https://spark.apache.org/docs/latest/tuning.html#data-serialization for
more details.

If you are using Java serialization, then just implementing the Serializable
interface will work.  If you're using Kryo, then you'll also want to register
your classes with Kryo via a KryoRegistrator, as described in the tuning guide
linked above.
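For reference, enabling Kryo and registering classes looks roughly like this
(just a sketch -- MyRecord and MyRegistrator are placeholder names, not
anything from your code; the tuning guide above is the authoritative source):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

// Placeholder for whatever you actually store in the RDD.
class MyRecord(val id: Long, val payload: String)

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord])
  }
}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
val sc = new SparkContext(conf)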

The fact that it works fine in local mode and tests but fails on Mesos makes
me think there's an issue with the Mesos cluster deployment.  First, does it
work properly in standalone mode?  Second, how are you getting the Clojure
libraries onto the Mesos executors?  Are they included in your executor URI
bundle, or are you otherwise passing a parameter that points to the Clojure
jars?
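For reference, the two usual ways to get extra jars to the executors look
roughly like this (a sketch only -- every path, URI, and jar name below is a
placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://host:5050")                           // placeholder Mesos master
  .setAppName("clojure-dsl-job")
  // Option 1: bundle the Clojure jars into the Spark distribution that the
  // executors download via spark.executor.uri.
  .set("spark.executor.uri", "hdfs:///tmp/spark-dist.tgz")
  // Option 2: list them explicitly so Spark ships them to the executors.
  .setJars(Seq("/path/to/clojure.jar", "/path/to/my-dsl.jar"))
val sc = new SparkContext(conf)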

Cheers,
Andrew


On Thu, May 8, 2014 at 9:55 AM, Soren Macbeth so...@yieldbot.com wrote:

 Hi,

 What are the requirements of objects that are stored in RDDs?

 I'm still struggling with an exception I've already posted about several
 times. My questions are:

 1) What interfaces are objects stored in RDDs expected to implement, if
 any?
 2) Are collections (be they scala, java or otherwise) handled differently
 than other objects?

 The bug I'm hitting is when I try to use my clojure DSL (which wraps the
 java api) with clojure collections, specifically
 clojure.lang.PersistentVectors in my RDDs. Here is the exception message:

 org.apache.spark.SparkException: Job aborted: Exception while deserializing
 and fetching task: com.esotericsoftware.kryo.KryoException:
 java.lang.IllegalArgumentException: Can not set final
 scala.collection.convert.Wrappers field
 scala.collection.convert.Wrappers$SeqWrapper.$outer to
 clojure.lang.PersistentVector

 Now, this same application works fine in local mode and tests, but it fails
 when run under mesos. That would seem to me to point to something around
 RDD partitioning for tasks, but I'm not sure.

 I don't know much Scala, but according to Google, SeqWrapper is part of the
 implicit JavaConversions functionality of Scala collections. Under what
 circumstances would Spark be trying to wrap my RDD objects in Scala
 collections?

 Finally - I'd like to point out that this is not a serialization issue with
 my clojure collection objects. I have registered serializers for them and
 have verified they serialize and deserialize perfectly well in spark.

 One last note is that this failure occurs after all the tasks have finished
 for a reduce stage and the results are returned to the driver.

 TIA



Re: Updating docs for running on Mesos

2014-05-13 Thread Andrew Ash
As far as I know, the upstream doesn't release binaries, only source code.
 The downloads page https://mesos.apache.org/downloads/ for 0.18.0 only
has a source tarball.  Is there a binary release somewhere from Mesos that
I'm missing?


On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell pwend...@gmail.com wrote:

 Andrew,

 Updating these docs would be great! I think this would be a welcome change.

 In terms of packaging, it would be good to mention the binaries
 produced by the upstream project as well, in addition to Mesosphere.

 - Patrick

 On Thu, May 8, 2014 at 12:51 AM, Andrew Ash and...@andrewash.com wrote:
  The docs for how to run Spark on Mesos have changed very little since
  0.6.0, but setting it up is much easier now than then.  Does it make
 sense
  to revamp with the below changes?
 
 
  You no longer need to build mesos yourself as pre-built versions are
  available from Mesosphere: http://mesosphere.io/downloads/
 
  And the instructions guide you towards compiling your own distribution of
  Spark, when you can use the prebuilt versions of Spark as well.
 
 
  I'd like to split that portion of the documentation into two sections, a
  build-from-scratch section and a use-prebuilt section.  The new outline
  would look something like this:
 
 
  *Running Spark on Mesos*
 
  Installing Mesos
  - using prebuilt (recommended)
   - pointer to mesosphere's packages
  - from scratch
   - (similar to current)
 
 
  Connecting Spark to Mesos
  - loading distribution into an accessible location
  - Spark settings
 
  Mesos Run Modes
  - (same as current)
 
  Running Alongside Hadoop
  - (trim this down)
 
 
 
  Does that work for people?
 
 
  Thanks!
  Andrew
 
 
  PS Basically all the same:
 
  http://spark.apache.org/docs/0.6.0/running-on-mesos.html
  http://spark.apache.org/docs/0.6.2/running-on-mesos.html
  http://spark.apache.org/docs/0.7.3/running-on-mesos.html
  http://spark.apache.org/docs/0.8.1/running-on-mesos.html
  http://spark.apache.org/docs/0.9.1/running-on-mesos.html
 
 https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html



Re: Kryo not default?

2014-05-13 Thread Reynold Xin
The main reason is that it doesn't always work (e.g. sometimes an application
program already has special serialization / externalization written for Java
that doesn't work in Kryo).
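As a small illustration of that caveat (the class here is purely made up): a
class that depends on Java serialization hooks like writeObject/readObject
gets those hooks run under Java serialization, but Kryo's default
field-by-field serializer ignores them, so the object can come back broken
unless you register a dedicated serializer for it:

class CachedLookup(val path: String) extends java.io.Serializable {
  // Not meant to travel over the wire; rebuilt on deserialization.
  @transient var handle: java.io.RandomAccessFile = null

  private def writeObject(out: java.io.ObjectOutputStream): Unit = {
    out.defaultWriteObject()                 // invoked by Java serialization only
  }

  private def readObject(in: java.io.ObjectInputStream): Unit = {
    in.defaultReadObject()
    handle = new java.io.RandomAccessFile(path, "r")   // re-opened here
  }
  // Kryo's default FieldSerializer never calls readObject, so after a Kryo
  // round trip 'handle' stays null unless a custom serializer is registered.
}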

On Mon, May 12, 2014 at 5:47 PM, Anand Avati av...@gluster.org wrote:

 Hi,
 Can someone share the reason why Kryo serializer is not the default? Is
 there anything to be careful about (because of which it is not enabled by
 default)?
 Thanks!



Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Reynold Xin
Thanks for the experiments and analysis!

I think Michael already submitted a patch that avoids scanning all columns
for count(*) or count(1).


On Mon, May 12, 2014 at 9:46 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Spark devs,

 First of all, huge congrats on the parquet integration with SparkSQL!  This
 is an incredible direction forward and something I can see being very
 broadly useful.

 I was doing some preliminary tests to see how it works with one of my
 workflows, and wanted to share some numbers that people might want to know
 about.

 I also wanted to point out that .count() doesn't seem integrated with the
 rest of the optimization framework, and some big gains could be possible.


 So, the numbers:

 I took a table extracted from a SQL database and stored in HDFS:

- 115 columns (several always-empty, mostly strings, some enums, some
numbers)
- 253,887,080 rows
- 182,150,295,881 bytes (raw uncompressed)
- 42,826,820,222 bytes (lzo compressed with .index file)

 And I converted it to Parquet using SparkSQL's SchemaRDD.saveAsParquetFile()
 call:

- Converting from .lzo in HDFS to .parquet in HDFS took 635s using 42
cores across 4 machines
- 17,517,922,117 bytes (parquet per SparkSQL defaults)

 So storing in parquet format vs lzo compresses the data down to less than
 50% of the .lzo size, and under 10% of the raw uncompressed size.  Nice!
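
 For anyone curious, the conversion itself is only a few lines.  A rough
 sketch -- the Record case class, the delimiter, and the lzoFile helper are
 stand-ins for my actual 115-column ingest:

 import org.apache.spark.sql.SQLContext

 val sqlContext = new SQLContext(sc)
 import sqlContext.createSchemaRDD       // implicit RDD[case class] -> SchemaRDD

 case class Record(id: String, col1: String, col2: String)

 val rows = lzoFile("/path/to/lzo")      // placeholder loader for the .lzo input
   .map(_.split("\\|", -1))
   .map(f => Record(f(0), f(1), f(2)))

 rows.saveAsParquetFile("/path/to/parquet")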


 I then did some basic interactions on it:

 *Row count*

- LZO
   - lzoFile("/path/to/lzo").count
   - 31.632305953s
- Parquet
   - sqlContext.parquetFile("/path/to/parquet").count
   - 289.129487003s

 Reassembling rows from the separate column storage is clearly really
 expensive.  Median task length is 33s vs 4s, and of that 33s in each task
 (319 tasks total) about 1.75 seconds are spent in GC (inefficient object
 allocation?)



 *Count number of rows with a particular key:*

- LZO
   - lzoFile("/path/to/lzo").filter(_.split("\\|")(0) == "1234567890").count
   - 73.988897511s
- Parquet
   - sqlContext.parquetFile("/path/to/parquet").where('COL === "1234567890").count
   - 293.410470418s
- Parquet (hand-tuned to count on just one column)
   - sqlContext.parquetFile("/path/to/parquet").where('COL === "1234567890").select('IDCOL).count
   - 1.160449187s

 It looks like currently the .count() on parquet is handled incredibly
 inefficiently and all the columns are materialized.  But if I select just
 that relevant column and then count, then the column-oriented storage of
 Parquet really shines.

 There ought to be a potential optimization here such that a .count() on a
 SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
 that's expensive.  I don't think .count() is handled specially in
 SchemaRDDs, but it seems ripe for optimization.


 *Count number of distinct values in a column*

- LZO
   - lzoFile("/path/to/lzo").map(sel(0)).distinct.count
   - 115.582916866s
- Parquet
   - sqlContext.parquetFile("/path/to/parquet").select('COL).distinct.count
   - 16.839004826s

 It turns out column selectivity is very useful!  I'm guessing that if I
 could get byte counts read out of HDFS, that would just about match up with
 the difference in read times.




 Any thoughts on how to embed the knowledge of my hand-tuned additional
 .select('IDCOL)
 into Catalyst?


 Thanks again for all the hard work and prep for the 1.0 release!

 Andrew



Is this supported? : Spark on Windows, Hadoop YARN on Linux.

2014-05-13 Thread innowireless TaeYun Kim
I'm trying to run spark-shell on Windows against Hadoop YARN running on Linux.
Specifically, the environment is as follows:

- Client
  - OS: Windows 7
  - Spark version: 1.0.0-SNAPSHOT (git cloned 2014.5.8)
- Server
  - Platform: hortonworks sandbox 2.1

I had to modify the Spark source code to apply
https://issues.apache.org/jira/browse/YARN-1824 so that the cross-platform
issues could be addressed (that is, change $() to $$() and File.pathSeparator
to ApplicationConstants.CLASS_PATH_SEPARATOR).
Seeing this, I suspect that the Spark code is not currently prepared to
support cross-platform submission, that is, Spark on Windows -> Hadoop YARN on
Linux.
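
For context, the change is roughly of this shape (a simplified sketch, not
the actual Spark patch -- the real edits are in the YARN client code, and the
jar name below is a placeholder):

import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Before: the variable is expanded and the separator chosen on the *client*,
// which breaks when a Windows client submits to a Linux cluster.
val oldClasspath = Seq(Environment.PWD.$(), Environment.PWD.$() + "/extra.jar")
  .mkString(java.io.File.pathSeparator)

// After (with the YARN-1824 APIs): the node manager expands the $$()
// placeholders and substitutes its own OS's separator for <CPS>.
val newClasspath = Seq(Environment.PWD.$$(), Environment.PWD.$$() + "/extra.jar")
  .mkString(ApplicationConstants.CLASS_PATH_SEPARATOR)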

Anyway, after the modification and some configuration tweaks, the yarn-client
mode spark-shell submitted from Windows 7 at least seems to try to start, but
the ApplicationMaster fails to register.
The YARN server log is as follows ('owner' is the user name of the Windows 7
machine):

Log Type: stderr
Log Length: 1356
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
14/05/12 01:13:54 INFO YarnSparkHadoopUtil: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties
14/05/12 01:13:54 INFO SecurityManager: Changing view acls to: yarn,owner
14/05/12 01:13:54 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(yarn, owner)
14/05/12 01:13:55 INFO Slf4jLogger: Slf4jLogger started
14/05/12 01:13:56 INFO Remoting: Starting remoting
14/05/12 01:13:56 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkyar...@sandbox.hortonworks.com:47074]
14/05/12 01:13:56 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkyar...@sandbox.hortonworks.com:47074]
14/05/12 01:13:56 INFO RMProxy: Connecting to ResourceManager at
/0.0.0.0:8030
14/05/12 01:13:56 INFO ExecutorLauncher: ApplicationAttemptId:
appattempt_1399856448891_0018_01
14/05/12 01:13:56 INFO ExecutorLauncher: Registering the ApplicationMaster
14/05/12 01:13:56 WARN Client: Exception encountered while connecting to the
server : org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]

How can I handle this error?
Or, should I give up and use Linux for my client machine?
(I want to use Windows as the client, since it's more comfortable for me to
develop applications there.)
BTW, I'm a newbie with Spark and Hadoop.

Thanks in advance.




Re: Updating docs for running on Mesos

2014-05-13 Thread Matei Zaharia
I’ll ask the Mesos folks about this. Unfortunately it might be tough to link 
only to a company’s builds; but we can perhaps include them in addition to 
instructions for building Mesos from Apache.

Matei

On May 12, 2014, at 11:55 PM, Gerard Maas gerard.m...@gmail.com wrote:

 Andrew,
 
 Mesosphere has binary releases here:
 http://mesosphere.io/downloads/
 
 (Anecdote: I actually burned out a CPU building Mesos from source. No kidding -
 it had been coming, as the laptop was crashing from time to time, but the Mesos
 build was the one drop too much.)
 
 kr, Gerard.
 
 
 
 On Tue, May 13, 2014 at 6:57 AM, Andrew Ash and...@andrewash.com wrote:
 
 As far as I know, the upstream doesn't release binaries, only source code.
 The downloads page https://mesos.apache.org/downloads/ for 0.18.0 only
 has a source tarball.  Is there a binary release somewhere from Mesos that
 I'm missing?
 
 
 On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
 Andrew,
 
 Updating these docs would be great! I think this would be a welcome
 change.
 
 In terms of packaging, it would be good to mention the binaries
 produced by the upstream project as well, in addition to Mesosphere.
 
 - Patrick
 
 On Thu, May 8, 2014 at 12:51 AM, Andrew Ash and...@andrewash.com
 wrote:
 The docs for how to run Spark on Mesos have changed very little since
 0.6.0, but setting it up is much easier now than then.  Does it make
 sense
 to revamp with the below changes?
 
 
 You no longer need to build mesos yourself as pre-built versions are
 available from Mesosphere: http://mesosphere.io/downloads/
 
 And the instructions guide you towards compiling your own distribution
 of
 Spark, when you can use the prebuilt versions of Spark as well.
 
 
 I'd like to split that portion of the documentation into two sections,
 a
 build-from-scratch section and a use-prebuilt section.  The new outline
 would look something like this:
 
 
 *Running Spark on Mesos*
 
 Installing Mesos
 - using prebuilt (recommended)
 - pointer to mesosphere's packages
 - from scratch
 - (similar to current)
 
 
 Connecting Spark to Mesos
 - loading distribution into an accessible location
 - Spark settings
 
 Mesos Run Modes
 - (same as current)
 
 Running Alongside Hadoop
 - (trim this down)
 
 
 
 Does that work for people?
 
 
 Thanks!
 Andrew
 
 
 PS Basically all the same:
 
 http://spark.apache.org/docs/0.6.0/running-on-mesos.html
 http://spark.apache.org/docs/0.6.2/running-on-mesos.html
 http://spark.apache.org/docs/0.7.3/running-on-mesos.html
 http://spark.apache.org/docs/0.8.1/running-on-mesos.html
 http://spark.apache.org/docs/0.9.1/running-on-mesos.html
 
 
 https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
 
 



Re: Kryo not default?

2014-05-13 Thread Dmitriy Lyubimov
On Mon, May 12, 2014 at 2:47 PM, Anand Avati av...@gluster.org wrote:

 Hi,
 Can someone share the reason why Kryo serializer is not the default?

Why should it be?

On top of that, the only way to serialize a closure to the backend (even now)
is Java serialization (which means Java serialization is required of all
closure attributes).


 Is
 there anything to be careful about (because of which it is not enabled by
 default)?


Yes, and it kind of stems from the above. There are still a number of API
calls that use closure attributes to serialize data to the backend (see
fold(), for example), which means that even if you enable Kryo, some APIs
still require Java serialization of an attribute.

I fixed parallelize(), collect() and something else I don't remember in that
regard, but I think even now there are still a number of APIs lingering whose
data parameters wouldn't work with Kryo.
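
A small sketch of what that means in practice (names here are made up): even
with Kryo turned on, fold()'s zero value travels inside the Java-serialized
task closure, so the element type still has to be Java-serializable:

import org.apache.spark.{SparkConf, SparkContext}

// Must still be java.io.Serializable (case classes are), because fold()'s
// zero value is captured in the closure, and closures always go through
// Java serialization regardless of spark.serializer.
case class Acc(sum: Long)

object FoldClosureDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("fold-closure-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val total = sc.parallelize(1 to 100)
      .map(i => Acc(i))
      .fold(Acc(0))((a, b) => Acc(a.sum + b.sum))   // Acc(0) rides in the closure

    println(total)   // Acc(5050)
    sc.stop()
  }
}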


 Thanks!



Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user:

I'm seeing different Serializable behavior in the Spark Shell vs. the Scala
Shell. In the Spark Shell, equals() fails when I use the canonical equals()
pattern of match{}, but works when I substitute with isInstanceOf[]. I am
using Spark 0.9.0/Scala 2.10.3.

Is this a bug?

Spark Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = false

Spark Shell (equals uses isInstanceOf[])
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true

Scala Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true


Re: Multinomial Logistic Regression

2014-05-13 Thread DB Tsai
Hi Deb,

For K possible outcomes in multinomial logistic regression,  we can have
K-1 independent binary logistic regression models, in which one outcome is
chosen as a pivot and then the other K-1 outcomes are separately
regressed against the pivot outcome. See my presentation for technical
detail http://www.slideshare.net/dbtsai/2014-0501-mlor

Since mllib only supports one linear model per classification model, there
will be some infrastructure work to support MLOR in mllib. But if you want
to implement yourself with the L-BFGS solver in mllib, you can follow the
equation in my slide, and it will not be too difficult.

I can give you the gradient method for multinomial logistic regression; you
just need to put the K-1 intercepts in the right place.

  def computeGradient(y: Double, x: Array[Double], lambda: Double,
      w: Array[Array[Double]], b: Array[Double],
      gradient: Array[Double]): (Double, Int) = {
    val classes = b.length + 1
    val yy = y.toInt

    def alpha(i: Int): Int = {
      if (i == 0) 1 else 0
    }

    def delta(i: Int, j: Int): Int = {
      if (i == j) 1 else 0
    }

    var denominator: Double = 1.0
    val numerators: Array[Double] = Array.ofDim[Double](b.length)

    var predicted = 1

    {
      var i = 0
      var j = 0
      var acc: Double = 0
      while (i < b.length) {
        acc = b(i)
        j = 0
        while (j < x.length) {
          acc += x(j) * w(i)(j)
          j += 1
        }
        numerators(i) = math.exp(acc)
        if (i > 0 && numerators(i) > numerators(predicted - 1)) {
          predicted = i + 1
        }
        denominator += numerators(i)
        i += 1
      }
      if (numerators(predicted - 1) < 1) {
        predicted = 0
      }
    }

    {
      // gradient has dim of (classes - 1) * (x.length + 1)
      var i = 0
      var m1: Int = 0
      var l1: Int = 0
      while (i < (classes - 1) * (x.length + 1)) {
        m1 = i % (x.length + 1) // m1 == 0 is the intercept
        l1 = (i - m1) / (x.length + 1) // l1 + 1 is the class
        if (m1 == 0) {
          gradient(i) += (1 - alpha(yy)) * delta(yy, l1 + 1) -
            numerators(l1) / denominator
        } else {
          gradient(i) += ((1 - alpha(yy)) * delta(yy, l1 + 1) -
            numerators(l1) / denominator) * x(m1 - 1)
        }
        i += 1
      }
    }
    val loglike: Double = math.round(y).toInt match {
      case 0 => math.log(1.0 / denominator)
      case _ => math.log(numerators(math.round(y - 1).toInt) / denominator)
    }
    (loglike, predicted)
  }



Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, May 13, 2014 at 4:08 AM, Debasish Das debasish.da...@gmail.com wrote:

 Hi,

 Is there a PR for multinomial logistic regression which does one-vs-all and
 compares it to the other possibilities?

 @dbtsai, in your Strata presentation you used one-vs-all? Did you add some
 constraints on the fact that you penalize if mis-predicted labels are not
 very far from the true label?

 Thanks.
 Deb



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Mark Hamstra
There were a few early/test RCs this cycle that were never put to a vote.


On Tue, May 13, 2014 at 8:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 just curious, where is rc4 VOTE?

 I searched my gmail but didn't find that?




 On Tue, May 13, 2014 at 9:49 AM, Sean Owen so...@cloudera.com wrote:

  On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~pwendell/spark-1.0.0-rc5/
 
  Good news is that the sigs, MD5 and SHA are all correct.
 
  Tiny note: the Maven artifacts use SHA1, while the binary artifacts
  use SHA512, which took me a bit of head-scratching to figure out.
 
  If another RC comes out, I might suggest making it SHA1 everywhere?
  But there is nothing wrong with these signatures and checksums.
 
  Now to look at the contents...
 



Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Thank you for your investigation into this!

Just for completeness, I've confirmed it's a problem only in the REPL, not in
compiled Spark programs.

But within the REPL, a direct consequence of classes not being the same after
serialization/deserialization is that lookup() doesn't work:

scala> class C(val s:String) extends Serializable {
     |   override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
     |   override def toString = s
     | }
defined class C

scala> val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize
at <console>:14

scala> r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
              r.lookup(new C("a"))
                       ^



On Tuesday, May 13, 2014 3:05 PM, Anand Avati av...@gluster.org wrote:

On Tue, May 13, 2014 at 8:26 AM, Michael Malak michaelma...@yahoo.com wrote:

Reposting here on dev since I didn't see a response on user:

I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In 
the Spark Shell, equals() fails when I use the canonical equals() pattern of 
match{}, but works when I substitute with isInstanceOf[]. I am using Spark 
0.9.0/Scala 2.10.3.

Is this a bug?

Spark Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = false

Spark Shell (equals uses isInstanceOf[])
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true

Scala Shell (equals uses match{})
=

class C(val s:String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}

val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x);
val b = bos.toByteArray();
out.close
bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)

res: Boolean = true



Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and
without the -Yrepl-class-based command line flag to the repl. Spark's REPL has
the effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the byte
code generated, it appears -Yrepl-class-based results in the creation of a
$outer field in the generated classes (including class C). The first case
match in equals() results in code along the lines of (simplified):

if (o isinstanceof C && this.$outer == that.$outer) { // do string compare // }

$outer is the synthetic field pointing to the outer object in which the object
was created (in this case, the repl environment). Now obviously, when x is
taken through the bytestream and deserialized, it would have a new $outer
created (it may have been deserialized in a different jvm or machine for all we
know). So the $outer mismatch is expected. However I'm still trying to
understand why the outers need to be the same for the case match.


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread witgo
-1
The following bugs should be fixed:
https://issues.apache.org/jira/browse/SPARK-1817
https://issues.apache.org/jira/browse/SPARK-1712


-- Original --
From:  Patrick Wendell;pwend...@gmail.com;
Date:  Wed, May 14, 2014 04:07 AM
To:  dev@spark.apache.orgdev@spark.apache.org; 

Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)



Hey all - there were some earlier RCs that were not presented to the
dev list because issues were found with them. Also, there seem to be
some issues with the reliability of the dev list e-mail. Just a heads
up.

I'll lead with a +1 for this.

On Tue, May 13, 2014 at 8:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
 just curious, where is rc4 VOTE?

 I searched my gmail but didn't find that?




 On Tue, May 13, 2014 at 9:49 AM, Sean Owen so...@cloudera.com wrote:

 On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.0.0-rc5/

 Good news is that the sigs, MD5 and SHA are all correct.

 Tiny note: the Maven artifacts use SHA1, while the binary artifacts
 use SHA512, which took me a bit of head-scratching to figure out.

 If another RC comes out, I might suggest making it SHA1 everywhere?
 But there is nothing wrong with these signatures and checksums.

 Now to look at the contents...


Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Anand Avati
On Tue, May 13, 2014 at 8:26 AM, Michael Malak michaelma...@yahoo.com wrote:

 Reposting here on dev since I didn't see a response on user:

 I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell.
 In the Spark Shell, equals() fails when I use the canonical equals()
 pattern of match{}, but works when I substitute with isInstanceOf[]. I am
 using Spark 0.9.0/Scala 2.10.3.

 Is this a bug?

 Spark Shell (equals uses match{})
 =

 class C(val s:String) extends Serializable {
   override def equals(o: Any) = o match {
     case that: C => that.s == s
     case _ => false
   }
 }

 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)

 res: Boolean = false

 Spark Shell (equals uses isInstanceOf[])
 =

 class C(val s:String) extends Serializable {
   override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
 }

 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)

 res: Boolean = true

 Scala Shell (equals uses match{})
 =

 class C(val s:String) extends Serializable {
   override def equals(o: Any) = o match {
     case that: C => that.s == s
     case _ => false
   }
 }

 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)

 res: Boolean = true



Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and
without the -Yrepl-class-based command line flag to the repl. Spark's REPL has
the effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the byte
code generated, it appears -Yrepl-class-based results in the creation of a
$outer field in the generated classes (including class C). The first case
match in equals() results in code along the lines of (simplified):

if (o isinstanceof C && this.$outer == that.$outer) { // do string compare // }

$outer is the synthetic field pointing to the outer object in which the object
was created (in this case, the repl environment). Now obviously, when x is
taken through the bytestream and deserialized, it would have a new $outer
created (it may have been deserialized in a different jvm or machine for all we
know). So the $outer mismatch is expected. However I'm still trying to
understand why the outers need to be the same for the case match.
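
A minimal way to see the same effect outside Spark and outside the REPL
(purely illustrative -- nothing here is Spark-specific): make C an inner
class, so the typed pattern carries the compiler-generated outer test, and
compare instances created under different outer instances, which is
essentially what deserialization produces in a class-based-wrapper REPL.

class Wrapper {                        // stands in for the REPL's wrapper class
  class C(val s: String) extends Serializable {
    override def equals(o: Any): Boolean = o match {
      case that: C => that.s == s      // for an inner class, this also checks $outer
      case _       => false
    }
  }
}

object OuterTestDemo {
  def main(args: Array[String]): Unit = {
    val w1 = new Wrapper
    val w2 = new Wrapper
    val a = new w1.C("a")
    val b = new w2.C("a")              // same contents, different outer instance
    // Expected to print false on 2.10/2.11: the generated outer test fails,
    // the same mismatch a deserialized object hits in the Spark shell.
    println(a.equals(b))
  }
}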


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Madhu
I just built rc5 on Windows 7 and tried to reproduce the problem described in

https://issues.apache.org/jira/browse/SPARK-1712

It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took
4.814991993 s
res1: Double = 5.05E11

I used all defaults, no config files were changed.
Not sure if that makes a difference...





Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket.

Cheers!
Andrew


On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust mich...@databricks.com wrote:

 
  It looks like currently the .count() on parquet is handled incredibly
  inefficiently and all the columns are materialized.  But if I select just
  that relevant column and then count, then the column-oriented storage of
  Parquet really shines.
 
  There ought to be a potential optimization here such that a .count() on a
  SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
  that's expensive.  I don't think .count() is handled specially in
  SchemaRDDs, but it seems ripe for optimization.
 

 Yeah, you are right.  Thanks for pointing this out!

 If you call .count() that is just the native Spark count, which is not
 aware of the potential optimizations.  We could just override count() in a
 schema RDD to be something like
 groupBy()(Count(Literal(1))).collect().head.getInt(0)

 Here is a JIRA: SPARK-1822 - SchemaRDD.count() should use the optimizer.
 https://issues.apache.org/jira/browse/SPARK-1822

 Michael