Suggestion: RDD cache depth

2014-05-29 Thread innowireless TaeYun Kim
It would be nice if the RDD cache() method incorporated depth information.

That is,

 

void test()
{

JavaRDD<...> rdd = ...;

rdd.cache();  // to depth 1. Actual caching happens.

rdd.cache();  // to depth 2. No-op as long as the storage level is the same;
otherwise, an exception.

...

rdd.uncache();  // to depth 1. No-op.

rdd.uncache();  // to depth 0. Actual unpersist happens.

}

 

This can be useful when writing code in a modular way.

When a function receives an RDD as an argument, it doesn't necessarily know
the cache status of that RDD, yet it may want to cache it since it will use
it multiple times.

With the current RDD API, however, the function cannot determine whether it
should unpersist the RDD afterwards or leave it alone (so that the caller can
continue to use it without recomputing it).
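
In the meantime, here is a minimal sketch of a caller-side pattern that
approximates this with the existing API (the withCaching helper itself is
hypothetical; getStorageLevel, cache and unpersist are existing RDD methods):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Only cache if the caller has not already done so, and only unpersist what
// this function cached itself, so the caller's cached RDD is left alone.
def withCaching[T, U](rdd: RDD[T])(body: RDD[T] => U): U = {
  val alreadyCached = rdd.getStorageLevel != StorageLevel.NONE
  if (!alreadyCached) rdd.cache()
  try body(rdd)
  finally { if (!alreadyCached) rdd.unpersist() }
}

A depth or reference count on the RDD itself would make this bookkeeping
unnecessary in each library function.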

 

Thanks.

 



GraphX triplets on 5-node graph

2014-05-29 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I 
missing something fundamental?


val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5")))
val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4")))
Graph(nodes, edges).triplets.collect 
res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] = 
Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4), 
((3,N3),(5,N5),E4))


Re: Suggestion: RDD cache depth

2014-05-29 Thread Matei Zaharia
This is a pretty cool idea — instead of cache depth I’d call it something like 
reference counting. Would you mind opening a JIRA issue about it?

The issue of really composing together libraries that use RDDs nicely isn’t 
fully explored, but this is certainly one thing that would help with it. I’d 
love to look at other ones too, e.g. how to allow libraries to share scans over 
the same dataset.

Unfortunately using multiple cache() calls for this is probably not feasible 
because it would change the current meaning of multiple calls. But we can add a 
new API, or a parameter to the method.

Matei

On May 28, 2014, at 11:46 PM, innowireless TaeYun Kim 
taeyun@innowireless.co.kr wrote:

 It would be nice if the RDD cache() method incorporate a depth information.
 
 That is,
 
 
 
 void test()
 {
 
 JavaRDD<...> rdd = ...;
 
 
 
 rdd.cache();  // to depth 1. actual caching happens.
 
 rdd.cache();  // to depth 2. Nop as long as the storage level is the same.
 Else, exception.
 
 .
 
 rdd.uncache();  // to depth 1. Nop.
 
 rdd.uncache();  // to depth 0. Actual unpersist happens.
 
 }
 
 
 
 This can be useful when writing code in modular way.
 
 When a function receives an rdd as an argument, it doesn't necessarily know
 the cache status of the rdd.
 
 But it could want to cache the rdd, since it will use the rdd multiple
 times.
 
 But with the current RDD API, it cannot determine whether it should
 unpersist it or leave it alone (so that caller can continue to use that rdd
 without rebuilding).
 
 
 
 Thanks.
 
 
 



Re: GraphX triplets on 5-node graph

2014-05-29 Thread Reynold Xin
Take a look at this one: https://issues.apache.org/jira/browse/SPARK-1188

It was an optimization that added user inconvenience. We got rid of that
now in Spark 1.0.
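
For anyone stuck on 0.9 in the meantime, a possible workaround sketch
(assuming the symptom is the triplet-object reuse described in SPARK-1188):
copy the fields out of each EdgeTriplet before collecting, instead of
collecting the reused triplet objects themselves.

Graph(nodes, edges).triplets
  .map(t => (t.srcId, t.srcAttr, t.dstId, t.dstAttr, t.attr))
  .collect()
// should yield all four edges, e.g. (1,N1,2,N2,E1), (1,N1,3,N3,E2),
// (2,N2,4,N4,E3), (3,N3,5,N5,E4)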



On Wed, May 28, 2014 at 11:48 PM, Michael Malak michaelma...@yahoo.comwrote:

 Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or
 am I missing something fundamental?


 val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5")))
 val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4")))
 Graph(nodes, edges).triplets.collect
 res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] =
 Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4),
 ((3,N3),(5,N5),E4))



RE: Suggestion: RDD cache depth

2014-05-29 Thread innowireless TaeYun Kim
Opened a JIRA issue. (https://issues.apache.org/jira/browse/SPARK-1962)

Thanks.


-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com] 
Sent: Thursday, May 29, 2014 3:54 PM
To: dev@spark.apache.org
Subject: Re: Suggestion: RDD cache depth

This is a pretty cool idea - instead of cache depth I'd call it something
like reference counting. Would you mind opening a JIRA issue about it?

The issue of really composing together libraries that use RDDs nicely isn't
fully explored, but this is certainly one thing that would help with it. I'd
love to look at other ones too, e.g. how to allow libraries to share scans
over the same dataset.

Unfortunately using multiple cache() calls for this is probably not feasible
because it would change the current meaning of multiple calls. But we can
add a new API, or a parameter to the method.

Matei

On May 28, 2014, at 11:46 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:

 It would be nice if the RDD cache() method incorporate a depth
information.
 
 That is,
 
 
 
 void test()
 {
 
 JavaRDD<...> rdd = ...;
 
 
 
 rdd.cache();  // to depth 1. actual caching happens.
 
 rdd.cache();  // to depth 2. Nop as long as the storage level is the same.
 Else, exception.
 
 .
 
 rdd.uncache();  // to depth 1. Nop.
 
 rdd.uncache();  // to depth 0. Actual unpersist happens.
 
 }
 
 
 
 This can be useful when writing code in modular way.
 
 When a function receives an rdd as an argument, it doesn't necessarily 
 know the cache status of the rdd.
 
 But it could want to cache the rdd, since it will use the rdd multiple 
 times.
 
 But with the current RDD API, it cannot determine whether it should 
 unpersist it or leave it alone (so that caller can continue to use 
 that rdd without rebuilding).
 
 
 
 Thanks.
 
 
 



Re: LogisticRegression: Predicting continuous outcomes

2014-05-29 Thread Bharath Ravi Kumar
Xiangrui, Christopher,

Thanks for responding. I'll go through the code in detail to evaluate whether
the loss function used is suitable for our dataset. I'll also go through the
referenced paper since I was unaware of the underlying theory. Thanks again.

-Bharath


On Thu, May 29, 2014 at 8:16 AM, Christopher Nguyen c...@adatao.com wrote:

 Bharath, (apologies if you're already familiar with the theory): the
 proposed approach may or may not be appropriate depending on the overall
 transfer function in your data. In general, a single logistic regressor
 cannot approximate arbitrary non-linear functions (of linear combinations
 of the inputs). You can review works by, e.g., Hornik and Cybenko in the
 late 80's to see if you need something more, such as a simple, one
 hidden-layer neural network.

 This is a good summary:

 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.2647&rep=rep1&type=pdf

 --
 Christopher T. Nguyen
 Co-founder & CEO, Adatao http://adatao.com
 linkedin.com/in/ctnguyen



 On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar reachb...@gmail.com
 wrote:

  I'm looking to reuse the LogisticRegression model (with SGD) to predict a
  real-valued outcome variable. (I understand that logistic regression is
  generally applied to predict binary outcome, but for various reasons,
 this
  model suits our needs better than LinearRegression). Related to that I
 have
  the following questions:
 
  1) Can the current LogisticRegression model be used as is to train based
 on
  binary input (i.e. explanatory) features, or is there an assumption that
  the explanatory features must be continuous?
 
  2) I intend to reuse the current class to train a model on LabeledPoints
  where the label is a real value (and not 0 / 1). I'd like to know if
  invoking setValidateData(false) would suffice or if one must override the
  validator to achieve this.
 
  3) I recall seeing an experimental method on the class (
 
 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
  )
  that clears the threshold separating positive & negative predictions.
 Once
  the model is trained on real valued labels, would clearing this flag
  suffice to predict an outcome that is continuous in nature?
 
  Thanks,
  Bharath
 
  P.S: I'm writing to dev@ and not user@ assuming that lib changes might
 be
  necessary. Apologies if the mailing list is incorrect.
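
For questions 2 and 3, a minimal sketch of the pattern being asked about,
assuming the existing setValidateData and clearThreshold methods are enough
(sc is a SparkContext; the data is made up for illustration, and whether the
logistic loss is appropriate for real-valued labels is the separate question
discussed above):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical training data with real-valued labels rather than 0/1.
val training = sc.parallelize(Seq(
  LabeledPoint(0.2, Vectors.dense(1.0, 0.0)),
  LabeledPoint(0.7, Vectors.dense(0.0, 1.0))))

val algo = new LogisticRegressionWithSGD()
algo.setValidateData(false)   // skip the binary-label check (question 2)
val model = algo.run(training)

model.clearThreshold()        // emit raw sigmoid scores instead of 0/1 labels (question 3)
val scores = training.map(p => model.predict(p.features))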
 



Please change instruction about Launching Applications Inside the Cluster

2014-05-29 Thread Lizhengbing (bing, BIPA)
The instruction address is in 
http://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster
 or 
http://spark.apache.org/docs/0.9.1/spark-standalone.html#launching-applications-inside-the-cluster

Origin instruction is:
./bin/spark-class org.apache.spark.deploy.Client launch
   [client-options] \
   <cluster-url> <application-jar-url> <main-class> \
   [application-options]

If I follow this instruction, my program deployed on a Spark standalone
cluster will not run properly.

Based on the source code, this instruction should be changed to:
./bin/spark-class org.apache.spark.deploy.Client [client-options] launch \
   <cluster-url> <application-jar-url> <main-class> \
   [application-options]

That is to say, [client-options] must be placed ahead of launch.


Re: Standard preprocessing/scaling

2014-05-29 Thread dataginjaninja
I do see the issue with centering sparse data. Actually, the centering is less
important than scaling by the standard deviation; not having unit variance
causes the convergence issues and long runtimes.

Will RowMatrix compute the variance of a column?
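
If it helps, a minimal sketch of getting per-column variances from a
RowMatrix (assuming computeColumnSummaryStatistics() is available in 1.0;
sc is a SparkContext and the rows below are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical data: each row is one feature vector.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 40.0)))

val summary = new RowMatrix(rows).computeColumnSummaryStatistics()
println(summary.mean)       // per-column means
println(summary.variance)   // per-column variances; sqrt of these gives the std dev to scale by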



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Standard-preprocessing-scaling-tp6826p6849.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275
(https://github.com/apache/spark/pull/275) is included in? I am running
rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables
when trying to use them in pyspark.

*The error I get:*
14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
aol
14/05/29 15:44:48 INFO ParseDriver: Parse Completed
14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI
thrift:
14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection
attempt.
14/05/29 15:44:49 INFO metastore: Connected to metastore.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql
    return self.hiveql(hqlQuery)
  File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql
    return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
  File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
: java.lang.RuntimeException: Unsupported dataType: timestamp

*The table I loaded:*
DROP TABLE IF EXISTS aol; 
CREATE EXTERNAL TABLE aol (
userid STRING,
query STRING,
query_time TIMESTAMP,
item_rank INT,
click_url STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/tmp/data/aol';

*The pyspark commands:*
from pyspark.sql import HiveContext
hctx = HiveContext(sc)
results = hctx.hql("SELECT COUNT(*) FROM aol").collect()






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Timestamp support in v1.0

2014-05-29 Thread Andrew Ash
I can confirm that the commit is included in the 1.0.0 release candidates
(it was committed before branch-1.0 split off from master), but I can't
confirm that it works in PySpark.  Generally the Python and Java interfaces
lag a little behind the Scala interface to Spark, but we're working to keep
that diff much smaller going forward.

Can you try the same thing in Scala?


On Thu, May 29, 2014 at 8:54 AM, dataginjaninja rickett.stepha...@gmail.com
 wrote:

 Can anyone verify which rc  [SPARK-1360] Add Timestamp Support for SQL #275
 https://github.com/apache/spark/pull/275   is included in? I am running
 rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables
 when trying to use them in pyspark.

 *The error I get:
 *
 14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
 aol
 14/05/29 15:44:48 INFO ParseDriver: Parse Completed
 14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI
 thrift:
 14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection
 attempt.
 14/05/29 15:44:49 INFO metastore: Connected to metastore.
 Traceback (most recent call last):
   File stdin, line 1, in module
   File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 189, in hql
 return self.hiveql(hqlQuery)
   File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 183, in hiveql
 return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
   File
 /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py,
 line 537, in __call__
   File
 /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line
 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
 : java.lang.RuntimeException: Unsupported dataType: timestamp

 *The table I loaded:*
 DROP TABLE IF EXISTS aol;
 CREATE EXTERNAL TABLE aol (
 userid STRING,
 query STRING,
 query_time TIMESTAMP,
 item_rank INT,
 click_url STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 LOCATION '/tmp/data/aol';

 *The pyspark commands:*
 from pyspark.sql import HiveContext
 hctx = HiveContext(sc)
 results = hctx.hql("SELECT COUNT(*) FROM aol").collect()






 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Patrick Wendell
+1

I spun up a few EC2 clusters and ran my normal audit checks. Tests
passing, sigs, CHANGES and NOTICE look good

Thanks TD for helping cut this RC!

On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 +1

 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
 Ran current version of one of my applications on 1-node pseudocluster
 (sorry, unable to test on full cluster).
 yarn-cluster mode
 Ran regression tests.

 Thanks
 Kevin


 On 05/28/2014 09:55 PM, Krishna Sankar wrote:

 +1
 Pulled & built on MacOS X, EC2 Amazon Linux
 Ran test programs on OS X, 5 node c3.4xlarge cluster
 Cheers
 k/


 On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski
 andykonwin...@gmail.comwrote:

 +1
 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote:

 +1

 Tested apps with standalone client mode and yarn cluster and client

 modes.

 Xiangrui

 On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:

 Pulled down, compiled, and tested examples on OS X and ubuntu.
 Deployed app we are building on spark and poured data through it.

 +1

 Sean


 On May 26, 2014, at 8:39 AM, Tathagata Das 

 tathagata.das1...@gmail.com

 wrote:

 Please vote on releasing the following candidate as Apache Spark

 version 1.0.0!

 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
 SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849

 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):


 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

 The release files, including signatures, digests, etc. can be found

 at:

 http://people.apache.org/~tdas/spark-1.0.0-rc11/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1019/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 ==> Call toSeq on the result to restore old behavior




Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Thanks for reporting this!

https://issues.apache.org/jira/browse/SPARK-1964
https://github.com/apache/spark/pull/913

If you could test out that PR and see if it fixes your problems I'd really
appreciate it!

Michael


On Thu, May 29, 2014 at 9:09 AM, Andrew Ash and...@andrewash.com wrote:

 I can confirm that the commit is included in the 1.0.0 release candidates
 (it was committed before branch-1.0 split off from master), but I can't
 confirm that it works in PySpark.  Generally the Python and Java interfaces
 lag a little behind the Scala interface to Spark, but we're working to keep
 that diff much smaller going forward.

 Can you try the same thing in Scala?


 On Thu, May 29, 2014 at 8:54 AM, dataginjaninja 
 rickett.stepha...@gmail.com
  wrote:

  Can anyone verify which rc  [SPARK-1360] Add Timestamp Support for SQL
 #275
  https://github.com/apache/spark/pull/275   is included in? I am
 running
  rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables
  when trying to use them in pyspark.
 
  *The error I get:
  *
  14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
  aol
  14/05/29 15:44:48 INFO ParseDriver: Parse Completed
  14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI
  thrift:
  14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next
 connection
  attempt.
  14/05/29 15:44:49 INFO metastore: Connected to metastore.
  Traceback (most recent call last):
File stdin, line 1, in module
File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 189, in hql
  return self.hiveql(hqlQuery)
File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 183, in hiveql
  return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
File
 
 /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py,
  line 537, in __call__
File
  /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py,
 line
  300, in get_return_value
  py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
  : java.lang.RuntimeException: Unsupported dataType: timestamp
 
  *The table I loaded:*
  DROP TABLE IF EXISTS aol;
  CREATE EXTERNAL TABLE aol (
  userid STRING,
  query STRING,
  query_time TIMESTAMP,
  item_rank INT,
  click_url STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LOCATION '/tmp/data/aol';
 
  *The pyspark commands:*
  from pyspark.sql import HiveContext
  hctx = HiveContext(sc)
  results = hctx.hql("SELECT COUNT(*) FROM aol").collect()
 
 
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 



Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Yes, I get the same error:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
14/05/29 16:53:40 INFO deprecation: mapred.input.dir.recursive is
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/05/29 16:53:40 INFO deprecation: mapred.max.split.size is deprecated.
Instead, use mapreduce.input.fileinputformat.split.maxsize
14/05/29 16:53:40 INFO deprecation: mapred.min.split.size is deprecated.
Instead, use mapreduce.input.fileinputformat.split.minsize
14/05/29 16:53:40 INFO deprecation: mapred.min.split.size.per.rack is
deprecated. Instead, use
mapreduce.input.fileinputformat.split.minsize.per.rack
14/05/29 16:53:40 INFO deprecation: mapred.min.split.size.per.node is
deprecated. Instead, use
mapreduce.input.fileinputformat.split.minsize.per.node
14/05/29 16:53:40 INFO deprecation: mapred.reduce.tasks is deprecated.
Instead, use mapreduce.job.reduces
14/05/29 16:53:40 INFO deprecation:
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
mapreduce.reduce.speculative
14/05/29 16:53:41 WARN HiveConf: DEPRECATED: Configuration property
hive.metastore.local no longer has any effect. Make sure to provide a valid
value for hive.metastore.uris if you are connecting to a remote metastore.
14/05/29 16:53:42 WARN HiveConf: DEPRECATED: Configuration property
hive.metastore.local no longer has any effect. Make sure to provide a valid
value for hive.metastore.uris if you are connecting to a remote metastore.
hc: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@36482814

scala> val results = hc.hql("SELECT COUNT(*) FROM aol").collect()
14/05/29 16:53:46 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
aol
14/05/29 16:53:46 INFO ParseDriver: Parse Completed
14/05/29 16:53:47 INFO metastore: Trying to connect to metastore with URI th
14/05/29 16:53:47 INFO metastore: Waiting 1 seconds before next connection
attempt.
14/05/29 16:53:48 INFO metastore: Connected to metastore.
java.lang.RuntimeException: Unsupported dataType: timestamp
at scala.sys.package$.error(package.scala:27)




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6853.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Michael,

Will I have to rebuild after adding the change? Thanks



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6855.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Darn, I was hoping just to sneak it in that file. I am not the only person
working on the cluster; if I rebuild it, that means I have to redeploy
everything to all the nodes as well. So I cannot do that ... today. If someone
else doesn't beat me to it, I can rebuild at another time.



-
Cheers,

Stephanie
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6857.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Yes, you'll need to download the code from that PR and reassemble Spark
(sbt/sbt assembly).


On Thu, May 29, 2014 at 10:02 AM, dataginjaninja 
rickett.stepha...@gmail.com wrote:

 Michael,

 Will I have to rebuild after adding the change? Thanks



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6855.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
You should be able to get away with only doing it locally.  This bug is
happening during analysis which only occurs on the driver.


On Thu, May 29, 2014 at 10:17 AM, dataginjaninja 
rickett.stepha...@gmail.com wrote:

 Darn, I was hoping just to sneak it in that file. I am not the only person
 working on the cluster; if I rebuild it that means I have to redeploy
 everything to all the nodes as well.  So I cannot do that ... today. If
 someone else doesn't beat me to it. I can rebuild at another time.



 -
 Cheers,

 Stephanie
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6857.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-29 Thread Patrick Wendell
[tl;dr stable API's are important - sorry, this is slightly meandering]

Hey - just wanted to chime in on this as I was travelling. Sean, you
bring up great points here about the velocity and stability of Spark.
Many projects have fairly customized semantics around what versions
actually mean (HBase is a good, if somewhat hard-to-comprehend,
example).

What the 1.X label means to Spark is that we are willing to guarantee
stability for Spark's core API. This is something Spark has actually been
doing for a while already (we've made few or no breaking changes to the
Spark core API for several years), and we want to codify it for application
developers. In this regard Spark has made a bunch
of changes to enforce the integrity of our API's:

- We went through and clearly annotated internal or experimental
API's. This was a huge project-wide effort and included Scaladoc and
several other components to make it clear to users.
- We implemented automated byte-code verification of all proposed pull
requests that they don't break public API's. Pull requests after 1.0
will fail if they break API's that are not explicitly declared private
or experimental.

I can't possibly emphasize enough the importance of API stability.
What we want to avoid is the Hadoop approach. Candidly, Hadoop does a
poor job on this. There really isn't a well defined stable API for any
of the Hadoop components, for a few reasons:

1. Hadoop projects don't do any rigorous checking that new patches
don't break API's. Of course, this results in regular API breaks and a
poor understanding of what is a public API.
2. In several cases it's not possible to do basic things in Hadoop
without using deprecated or private API's.
3. There is significant vendor fragmentation of API's.

The main focus of the Hadoop vendors is making consistent cuts of the
core projects work together (HDFS/Pig/Hive/etc) - so API breaks are
sometimes considered fixed as long as the other projects work around
them (see [1]). We also regularly need to do archaeology (see [2]) and
directly interact with Hadoop committers to understand what API's are
stable and in which versions.

One goal of Spark is to deal with the pain of inter-operating with
Hadoop so that application writers don't have to. We'd like to retain the
property that if you build an application against the (well defined,
stable) Spark API's right now, you'll be able to run it across many
Hadoop vendors and versions for the entire Spark 1.X release cycle.

Writing apps against Hadoop can be very difficult... consider how much
more engineering effort we spent maintaining YARN support than Mesos
support. There are many factors, but one is that Mesos has a single,
narrow, stable API. We've never had to make a change in Mesos due to
an API change, for several years. YARN, on the other hand, currently has
at least 3 API versions, which are all binary
incompatible. We'd like to offer apps the ability to build against
Spark's API and just let us deal with it.

As more vendors package Spark, I'd like to see us put tools in the
upstream Spark repo that do validation for vendor packages of Spark,
so that we don't end up with fragmentation. Of course, vendors can
enhance the API and are encouraged to, but we need a kernel of API's
that vendors must maintain (think POSIX) to be considered compliant
with Apache Spark. I believe some other projects like OpenStack have
done this to avoid fragmentation.

- Patrick

[1] https://issues.apache.org/jira/browse/MAPREDUCE-5830
[2] 
http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AD0/dEWFFYTRgYw/s1600/output-file.png

On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan mri...@gmail.com wrote:
 So I think I need to clarify a few things here - particularly since
 this mail went to the wrong mailing list and a much wider audience
 than I intended it for :-)


 Most of the issues I mentioned are internal implementation detail of
 spark core : which means, we can enhance them in future without
 disruption to our userbase (ability to support large number of
 input/output partitions. Note: this is of order of 100k input and
 output partitions with uniform spread of keys - very rarely seen
 outside of some crazy jobs).

 Some of the issues I mentioned would require DeveloperApi changes -
 which are not user exposed : they would impact developer use of these
 api's - which are mostly internally provided by spark. (Like fixing
 blocks > 2G would require a change to the Serializer api)

 A smaller fraction might require interface changes - note, I am
 referring specifically to configuration changes (removing/deprecating
 some) and possibly newer options to submit/env, etc - I dont envision
 any programming api change itself.
 The only api change we did was from Seq -> Iterable - which is
 actually to address some of the issues I mentioned (join/cogroup).

 Remaining are bugs which need to be addressed or the feature
 removed/enhanced like shuffle consolidation.

 There might be 

[RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Hello everyone,

The vote on Spark 1.0.0 RC11 passes with 13 +1 votes, one 0 vote and no
-1 vote.

Thanks to everyone who tested the RC and voted. Here are the totals:

+1: (13 votes)
Matei Zaharia*
Mark Hamstra*
Holden Karau
Nick Pentreath*
Will Benton
Henry Saputra
Sean McNamara*
Xiangrui Meng*
Andy Konwinski*
Krishna Sankar
Kevin Markey
Patrick Wendell*
Tathagata Das*

0: (1 vote)
Ankur Dave*

-1: (0 vote)

Please hold off announcing Spark 1.0.0 until Apache Software Foundation
makes the press release tomorrow. Thank you very much for your cooperation.

TD


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Let me put in my +1 as well!

This voting is now closed, and it successfully passes with 13 +1
votes and one 0 vote.
Thanks to everyone who tested the RC and voted. Here are the totals:

+1: (13 votes)
Matei Zaharia*
Mark Hamstra*
Holden Karau
Nick Pentreath*
Will Benton
Henry Saputra
Sean McNamara*
Xiangrui Meng*
Andy Konwinski*
Krishna Sankar
Kevin Markey
Patrick Wendell*
Tathagata Das*

0: (1 vote)
Ankur Dave*

-1: (0 vote)

* = binding

Please hold off announcing Spark 1.0.0 until Apache Software
Foundation makes the press release tomorrow. Thank you very much for
your cooperation.

TD

On Thu, May 29, 2014 at 9:14 AM, Patrick Wendell pwend...@gmail.com wrote:
 +1

 I spun up a few EC2 clusters and ran my normal audit checks. Tests
 passing, sigs, CHANGES and NOTICE look good

 Thanks TD for helping cut this RC!

 On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 +1

 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
 Ran current version of one of my applications on 1-node pseudocluster
 (sorry, unable to test on full cluster).
 yarn-cluster mode
 Ran regression tests.

 Thanks
 Kevin


 On 05/28/2014 09:55 PM, Krishna Sankar wrote:

 +1
  Pulled & built on MacOS X, EC2 Amazon Linux
 Ran test programs on OS X, 5 node c3.4xlarge cluster
 Cheers
 k/


 On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski
 andykonwin...@gmail.comwrote:

 +1
 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote:

 +1

 Tested apps with standalone client mode and yarn cluster and client

 modes.

 Xiangrui

 On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:

 Pulled down, compiled, and tested examples on OS X and ubuntu.
 Deployed app we are building on spark and poured data through it.

 +1

 Sean


 On May 26, 2014, at 8:39 AM, Tathagata Das 

 tathagata.das1...@gmail.com

 wrote:

 Please vote on releasing the following candidate as Apache Spark

 version 1.0.0!

 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
 SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849

 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):


 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

 The release files, including signatures, digests, etc. can be found

 at:

 http://people.apache.org/~tdas/spark-1.0.0-rc11/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1019/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore old behavior




Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Matei Zaharia
Yup, congrats all. The most impressive thing is the number of contributors to 
this release — with over 100 contributors, it’s becoming hard to even write the 
credits. Look forward to the Apache press release tomorrow.

Matei

On May 29, 2014, at 1:33 PM, Patrick Wendell pwend...@gmail.com wrote:

 Congrats everyone! This is a huge accomplishment!
 
 On Thu, May 29, 2014 at 1:24 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
 Hello everyone,
 
 The vote on Spark 1.0.0 RC11 passes with 13 +1 votes, one 0 vote and no
 -1 vote.
 
 Thanks to everyone who tested the RC and voted. Here are the totals:
 
 +1: (13 votes)
 Matei Zaharia*
 Mark Hamstra*
 Holden Karau
 Nick Pentreath*
 Will Benton
 Henry Saputra
 Sean McNamara*
 Xiangrui Meng*
 Andy Konwinski*
 Krishna Sankar
 Kevin Markey
 Patrick Wendell*
 Tathagata Das*
 
 0: (1 vote)
 Ankur Dave*
 
 -1: (0 vote)
 
 Please hold off announcing Spark 1.0.0 until Apache Software Foundation
 makes the press release tomorrow. Thank you very much for your cooperation.
 
 TD



Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Andy Konwinski
Yes great work all. Special thanks to Patrick (and TD) for excellent
leadership!
On May 29, 2014 5:39 PM, Usman Ghani us...@platfora.com wrote:

 Congrats everyone. Really pumped about this.


 On Thu, May 29, 2014 at 2:57 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:

  Congrats guys! Another milestone for Apache Spark indeed =)
 
  - Henry
 
  On Thu, May 29, 2014 at 2:08 PM, Matei Zaharia matei.zaha...@gmail.com
  wrote:
   Yup, congrats all. The most impressive thing is the number of
  contributors to this release — with over 100 contributors, it’s becoming
  hard to even write the credits. Look forward to the Apache press release
  tomorrow.
  
   Matei
  
   On May 29, 2014, at 1:33 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  
   Congrats everyone! This is a huge accomplishment!
  
   On Thu, May 29, 2014 at 1:24 PM, Tathagata Das
   tathagata.das1...@gmail.com wrote:
   Hello everyone,
  
   The vote on Spark 1.0.0 RC11 passes with 13 +1 votes, one 0 vote
  and no
   -1 vote.
  
   Thanks to everyone who tested the RC and voted. Here are the totals:
  
   +1: (13 votes)
   Matei Zaharia*
   Mark Hamstra*
   Holden Karau
   Nick Pentreath*
   Will Benton
   Henry Saputra
   Sean McNamara*
   Xiangrui Meng*
   Andy Konwinski*
   Krishna Sankar
   Kevin Markey
   Patrick Wendell*
   Tathagata Das*
  
   0: (1 vote)
   Ankur Dave*
  
   -1: (0 vote)
  
   Please hold off announcing Spark 1.0.0 until Apache Software
 Foundation
   makes the press release tomorrow. Thank you very much for your
  cooperation.
  
   TD
  
 



How spark partitions data when creating a table like create table xxx as select * from xxx

2014-05-29 Thread qingyang li
Hi spark-developers, I am using Shark/Spark, and I am puzzled by the
following questions; I cannot find any info about them on the web, so I am
asking you.
1.  How does Spark partition data in memory when creating a table using
create table a tblproperties("shark.cache"="memory") as select * from
table b?  In other words, how many RDDs will be created, and how does Spark
decide the number of RDDs?

2.  How does Spark partition data on Tachyon when creating a table using
create table a tblproperties("shark.cache"="tachyon") as select * from
table b?  In other words, how many files will be created, and how does Spark
decide the number of files?
I also found this Tachyon setting, tachyon.user.default.block.size.byte:
what does it mean?  Could I set it to control the size of each file?
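
Not a Shark-specific answer, but a minimal sketch of the plain-Spark side of
the question (the assumption here is that the number of in-memory blocks or
Tachyon files follows the number of partitions of the underlying RDD; the
path and partition count below are made up):

import org.apache.spark.storage.StorageLevel

// Hypothetical: the RDD backing "table b", read from HDFS.
val rdd = sc.textFile("hdfs:///path/to/table_b")
println(rdd.partitions.length)        // partition count, inherited from the input splits

// Force a specific number of partitions before caching, if desired.
val repartitioned = rdd.repartition(16)
repartitioned.persist(StorageLevel.MEMORY_ONLY)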

Thanks for any guidance.