Re: HA support for Spark

2014-12-11 Thread Tathagata Das
Spark Streaming essentially does this by saving the DAG of DStreams, which
can deterministically regenerate the DAG of RDDs upon recovery from
failure. Along with that the progress information (which batches have
finished, which batches are queued, etc.) is also saved, so that upon
recovery the system can restart from where it was before failure. This was
conceptually easy to do because the RDDs are very deterministically
generated in every batch. Extending this to a very general Spark program
with arbitrary RDD computations is definitely conceptually possible but not
that easy to do.
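
For concreteness, a minimal sketch of that recovery pattern on the application
side (the checkpoint directory, socket source, and batch interval below are
placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  // Placeholder path; in practice this should live on a fault-tolerant store such as HDFS.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  // Builds a fresh context and registers the DStream DAG; runs only on the first start.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointRecoverySketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a driver failure, getOrCreate rebuilds the DStream DAG and the batch
    // progress from the checkpoint instead of calling createContext again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}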

On Wed, Dec 10, 2014 at 7:34 PM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Right, perhaps we also need to preserve some DAG information? I am wondering
 if there is any work around this.


 Sandy Ryza sandy.r...@cloudera.com
 2014-12-11 01:34

 To: Jun Feng Liu/China/IBM@IBMCN
 cc: Reynold Xin r...@databricks.com, dev@spark.apache.org
 Subject: Re: HA support for Spark

 I think that if we were able to maintain the full set of created RDDs as
 well as some scheduler and block manager state, it would be enough for most
 apps to recover.

 On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

  Well, it should not be mission impossible, considering there are so many HA
  solutions in existence today. I would be interested to know if there are any
  specific difficulties.
 
  Best Regards
 
 
  *Jun Feng Liu*
  IBM China Systems & Technology Laboratory in Beijing
  --
  Phone: 86-10-82452683
  E-mail: liuj...@cn.ibm.com

  BLD 28, ZGC Software Park
  No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
  China
 
 
 
 
 
  Reynold Xin r...@databricks.com
  2014/12/10 16:30

  To: Jun Feng Liu/China/IBM@IBMCN
  cc: dev@spark.apache.org
  Subject: Re: HA support for Spark
 
 
 
 
  This would be plausible for specific purposes such as Spark streaming or
  Spark SQL, but I don't think it is doable for general Spark driver since
 it
  is just a normal JVM process with arbitrary program state.
 
  On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu liuj...@cn.ibm.com
 wrote:
 
   Do we have any high availability support at the Spark driver level? For
   example, can the Spark driver move to another node and continue execution
   when a failure happens? I can see that RDD checkpointing can help to
   serialize the status of RDDs. I can imagine loading the checkpoint from
   another node when an error happens, but it seems we would lose track of all
   task status and even the executor information maintained in the Spark
   context. I am not sure if there is any existing stuff I can leverage to do
   that. Thanks for any suggestions.
  
   Best Regards
  
  
   *Jun Feng Liu*
   IBM China Systems & Technology Laboratory in Beijing
   --
   Phone: 86-10-82452683
   E-mail: liuj...@cn.ibm.com

   BLD 28, ZGC Software Park
   No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
   China
  
  
  
  
  
 
 




Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Madhu
+1 (non-binding)

Built and tested on Windows 7:

cd apache-spark
git fetch
git checkout v1.2.0-rc2
sbt assembly
[warn]
...
[warn]
[success] Total time: 720 s, completed Dec 11, 2014 8:57:36 AM

dir assembly\target\scala-2.10\spark-assembly-1.2.0-hadoop1.0.4.jar
110,361,054 spark-assembly-1.2.0-hadoop1.0.4.jar

Ran some of my 1.2 code successfully.
Reviewed some of the docs; they look good.
spark-shell.cmd works as expected.

Env details:
sbtconfig.txt:
-Xmx1024M
-XX:MaxPermSize=256m
-XX:ReservedCodeCacheSize=128m

sbt --version
sbt launcher version 0.13.1




--
Madhu
https://www.linkedin.com/in/msiddalingaiah

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HA support for Spark

2014-12-11 Thread Jun Feng Liu
Interesting, so you are saying the StreamingContext checkpoint can regenerate the DAG?

 
Best Regards
 
Jun Feng Liu
IBM China Systems & Technology Laboratory in Beijing



Phone: 86-10-82452683 
E-mail: liuj...@cn.ibm.com


BLD 28,ZGC Software Park 
No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 
China 
 

 



Tathagata Das tathagata.das1...@gmail.com
2014/12/11 20:20

To: Jun Feng Liu/China/IBM@IBMCN
cc: Sandy Ryza sandy.r...@cloudera.com, dev@spark.apache.org, Reynold Xin r...@databricks.com
Subject: Re: HA support for Spark






Spark Streaming essentially does this by saving the DAG of DStreams, which 
can deterministically regenerate the DAG of RDDs upon recovery from 
failure. Along with that the progress information (which batches have 
finished, which batches are queued, etc.) is also saved, so that upon 
recovery the system can restart from where it was before failure. This was 
conceptually easy to do because the RDDs are very deterministically 
generated in every batch. Extending this to a very general Spark program 
with arbitrary RDD computations is definitely conceptually possible but 
not that easy to do.

On Wed, Dec 10, 2014 at 7:34 PM, Jun Feng Liu liuj...@cn.ibm.com wrote:
Right, perhaps we also need to preserve some DAG information? I am wondering
if there is any work around this.


Sandy Ryza sandy.r...@cloudera.com
2014-12-11 01:34

To: Jun Feng Liu/China/IBM@IBMCN
cc: Reynold Xin r...@databricks.com, dev@spark.apache.org
Subject: Re: HA support for Spark





I think that if we were able to maintain the full set of created RDDs as
well as some scheduler and block manager state, it would be enough for most
apps to recover.

On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Well, it should not be mission impossible, considering there are so many HA
 solutions in existence today. I would be interested to know if there are any
 specific difficulties.

 Best Regards


 *Jun Feng Liu*
 IBM China Systems & Technology Laboratory in Beijing
 --
 Phone: 86-10-82452683
 E-mail: liuj...@cn.ibm.com

 BLD 28, ZGC Software Park
 No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
 China





 Reynold Xin r...@databricks.com
 2014/12/10 16:30

 To: Jun Feng Liu/China/IBM@IBMCN
 cc: dev@spark.apache.org
 Subject: Re: HA support for Spark




 This would be plausible for specific purposes such as Spark streaming or
 Spark SQL, but I don't think it is doable for general Spark driver since 
it
 is just a normal JVM process with arbitrary program state.

 On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu liuj...@cn.ibm.com 
wrote:

  Do we have any high availability support at the Spark driver level? For
  example, can the Spark driver move to another node and continue execution
  when a failure happens? I can see that RDD checkpointing can help to
  serialize the status of RDDs. I can imagine loading the checkpoint from
  another node when an error happens, but it seems we would lose track of all
  task status and even the executor information maintained in the Spark
  context. I am not sure if there is any existing stuff I can leverage to do
  that. Thanks for any suggestions.
 
  Best Regards
 
 
  *Jun Feng Liu*
  IBM China Systems & Technology Laboratory in Beijing
  --
  Phone: 86-10-82452683
  E-mail: liuj...@cn.ibm.com

  BLD 28, ZGC Software Park
  No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
  China
 
 
 
 
 






Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sean Owen
Signatures and checksums are OK. License and notice still looks fine.
The plain-vanilla source release compiles with Maven 3.2.1 and passes
tests, on OS X 10.10 + Java 8.

On Wed, Dec 10, 2014 at 9:08 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.0!

 The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1055/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.2.0!

 The vote is open until Saturday, December 13, at 21:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What justifies a -1 vote for this release? ==
 This vote is happening relatively late into the QA period, so
 -1 votes should only occur for significant regressions from
 1.0.2. Bugs already present in 1.1.X, minor
 regressions, or bugs related to new features will not block this
 release.

 == What default changes should I be aware of? ==
 1. The default value of spark.shuffle.blockTransferService has been
 changed to netty
 -- Old behavior can be restored by switching to nio

 2. The default value of spark.shuffle.manager has been changed to sort.
 -- Old behavior can be restored by setting spark.shuffle.manager to hash.
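
 For anyone who wants to keep exercising the old code paths while testing the
 RC, a minimal sketch of restoring both previous defaults from application
 code (the same keys can also be set in spark-defaults.conf or passed with
 --conf on spark-submit):

 import org.apache.spark.{SparkConf, SparkContext}

 // Sketch: flip the two 1.2 shuffle defaults described above back to the old values.
 val conf = new SparkConf()
   .setAppName("legacy-shuffle-settings")
   .set("spark.shuffle.blockTransferService", "nio") // 1.2 default: netty
   .set("spark.shuffle.manager", "hash")             // 1.2 default: sort
 val sc = new SparkContext(conf)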

 == How does this differ from RC1 ==
 This has fixes for a handful of issues identified - some of the
 notable fixes are:

 [Core]
 SPARK-4498: Standalone Master can fail to recognize completed/failed
 applications

 [SQL]
 SPARK-4552: Query for empty parquet table in spark sql hive get
 IllegalArgumentException
 SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
 SPARK-4761: With JDBC server, set Kryo as default serializer and
 disable reference tracking
 SPARK-4785: When called with arguments referring column fields, PMOD throws 
 NPE

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Reynold Xin
+1

Tested on OS X.

On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.2.0!

 The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1055/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.2.0!

 The vote is open until Saturday, December 13, at 21:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What justifies a -1 vote for this release? ==
 This vote is happening relatively late into the QA period, so
 -1 votes should only occur for significant regressions from
 1.0.2. Bugs already present in 1.1.X, minor
 regressions, or bugs related to new features will not block this
 release.

 == What default changes should I be aware of? ==
 1. The default value of spark.shuffle.blockTransferService has been
 changed to netty
 -- Old behavior can be restored by switching to nio

 2. The default value of spark.shuffle.manager has been changed to sort.
 -- Old behavior can be restored by setting spark.shuffle.manager to
 hash.

 == How does this differ from RC1 ==
 This has fixes for a handful of issues identified - some of the
 notable fixes are:

 [Core]
 SPARK-4498: Standalone Master can fail to recognize completed/failed
 applications

 [SQL]
 SPARK-4552: Query for empty parquet table in spark sql hive get
 IllegalArgumentException
 SPARK-4753: Parquet2 does not prune based on OR filters on partition
 columns
 SPARK-4761: With JDBC server, set Kryo as default serializer and
 disable reference tracking
 SPARK-4785: When called with arguments referring column fields, PMOD
 throws NPE

 - Patrick

 -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Sandy Ryza
+1 (non-binding).  Tested on Ubuntu against YARN.

On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin r...@databricks.com wrote:

 +1

 Tested on OS X.

 On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.2.0!
 
  The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1055/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.2.0!
 
  The vote is open until Saturday, December 13, at 21:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.2.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What justifies a -1 vote for this release? ==
  This vote is happening relatively late into the QA period, so
  -1 votes should only occur for significant regressions from
  1.0.2. Bugs already present in 1.1.X, minor
  regressions, or bugs related to new features will not block this
  release.
 
  == What default changes should I be aware of? ==
  1. The default value of spark.shuffle.blockTransferService has been
  changed to netty
  -- Old behavior can be restored by switching to nio
 
  2. The default value of spark.shuffle.manager has been changed to
 sort.
  -- Old behavior can be restored by setting spark.shuffle.manager to
  hash.
 
  == How does this differ from RC1 ==
  This has fixes for a handful of issues identified - some of the
  notable fixes are:
 
  [Core]
  SPARK-4498: Standalone Master can fail to recognize completed/failed
  applications
 
  [SQL]
  SPARK-4552: Query for empty parquet table in spark sql hive get
  IllegalArgumentException
  SPARK-4753: Parquet2 does not prune based on OR filters on partition
  columns
  SPARK-4761: With JDBC server, set Kryo as default serializer and
  disable reference tracking
  SPARK-4785: When called with arguments referring column fields, PMOD
  throws NPE
 
  - Patrick
 
  -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Evaluation Metrics for Spark's MLlib

2014-12-11 Thread kidynamit
Hi, 

I would like to contribute to Spark's machine learning library by adding
evaluation metrics that would be used to gauge the accuracy of a model given
a certain feature set. In particular, I seek to contribute k-fold
cross-validation and the F-beta metric, among others, on top of the current
MLlib framework.

Please assist in steps I could take to contribute in this manner. 

Regards, 
kidynamit




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Alessandro Baretta
Michael & other Spark SQL junkies,

As I read through the Spark API docs, in particular those for the
org.apache.spark.sql package, I can't seem to find details about the Scala
classes representing the various SparkSQL DataTypes, for instance
DecimalType. I find DataType classes in org.apache.spark.sql.api.java, but
they don't seem to match the similarly named scala classes. For instance,
DecimalType is documented as having a nullary constructor, but if I try to
construct an instance of org.apache.spark.sql.DecimalType without any
parameters, the compiler complains about the lack of a precisionInfo field,
which I have discovered can be passed in as None. Where is all this stuff
documented?
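
Concretely, a minimal sketch of the constructions I mean (the fixed-precision
variant below is an assumption based on the constructor parameter the compiler
reports, not on anything documented):

import org.apache.spark.sql.DecimalType

// Compiles once precisionInfo is supplied explicitly as None (unlimited precision).
val unlimited = DecimalType(None)

// Assumption: a fixed precision/scale presumably goes through the same Option,
// using the PrecisionInfo case class from the catalyst types package.
import org.apache.spark.sql.catalyst.types.PrecisionInfo
val fixed = DecimalType(Some(PrecisionInfo(10, 2)))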

Alex


Re: Evaluation Metrics for Spark's MLlib

2014-12-11 Thread Joseph Bradley
Hi, I'd recommend starting by checking out the existing helper
functionality for these tasks.  There are helper methods to do K-fold
cross-validation in MLUtils:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
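
For example, a minimal sketch of doing K-fold evaluation by hand with that
helper (the model and the accuracy metric here are just placeholders):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// data: an RDD[LabeledPoint] prepared elsewhere
def averageAccuracy(data: RDD[LabeledPoint]): Double = {
  val folds = MLUtils.kFold(data, 5, 42) // Array of (training, validation) pairs
  val accuracies = folds.map { case (training, validation) =>
    val model = new LogisticRegressionWithLBFGS().run(training)
    val correct = validation.filter(p => model.predict(p.features) == p.label).count()
    correct.toDouble / validation.count()
  }
  accuracies.sum / accuracies.length
}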

The experimental spark.ml API in the Spark 1.2 release (in branch-1.2 and
master) has a CrossValidator class which does this more automatically:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

There are also a few evaluation metrics implemented:
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation
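
As a rough sketch of how those tie into the F-beta request for the binary
case (scoreAndLabels is assumed to come from whatever model is being
evaluated):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// scoreAndLabels: RDD of (predicted score, true label) pairs produced elsewhere
def report(scoreAndLabels: RDD[(Double, Double)]): Unit = {
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  println(s"Area under ROC = ${metrics.areaUnderROC()}")
  println(s"Area under PR  = ${metrics.areaUnderPR()}")
  // F-measure with beta = 0.5, computed at each candidate threshold
  metrics.fMeasureByThreshold(0.5).collect().foreach { case (threshold, f) =>
    println(s"threshold=$threshold F0.5=$f")
  }
}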

There definitely could be more metrics and/or better APIs to make it easier
to evaluate models on RDDs.  If you spot such cases, I'd recommend opening
up JIRAs for the new features or improvements to get some feedback before
sending PRs:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Hope this helps & looking forward to the contributions!
Joseph

On Thu, Dec 11, 2014 at 4:41 AM, kidynamit paul.mwanj...@gmail.com wrote:

 Hi,

 I would like to contribute to Spark's machine learning library by adding
 evaluation metrics that would be used to gauge the accuracy of a model given
 a certain feature set. In particular, I seek to contribute k-fold
 cross-validation and the F-beta metric, among others, on top of the current
 MLlib framework.

 Please assist in steps I could take to contribute in this manner.

 Regards,
 kidynamit




 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Tachyon in Spark

2014-12-11 Thread Andrew Ash
I'm interested in understanding this as well.  One of the main ways Tachyon
is supposed to realize performance gains without sacrificing durability is
by storing the lineage of data rather than full copies of it (similar to
Spark).  But if Spark isn't sending lineage information into Tachyon, then
I'm not sure how this isn't a durability concern.
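
For context, my understanding is that the integration that does exist today
goes through the block store rather than lineage; a minimal sketch of that
path (the Tachyon URL property and paths below are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: RDD blocks persisted OFF_HEAP are written to Tachyon by the block
// manager, but no lineage information is handed over to Tachyon itself.
val conf = new SparkConf()
  .setAppName("tachyon-offheap-sketch")
  .set("spark.tachyonStore.url", "tachyon://localhost:19998") // assumed local Tachyon master

val sc = new SparkContext(conf)
val data = sc.textFile("hdfs:///some/input") // placeholder path
data.persist(StorageLevel.OFF_HEAP)
println(data.count())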

On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Does Spark today really leverage Tachyon lineage to process data? It seems
 like the application should call the createDependency function in TachyonFS
 to create a new lineage node, but I did not find any place that calls it in
 the Spark code. Did I miss anything?

 Best Regards


 *Jun Feng Liu*
 IBM China Systems & Technology Laboratory in Beijing
 --
 Phone: 86-10-82452683
 E-mail: liuj...@cn.ibm.com

 BLD 28, ZGC Software Park
 No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
 China







Re: Tachyon in Spark

2014-12-11 Thread Reynold Xin
I don't think the lineage thing is even turned on in Tachyon - it was
mostly a research prototype, so I don't think it'd make sense for us to use
that.


On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash and...@andrewash.com wrote:

 I'm interested in understanding this as well.  One of the main ways Tachyon
 is supposed to realize performance gains without sacrificing durability is
 by storing the lineage of data rather than full copies of it (similar to
 Spark).  But if Spark isn't sending lineage information into Tachyon, then
 I'm not sure how this isn't a durability concern.

 On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

  Does Spark today really leverage Tachyon lineage to process data? It seems
  like the application should call the createDependency function in TachyonFS
  to create a new lineage node, but I did not find any place that calls it in
  the Spark code. Did I miss anything?
 
  Best Regards
 
 
  *Jun Feng Liu*
  IBM China Systems & Technology Laboratory in Beijing
  --
  Phone: 86-10-82452683
  E-mail: liuj...@cn.ibm.com

  BLD 28, ZGC Software Park
  No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
  China
 
 
 
 
 



running the Terasort example

2014-12-11 Thread Tim Harsch
Hi all,
I just joined the list, so I don't have a message history that would allow
me to reply to this post:
http://apache-spark-developers-list.1001551.n3.nabble.com/Terasort-example-td9284.html

I am interested in running the terasort example.  I cloned the repo
https://github.com/ehiggs/spark and did checkout of the terasort branch.
In the above referenced post Ewan gives the example

# Generate 1M 100 byte records:
  ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in


I don't see a "run-example" in that repo. I'm sure I am missing something
basic, or less likely, maybe some changes weren't pushed?

Thanks for any help,
Tim


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Cheng, Hao
Part of it can be found at:
https://github.com/apache/spark/pull/3429/files#diff-f88c3e731fcb17b1323b778807c35b38R34
 
Sorry, it's a yet-to-be-reviewed PR, but it should still be informative.

Cheng Hao

-Original Message-
From: Alessandro Baretta [mailto:alexbare...@gmail.com] 
Sent: Friday, December 12, 2014 6:37 AM
To: Michael Armbrust; dev@spark.apache.org
Subject: Where are the docs for the SparkSQL DataTypes?

Michael & other Spark SQL junkies,

As I read through the Spark API docs, in particular those for the 
org.apache.spark.sql package, I can't seem to find details about the Scala 
classes representing the various SparkSQL DataTypes, for instance DecimalType. 
I find DataType classes in org.apache.spark.sql.api.java, but they don't seem 
to match the similarly named scala classes. For instance, DecimalType is 
documented as having a nullary constructor, but if I try to construct an 
instance of org.apache.spark.sql.DecimalType without any parameters, the 
compiler complains about the lack of a precisionInfo field, which I have 
discovered can be passed in as None. Where is all this stuff documented?

Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Is there any document to explain how to build the hive jars for spark?

2014-12-11 Thread Yi Tian

Hi, all

We found some bugs in hive-0.12, but we cannot wait for the Hive community
to fix them.


We want to fix these bugs in our lab and build a new release that can be
recognized by Spark.


As we know, Spark depends on a special release of Hive, for example:

<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
</dependency>

The difference between org.spark-project.hive and org.apache.hive was
described by Patrick:


There are two differences:

1. We publish hive with a shaded protobuf dependency to avoid
conflicts with some Hadoop versions.
2. We publish a proper hive-exec jar that only includes hive packages.
The upstream version of hive-exec bundles a bunch of other random
dependencies in it which makes it really hard for third-party projects
to use it.

Is there any document to guide us on how to build the Hive jars for Spark?

Any help would be greatly appreciated.



Re: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Alessandro Baretta
Thanks. This is useful.

Alex

On Thu, Dec 11, 2014 at 4:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Part of it can be found at:

 https://github.com/apache/spark/pull/3429/files#diff-f88c3e731fcb17b1323b778807c35b38R34

  Sorry, it's a yet-to-be-reviewed PR, but it should still be informative.

 Cheng Hao

 -Original Message-
 From: Alessandro Baretta [mailto:alexbare...@gmail.com]
 Sent: Friday, December 12, 2014 6:37 AM
 To: Michael Armbrust; dev@spark.apache.org
 Subject: Where are the docs for the SparkSQL DataTypes?

  Michael & other Spark SQL junkies,

 As I read through the Spark API docs, in particular those for the
 org.apache.spark.sql package, I can't seem to find details about the Scala
 classes representing the various SparkSQL DataTypes, for instance
 DecimalType. I find DataType classes in org.apache.spark.sql.api.java, but
 they don't seem to match the similarly named scala classes. For instance,
 DecimalType is documented as having a nullary constructor, but if I try to
 construct an instance of org.apache.spark.sql.DecimalType without any
 parameters, the compiler complains about the lack of a precisionInfo field,
 which I have discovered can be passed in as None. Where is all this stuff
 documented?

 Alex