SparkR Reading Tables from Hive

2015-06-08 Thread Eskilson,Aleksander
Hi there,

I’m testing out the new SparkR-Hive interop right now. I’m noticing an apparent
disconnect between the Hive store my data is loaded into and the store that
sparkRHive.init() connects to. For example, in beeline:

0: jdbc:hive2://quickstart.cloudera:1 show databases;
+---+--+
| database_name |
+---+--+
| default   |
+---+--+
0: jdbc:hive2://quickstart.cloudera:1 show tables;
+---+--+
| tab_name  |
+---+--+
| my_table  |
+---+--+

But in sparkR:

> hqlContext <- sparkRHive.init(sc)
> showDF(sql(hqlContext, "show databases"))
+-+
| result  |
+-+
| default |
+-+
> showDF(tables(hqlContext, "default"))
+-----------+-------------+
| tableName | isTemporary |
+-----------+-------------+
+-----------+-------------+
> showDF(sql(hqlContext, "show tables"))
+-----------+-------------+
| tableName | isTemporary |
+-----------+-------------+
+-----------+-------------+

The data in my_table was landed into Hive from a CSV via kite-dataset. The
installation of Spark I’m working with was built separately and runs in
standalone mode. Could it be that sparkRHive.init() is picking up the wrong
address for the Hive metastore? How could I peer into the context to see what
the address is set to and, if it’s wrong, reset it?

Ultimately, I’d like to be able to read my_table from Hive into a SparkR
DataFrame, which ought to be possible with
> result <- sql(hqlContext, "SELECT * FROM my_table")
But this fails with:
org.apache.spark.sql.AnalysisException: no such table my_table; line 1 pos 14
which is expected, I suppose, since we don’t see the table in the listing above.

Any thoughts?

Thanks,
Alek Eskilson



Re: SparkR Reading Tables from Hive

2015-06-08 Thread Eskilson,Aleksander
Resolved, my hive-site.xml wasn’t in the conf folder. I can load tables into 
DataFrames as expected.
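
For reference, sparkRHive.init() wraps a JVM-side HiveContext, which reads
hive-site.xml from Spark's conf directory (or the classpath). A minimal Scala
sketch of the same flow, assuming conf/hive-site.xml now points at the shared
metastore and that my_table exists there:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: with hive-site.xml in $SPARK_HOME/conf, HiveContext talks to
// the shared metastore instead of creating a local Derby one.
val sc = new SparkContext(new SparkConf().setAppName("hive-metastore-check"))
val hiveContext = new HiveContext(sc)

hiveContext.sql("show tables").show()             // should now list my_table
val result = hiveContext.sql("SELECT * FROM my_table")
println(result.count())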

Thanks,
Alek



Re: SparkR Reading Tables from Hive

2015-06-08 Thread Shivaram Venkataraman
Thanks for the confirmation - I was just going to send a pointer to the
documentation that talks about hive-site.xml.
http://people.apache.org/~pwendell/spark-releases/latest/sql-programming-guide.html#hive-tables

Thanks
Shivaram

On Mon, Jun 8, 2015 at 1:57 PM, Eskilson,Aleksander 
alek.eskil...@cerner.com wrote:

  Resolved, my hive-site.xml wasn’t in the conf folder. I can load tables
 into DataFrames as expected.

  Thanks,
 Alek




Fwd: pull requests no longer closing by commit messages with closes #xxxx

2015-06-08 Thread Reynold Xin
FYI.

-- Forwarded message --
From: John Greet (GitHub Staff) supp...@github.com
Date: Mon, Jun 8, 2015 at 5:50 PM
Subject: Re: pull requests no longer closing by commit messages with
closes #xxxx
To: Reynold Xin r...@databricks.com


Hi Reynold,

The problem here is that the commits closing those pull requests were
fetched by our mirroring process (which doesn't have permission to close
issues) rather than pushed by a user in the apache GitHub organization.

Usually the repository receives regular, I assume automated, pushes to its
master branch, but there was a gap in those pushes between 2 PM PDT on June
5th and 1:16 PM PDT on June 7th. This happened at least once before, back in
November. Now that those pushes have resumed, pull requests are closing
normally once again.

Let us know if you have any other questions.

Cheers,
John



 I'm a committer on Apache Spark (the most active open source project in
the data space). We use GitHub as the primary way to accept contributions.
We use a custom merge script to merge pull requests rather than GitHub's
merge button in order to preserve a linear commit history. Part of the
merge script relies on the closes #xxxx feature to close the
corresponding pull requests.

 I noticed recently that pull requests are no longer automatically closed,
even if the commits are merged with the message closes #xxxx. Here are
two recent examples:

 https://github.com/apache/spark/pull/6670
 https://github.com/apache/spark/pull/6689

 Can you take a look at what's going on? Thanks.


Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-08 Thread Denny Lee
+1

On Mon, Jun 8, 2015 at 17:51 Wang, Daoyuan daoyuan.w...@intel.com wrote:

 +1

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Wednesday, June 03, 2015 1:47 PM
 To: dev@spark.apache.org
 Subject: Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

  Hi all - a tiny nit from the last e-mail. The tag is v1.4.0-rc4. The exact
  commit and all other information is correct. (Thanks to Shivaram, who
  pointed this out.)

 On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.4.0!
 
  The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  22596c534a38cfdda91aef18aa9037ab101e4251
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.0]
  https://repository.apache.org/content/repositories/orgapachespark-
  /
  [published as version: 1.4.0-rc4]
  https://repository.apache.org/content/repositories/orgapachespark-1112
  /
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs
  /
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Saturday, June 06, at 05:00 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not
  release this package because ...
 
  To learn more about Apache Spark, please see http://spark.apache.org/
 
  == What has changed since RC3 ==
  In addition to many smaller fixes, three blocker issues were fixed:
  4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
  metadataHive get constructed too early
  6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
  78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be
  singleton
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by taking a
  Spark 1.3 workload and running on this release candidate, then reporting
  any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period, so -1 votes
  should only occur for significant regressions from 1.3.1. Bugs already
  present in 1.3.X, minor regressions, or bugs related to new features will
  not block this release.





RE: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-08 Thread Wang, Daoyuan
+1




Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-08 Thread saurfang
+1

Built for Hadoop 2.4. Ran a few jobs on YARN and tested spark.sql.unsafe,
whose performance seems great!






[SparkSQL ] What is Exchange in physical plan for ?

2015-06-08 Thread invkrh
Hi,

DataFrame.explain() shows the physical plan of a query. I noticed there are
a lot of `Exchange`s in it, like below:

Project [period#20L,categoryName#0,regionName#10,action#15,list_id#16L]
 ShuffledHashJoin [region#18], [regionCode#9], BuildRight
  Exchange (HashPartitioning [region#18], 12)
   Project [categoryName#0,list_id#16L,period#20L,action#15,region#18]
ShuffledHashJoin [refCategoryID#3], [category#17], BuildRight
 Exchange (HashPartitioning [refCategoryID#3], 12)
  Project [categoryName#0,refCategoryID#3]
   PhysicalRDD
[categoryName#0,familyName#1,parentRefCategoryID#2,refCategoryID#3],
MapPartitionsRDD[5] at mapPartitions at SQLContext.scala:439
 Exchange (HashPartitioning [category#17], 12)
  Project [timestamp_sec#13L AS
period#20L,category#17,region#18,action#15,list_id#16L]
   PhysicalRDD
[syslog#12,timestamp_sec#13L,timestamp_usec#14,action#15,list_id#16L,category#17,region#18,expiration_time#19],
MapPartitionsRDD[16] at map at SQLContext.scala:394
  Exchange (HashPartitioning [regionCode#9], 12)
   Project [regionName#10,regionCode#9]
PhysicalRDD
[cityName#4,countryCode#5,countryName#6,dptCode#7,dptName#8,regionCode#9,regionName#10,zipCode#11],
MapPartitionsRDD[11] at mapPartitions at SQLContext.scala:439

I also found its class:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala.

So what does it mean?

Thank you.

Hao.






RE: [SparkSQL ] What is Exchange in physical plan for ?

2015-06-08 Thread Cheng, Hao
It means the data shuffling, and its arguments also show the partitioning 
strategy.
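
As a concrete (hypothetical) illustration in Scala: a join on a key that
neither input is already partitioned by makes the planner insert an Exchange
under each side to repartition its rows before the join.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Small sketch with made-up data, not taken from the thread above.
val sc = new SparkContext(new SparkConf().setAppName("exchange-demo"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")
val orders = sc.parallelize(Seq((1, 9.99), (2, 5.00))).toDF("id", "amount")

// The hash join needs both sides partitioned on "id", so the physical plan
// shows an Exchange (HashPartitioning [id#...], N) under each input, where
// N is spark.sql.shuffle.partitions (200 by default).
users.join(orders, users("id") === orders("id")).explain()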




Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-08 Thread Patrick Wendell
Hi All,

Thanks for the continued voting! I'm going to leave this thread open
for another few days to continue to collect feedback.

- Patrick




[ml] Why all model classes are final?

2015-06-08 Thread Peter Rudenko
Hi, previously all the models in ml package were private to package, so 
if i need to customize some models i inherit them in org.apache.spark.ml 
package in my project. But now new models 
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala#L46) 
are final classes. So if i need to customize 1 line or so, i need to 
redefine the whole class. Any reasons to do so? As a developer,i 
understand all the risks of using Developer/Alpha API. That's why i'm 
using spark, because it provides a building blocks that i could easily 
customize and combine for my need.


Thanks,
Peter Rudenko




Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-08 Thread Imran Rashid
Hi Mike,

all good questions, let me take a stab at answering them:

1. Event Logs + Stages:

It's normal for stages to get skipped if they are shuffle map stages which
get read multiple times. E.g., here's a little example program I wrote
earlier to demonstrate this: d3 doesn't need to be re-shuffled since it's
read with the same partitioner each time. So skipping stages in this way is
a good thing:

val partitioner = new org.apache.spark.HashPartitioner(10)
val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x }.partitionBy(partitioner)
(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }.partitionBy(partitioner)
  println(idx + " --- " + otherData.join(d3).count())
}

If you run this and look in the UI, you'll see that all jobs except for
the first one have one stage that is skipped. You will also see this in
the log:

15/06/08 10:52:37 INFO DAGScheduler: Parents of final stage: List(Stage 12,
Stage 13)

15/06/08 10:52:37 INFO DAGScheduler: Missing parents: List(Stage 13)

Admittedly that is not very clear, but that is sort of indicating to you
that the DAGScheduler first created stage 12 as a necessary step, and then
later on changed its mind by realizing that everything it needed for stage
12 already existed, so there was nothing to do.


2. Extracting Event Log Information

Maybe you are interested in SparkListener? Unfortunately, I don't know of a
good blog post describing it; hopefully the docs are clear ...
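
For example, a minimal listener sketch that emits one delimited line per
completed task, close to the columnar format asked about below (stdout is
just a stand-in for a real sink; field names come from TaskInfo and
TaskMetrics):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Prints: finishTime, taskId, host, run time, GC time, deser time, ser time
class CsvTaskListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    val metrics = Option(taskEnd.taskMetrics)   // may be null for failed tasks
    val runTime   = metrics.map(_.executorRunTime).getOrElse(0L)
    val gcTime    = metrics.map(_.jvmGCTime).getOrElse(0L)
    val deserTime = metrics.map(_.executorDeserializeTime).getOrElse(0L)
    val serTime   = metrics.map(_.resultSerializationTime).getOrElse(0L)
    println(Seq(info.finishTime, info.taskId, info.host,
                runTime, gcTime, deserTime, serTime).mkString(","))
  }
}

// usage: sc.addSparkListener(new CsvTaskListener)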

3. Time Metrics in Spark Event Log

This is a great question.  I *think* the only exception is that t_gc is
really overlapped with t_exec.  So I think you should really expect

(t2 - t1) >= (t_ser + t_deser + t_exec)

I am not 100% sure about this, though.  I'd be curious whether that
constraint was ever violated.


As for your question on shuffle read vs. shuffle write time -- I wouldn't
necessarily expect the same stage to have times for both shuffle read and
shuffle write -- in the simplest case, you'll have shuffle write times in
one, and shuffle read times in the next one.  But even taking that into
account, there is a difference in the way they work and are measured.
Shuffle read operations are pipelined, and the way we measure shuffle read,
it's just how much time is spent *waiting* for network transfer.  It could
be that there is no (measurable) wait time b/c the next blocks are fetched
before they are needed.  Shuffle writes occur in the normal task execution
thread, though, so we (try to) measure all of it.


On Sun, Jun 7, 2015 at 11:12 PM, Mike Hynes 91m...@gmail.com wrote:

 Hi Patrick and Akhil,

 Thank you both for your responses. This is a bit of an extended email,
 but I'd like to:
 1. Answer your (Patrick) note about the missing stages since the IDs
 do (briefly) appear in the event logs
 2. Ask for advice/experience with extracting information from the
 event logs in a columnar, delimiter-separated format.
 3. Ask about the time metrics reported in the event logs; currently,
 the elapsed time for a task does not equal the sum of the times for
 its components

 1. Event Logs + Stages:
 =

 As I said before, in the Spark logs (the log4j-configurable ones from
 the driver), I only see references to some stages, where the stage IDs
 are not arithmetically increasing.
 see reference to *every* stage, although not all stages will have
 tasks associated with them.

 For instance, to examine the actual stages that have tasks, you can
 see missing stages:
 # grep -E 'Event:SparkListenerTaskEnd' app.log \
 #   | grep -Eo 'Stage ID:[[:digit:]]+'  \
 #   | sort -n|uniq | head -n 5
 Stage ID:0
 Stage ID:1
 Stage ID:10
 Stage ID:11
 Stage ID:110

 However, these missing stages *do* appear in the event logs as Stage
 IDs in the jobs submitted, i.e: for
 # grep -E 'Event:SparkListenerJobStart' app.log | grep -Eo 'Stage
 IDs:\[.*\]' | head -n 5
 Stage IDs:[0,1,2]
 Stage IDs:[5,3,4]
 Stage IDs:[6,7,8]
 Stage IDs:[9,10,11]
 Stage IDs:[12,13,14]

 I do not know if this amounts to a bug, since I am not familiar with
 the scheduler in detail. The stages have seemingly been created
 somewhere in the DAG, but then have no associated tasks and never
 appear again.

 2. Extracting Event Log Information
 
 Currently we are running scalability tests, and are finding very poor
 scalability for certain block matrix algorithms. I would like to have
 finer detail about the communication time and bandwidth when data is
 transferred between nodes.

 I would really just like to have a file with nothing but task info in
 a format such as:
 timestamp (ms), task ID, hostname, execution time (ms), GC time (ms), ...
 0010294, 1, slave-1, 503, 34, ...
 0010392, 2, slave-2, 543, 32, ...
 and similarly for jobs/stages/rdd_memory/shuffle output/etc.

 I have extracted the relevant time fields from the spark event logs
 with a sed script, but I wonder if 

[sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-08 Thread Eron Wright

The deeplearning4j framework provides a variety of distributed, neural 
network-based learning algorithms, including convolutional nets, deep 
auto-encoders, deep-belief nets, and recurrent nets.  We’re working on 
integration with the Spark ML pipeline, leveraging the developer API.   This 
announcement is to share some code and get feedback from the Spark community.

The integration code is located in the dl4j-spark-ml module in the 
deeplearning4j repository.

Major aspects of the integration work:

- ML algorithms. To bind the dl4j algorithms to the ML pipeline, we developed
  a new classifier and a new unsupervised learning estimator.
- ML attributes. We strove to interoperate well with other pipeline
  components. ML Attributes are column-level metadata enabling information
  sharing between pipeline components. See here how the classifier reads
  label metadata from a column provided by the new StringIndexer. (A small
  sketch of reading such metadata follows this list.)
- Large binary data. It is challenging to work with large binary data in
  Spark. An effective approach is to leverage PrunedScan and to carefully
  control partition sizes. Here we explored this with a custom data source
  based on the new relation API.
- Column-based record readers. Here we explored how to construct rows from a
  Hadoop input split by composing a number of column-level readers, with
  pruning support.
- UDTs. With Spark SQL it is possible to introduce new data types. We
  prototyped an experimental Tensor type, here.
- Spark Package. We developed a spark package to make it easy to use the dl4j
  framework in spark-shell and with spark-submit. See the
  deeplearning4j/dl4j-spark-ml repository for useful snippets involving the
  sbt-spark-package plugin.
- Example code. Examples demonstrate how the standardized ML API simplifies
  interoperability, such as with label preprocessing and feature scaling. See
  the deeplearning4j/dl4j-spark-ml-examples repository for an expanding set
  of example pipelines.
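
As a small illustration of the ML-attributes point above (a sketch, assuming
a DataFrame df whose label column was produced by StringIndexer, which
attaches nominal-attribute metadata to the column):

import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

// Recover the number of classes from column-level metadata instead of
// rescanning the data -- the kind of lookup a downstream classifier can do.
def numClasses(df: DataFrame, labelCol: String = "label"): Option[Int] =
  Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute => nominal.getNumValues
    case _ => None // no nominal metadata attached to this column
  }
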
Hope this proves useful to the community as we transition to exciting new 
concepts in Spark SQL and Spark ML.   Meanwhile, we have Spark working with 
multiple GPUs on AWS and we're looking forward to optimizations that will speed 
neural net training even more. 

Eron Wright
Contributor | deeplearning4j.org